Toward static test flakiness prediction: a feasibility study
Proceedings of the 5th International Workshop on Machine Learning Techniques for Software Quality Evolution
https://doi.org/10.1145/3472674.3473981
Abstract
Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While researchers have attempted to define approaches for detecting and addressing test flakiness, most of them suffer from scalability issues. This limitation has recently been targeted with machine learning solutions that predict the flakiness of tests from a set of static and dynamic metrics, avoiding the re-execution of tests. Recognizing the effort spent so far, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We propose a feasibility study on 72 projects of the iDFlakies dataset and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify whether the observed differences remain significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating that classification approaches exploiting those factors could be built to predict test flakiness.
CCS CONCEPTS • Software and its engineering → Software testing and debugging; Empirical software validation.
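As a hedged illustration of the two-step study design in the abstract, the sketch below first compares each static metric between flaky and non-flaky tests (Mann-Whitney U is one common non-parametric choice, assumed here), then fits a multivariate logistic regression to check whether the differences persist when the metrics are considered together. The CSV file and column names are hypothetical placeholders, not the paper's actual dataset or tooling.

```python
# Minimal sketch (assumptions: a CSV with one row per test, statically computed
# metric columns, and a binary "is_flaky" label; column names are hypothetical).
import pandas as pd
import statsmodels.api as sm
from scipy.stats import mannwhitneyu

data = pd.read_csv("static_test_metrics.csv")  # hypothetical input file
metrics = ["loc", "cyclomatic_complexity", "assertion_count", "halstead_volume"]

# Step 1: univariate comparison of each metric between flaky and non-flaky tests.
flaky, stable = data[data.is_flaky == 1], data[data.is_flaky == 0]
for m in metrics:
    stat, p = mannwhitneyu(flaky[m], stable[m], alternative="two-sided")
    print(f"{m}: U={stat:.1f}, p={p:.4f}")

# Step 2: multivariate logistic regression to check whether the differences
# remain significant when the metrics are considered together.
X = sm.add_constant(data[metrics])
model = sm.Logit(data["is_flaky"], X).fit(disp=False)
print(model.summary())  # coefficients and p-values; odds ratios via exp(coef)
```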
Related papers
Proceedings of the 17th International Conference on Mining Software Repositories, 2020
Flaky tests are tests whose outcomes are non-deterministic. Despite the recent research activity on this topic, no effort has been made to understand the vocabulary of flaky tests. This work proposes to automatically classify tests as flaky or not based on their vocabulary. Static classification of flaky tests is important, for example, to detect the introduction of flaky tests and to search for flaky tests after they are introduced in regression test suites. We evaluated the performance of various machine learning algorithms to solve this problem. We constructed a data set of flaky and non-flaky tests by running every test case, in a set of 64k tests, 100 times (6.4 million test executions). We then used machine learning techniques on the resulting data set to predict which tests are flaky from their source code. Based on features such as counts of stemmed tokens extracted from source code identifiers, we achieved an F-measure of 0.95 for the identification of flaky tests. The best prediction performance was obtained with Random Forest and Support Vector Machines. In terms of the code identifiers that are most strongly associated with test flakiness, we noted that job, action, and services are commonly associated with flaky tests. Overall, our results provide initial yet strong evidence that static detection of flaky tests is effective.
CCS CONCEPTS • Software and its engineering → Software testing and debugging.
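A rough sketch of the vocabulary-based pipeline described above: split identifiers into stemmed tokens, build count features, and train a Random Forest. The tokenization details and the toy data loader are simplified assumptions, not the authors' exact setup.

```python
# Sketch of a vocabulary-based flakiness classifier (simplified assumptions:
# each sample is the raw source of one test method; stemming via NLTK).
import re
from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()

def tokenize(source: str):
    # Split camelCase/snake_case identifiers into stemmed word tokens.
    words = re.findall(r"[A-Za-z][a-z]+|[A-Z]+(?![a-z])", source)
    return [stemmer.stem(w.lower()) for w in words]

def load_dataset():
    # Hypothetical placeholder: replace with real test-method sources and labels.
    sources = ["void testJobQueue() { service.poll(); assertTrue(done); }",
               "void testSum() { assertEquals(4, add(2, 2)); }"] * 50
    labels = [1, 0] * 50   # 1 = flaky, 0 = non-flaky
    return sources, labels

test_sources, labels = load_dataset()
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(test_sources)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, stratify=labels)

clf = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```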
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021
A test case that intermittently passes or fails when run against the same version of source code and test code is said to be flaky. The presence of flaky tests wastes testing time and effort. The most popular approach in industry to detect flakiness is ReRun. The idea behind ReRun is very simple: failing test cases are re-executed many times, looking for inconsistencies in the output. Despite its simplicity, the ReRun strategy is very expensive both in terms of time and computational resources. This is particularly true in contexts where thousands of test cases are executed on a daily basis. Reducing the rerunning overhead is, thus, of utmost importance. This paper presents SHAKER, an open-source tool for detecting flakiness in time-constrained tests by adding noise to the execution environment. The main idea behind SHAKER is to add stressing tasks that compete with the test execution for resources (CPU or memory). SHAKER is available as a GitHub Actions workflow that can be seamlessly integrated with any GitHub project. Alternatively, SHAKER can also be used via its Command Line Interface. In our evaluation, SHAKER was able to discover more flaky tests than ReRun and in a faster way (fewer re-executions); moreover, our approach revealed tens of new flaky tests that went undetected by ReRun even after 50 re-executions. Thanks to its flexibility and ease of use, we believe that SHAKER can be useful for both practitioners and researchers.
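The core idea is easy to picture: run the suite while competing workloads saturate the CPU, then compare outcomes against a quiet run. The sketch below is not SHAKER itself (which ships as a GitHub Actions workflow and CLI for Java projects); it is a hedged, simplified illustration of the stress-injection idea, with pytest assumed as a stand-in for any test runner.

```python
# Hedged sketch of stress-based flakiness detection: busy-loop workers compete
# with the test run for CPU, which can surface timing-dependent flaky tests.
import multiprocessing as mp
import subprocess

def burn_cpu():
    while True:           # busy loop; terminated by the parent process
        _ = sum(i * i for i in range(10_000))

def run_suite():
    # Assumed test command; replace with the project's actual runner.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True).returncode

if __name__ == "__main__":
    baseline = run_suite()                      # quiet environment
    workers = [mp.Process(target=burn_cpu) for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    try:
        stressed = run_suite()                  # noisy environment
    finally:
        for w in workers:
            w.terminate()
    if baseline != stressed:
        print("Suite outcome changed under load: likely flaky tests present.")
```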
Brazilian Symposium on Systematic and Automated Software Testing
Regression testing is an important phase for delivering software with quality. However, flaky tests hamper the evaluation of test results and can increase costs. This is because a flaky test may pass or fail non-deterministically, and properly identifying the flakiness of a test requires rerunning the test suite multiple times. To cope with this challenge, approaches have been proposed based on prediction models and machine learning. Existing approaches based on the test case vocabulary may be context-sensitive and prone to overfitting, presenting low performance when executed in a cross-project scenario. To overcome these limitations, we investigate the use of test smells as predictors of flaky tests. We conducted an empirical study to understand whether test smells perform well as a classifier for predicting flakiness in the cross-project context, and analyzed the information gain of each test smell. We also compared the test smell-based approach with the vocabulary-based one. As a result, we obtained a classifier with reasonable performance (Random Forest, 0.83) for predicting flakiness in the testing phase. This classifier performed better than the vocabulary-based model for cross-project prediction. The Assertion Roulette and Sleepy Test smell types are the ones associated with the best information gain values.
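To make the feature design concrete, the sketch below treats per-test smell indicators (e.g., Assertion Roulette, Sleepy Test) as binary features, trains a Random Forest, and ranks the smells by mutual information as a stand-in for information gain. The smell columns, data file, and cross-validation setup are hypothetical simplifications (the paper evaluates cross-project splits), not the study's artifacts.

```python
# Sketch: test smells as flakiness predictors, ranked by mutual information.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

data = pd.read_csv("test_smells.csv")          # hypothetical file: one row per test
smells = ["assertion_roulette", "sleepy_test", "eager_test",
          "mystery_guest", "resource_optimism"]  # binary smell indicators
X, y = data[smells], data["is_flaky"]

clf = RandomForestClassifier(n_estimators=300)
pred = cross_val_predict(clf, X, y, cv=10)      # 10-fold here; the paper uses cross-project splits
print("F1:", f1_score(y, pred))

# Rank smells by how much information each carries about flakiness.
for smell, mi in sorted(zip(smells, mutual_info_classif(X, y, discrete_features=True)),
                        key=lambda t: -t[1]):
    print(f"{smell}: {mi:.3f}")
```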
arXiv (Cornell University), 2023
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper we focus on predicting the type of fix that is required to remove flakiness and then repairing the test code on that basis. We do this for the subset of flaky tests whose root cause lies in the test code itself rather than in the production code. Our key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, in addition to informing testers, we augment a Large Language Model (LLM) like GPT with this extra knowledge to ask the LLM for repair suggestions. The results show that our suggested fix category labels significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests.
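The augmentation step can be pictured as simply injecting the predicted fix-category label into the repair prompt. The snippet below is a hedged sketch of that idea; predict_fix_category and the mentioned LLM client are hypothetical placeholders for the paper's trained classifier and for whatever API is in use, not interfaces taken from the paper.

```python
# Hedged sketch: guide an LLM repair request with a predicted fix category.
def predict_fix_category(test_code: str) -> str:
    # Hypothetical stand-in for the trained fix-category classifier.
    return "replace fixed sleep with explicit wait"

def build_repair_prompt(test_code: str) -> str:
    category = predict_fix_category(test_code)
    return (
        "The following JUnit test is flaky.\n"
        f"Predicted fix category: {category}\n"
        "Rewrite only the test code so that it is no longer flaky, "
        "applying a fix of that category.\n\n"
        f"{test_code}"
    )

# The prompt would then be sent to the project's LLM client (e.g., a chat
# completion call), which is deliberately left abstract here.
print(build_repair_prompt("@Test void fetchesUser() { Thread.sleep(500); /* ... */ }"))
```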
Proceedings of the 22nd International Conference on Software Engineering (ICSE '00), 2000
Applied Soft Computing, 2022
Unlike most other software quality attributes, testability cannot be evaluated solely based on the characteristics of the source code. The effectiveness of the test suite and the budget assigned to testing highly impact the testability of the code under test. The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. Therefore, testability can be measured based on the coverage and number of test cases provided by a test suite, considering the test budget. This paper offers a new equation to estimate testability from the size and coverage of a given test suite. The equation has been used to label 23,000 classes belonging to 110 Java projects with their testability measure. The labeled classes were vectorized using 262 metrics. The labeled vectors were fed into a family of supervised machine learning (regression) algorithms to predict testability in terms of the source code metrics. The regression models predicted testability with an R² of 0.68 and a mean squared error of 0.03, which is suitable in practice. Fifteen software metrics highly affecting testability prediction were identified using a feature importance analysis technique on the learned model. The proposed models improved the mean absolute error by 38%, compared with the related study on predicting branch coverage as a test criterion, thanks to the new criteria, metrics, and data used. As an application of testability prediction, it is demonstrated that automated refactoring of 42 smelly Java classes, targeted at improving the 15 influential software metrics, could elevate their testability by an average of 86.87%.
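A minimal sketch of the prediction stage, assuming a table of source-code metrics per class together with a testability label computed from coverage and test-suite size (the paper's exact labeling equation is not reproduced here): fit a regressor, report R² and MSE, and inspect feature importances.

```python
# Sketch: predict class testability from source-code metrics (labels assumed to
# come from a coverage/size-based equation, which is not reproduced here).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("class_metrics.csv")        # hypothetical: one row per Java class
X = data.drop(columns=["testability"])         # e.g., hundreds of source-code metrics
y = data["testability"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R2:", r2_score(y_te, pred), "MSE:", mean_squared_error(y_te, pred))

# Feature-importance analysis to surface the metrics that drive testability.
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:15]
for name, score in top:
    print(f"{name}: {score:.3f}")
```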
2021 IEEE/ACM 43rd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)
Evaluating software testability can assist software managers in optimizing testing budgets and identifying opportunities for refactoring. In this paper, we abandon the traditional approach of pursuing testability measurements based on the correlation between software metrics and test characteristics observed on past projects, e.g., the size, the organization, or the code coverage of the test cases. We propose a radically new approach that exploits automatic test generation and mutation analysis to quantify the amount of evidence about the relative hardness of identifying effective test cases. We introduce two novel evidence-based testability metrics, describe a prototype to compute them, and discuss initial findings on whether our measurements can reflect actual testability issues.
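The underlying measurement can be sketched very simply: generate tests automatically, run them against mutants, and read the resulting kill matrix as evidence of how hard effective tests are to obtain. The snippet below is a hedged toy illustration of that intuition, not the authors' metrics or prototype.

```python
# Toy sketch: mutation-analysis evidence of testability. Rows = generated tests,
# columns = mutants; True means the test kills the mutant. (Illustrative data only.)
import numpy as np

kill_matrix = np.array([
    [True,  False, False, True ],
    [False, False, False, True ],
    [True,  False, False, False],
])

mutation_score = kill_matrix.any(axis=0).mean()      # fraction of mutants killed
kills_per_mutant = kill_matrix.sum(axis=0)           # how many tests kill each mutant

# A low mutation score and mutants killed by few (or no) generated tests are
# evidence that effective test cases are hard to obtain, i.e., low testability.
print(f"mutation score: {mutation_score:.2f}")
print("kills per mutant:", kills_per_mutant.tolist())
```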
arXiv (Cornell University), 2023
Estimating software testability can crucially assist software managers in optimizing test budgets and software quality. In this paper, we propose a new approach that radically differs from the traditional approach of pursuing testability measurements based on software metrics, e.g., the size of the code or the complexity of the designs. Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases. In the paper, we elaborate on the intuitions and the methodological choices that underlie our proposal for estimating testability, introduce a technique and a prototype for concretely estimating testability accordingly, and discuss our findings from a set of experiments in which we compare the performance of our estimates both against and in combination with traditional software metrics. The results show that our testability estimates capture a complementary dimension of testability that can be synergistically combined with approaches based on software metrics to improve the accuracy of predictions.
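In the simplest reading, the "synergistic combination" mentioned above amounts to concatenating the evidence-based estimates with traditional code metrics as joint regression features. The sketch below illustrates that idea with random placeholder arrays standing in for both feature sets; it is not the paper's experimental setup.

```python
# Sketch: combine traditional code metrics with evidence-based testability
# estimates as joint features (arrays below are hypothetical placeholders).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

code_metrics = np.random.rand(200, 10)     # e.g., size/complexity metrics per class
evidence = np.random.rand(200, 2)          # e.g., mutation-based testability estimates
target = np.random.rand(200)               # testability label to be predicted

for name, X in [("metrics only", code_metrics),
                ("combined", np.hstack([code_metrics, evidence]))]:
    score = cross_val_score(LinearRegression(), X, target, cv=5, scoring="r2").mean()
    print(f"{name}: mean R2 = {score:.3f}")
```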
Assessment of Quality Software …
Before a program can fail, a software fault must be executed, that execution must alter the data state, and the incorrect data state must propagate to a state that results directly in an incorrect output. This paper describes a tool called PISCES, developed by Reliable Software Technologies Corporation, for predicting the probability that faults in a particular program location will accomplish all three of these steps and cause program failure. PISCES is a tool that is used during software verification and validation to predict a program's testability.
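A hedged back-of-the-envelope version of the execution/infection/propagation chain described above is to estimate each of the three probabilities from instrumented runs and multiply them, as in the toy sketch below; the counts are invented for illustration and are not PISCES output.

```python
# Toy sketch of the execution/infection/propagation reasoning behind PISCES.
# Counts come from (hypothetical) instrumented runs over a sample of inputs.
runs = 1_000
executed = 400       # runs that reach the fault location
infected = 120       # of those, runs where the data state is corrupted
propagated = 30      # of those, runs where the corruption reaches the output

p_execute = executed / runs
p_infect = infected / executed
p_propagate = propagated / infected

# Estimated probability that a fault at this location causes a visible failure.
p_failure = p_execute * p_infect * p_propagate
print(f"P(failure) ~= {p_failure:.3f}")    # 0.4 * 0.3 * 0.25 = 0.030
```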
2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Test smells attempt to capture design issues in test code that reduce its maintainability. Previous work found such smells to be highly common in automatically generated test cases, but based this result on specific static detection rules; although these are based on the original definition of "test smells", a recent empirical study showed that developers perceive them as overly strict and not representative of the maintainability and quality of test suites. This leads us to investigate how effective such test smell detection tools are on automatically generated test suites. In this paper, we build a dataset of 2,340 test cases automatically generated by EVOSUITE for 100 Java classes. We performed a multi-stage, cross-validated manual analysis to identify six types of test smells and label their instances. We benchmark the performance of two test smell detection tools: one widely used in prior work, and one recently introduced with the express goal of matching developer perceptions of test smells. Our results show that these test smell detection strategies poorly characterize the issues in automatically generated test suites; the older tool's detection strategies, especially, misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as-yet-uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice; and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
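To ground the discussion of detection strategies, the snippet below shows a crude static rule for one frequently cited smell, Assertion Roulette (multiple assertions without explanatory messages), written as a Python check over Java test source. It is a simplified assumption about how such rules work, not the logic of either benchmarked tool, which would parse the AST rather than use regular expressions.

```python
# Crude sketch of a static Assertion Roulette check over Java test source.
# Real detectors parse the AST; a regex over assert calls is a rough proxy.
import re

ASSERT_CALL = re.compile(r"\bassert(?:Equals|True|False|NotNull|Null)\s*\((.*?)\);", re.S)

def assertion_roulette(test_method_source: str, threshold: int = 2) -> bool:
    calls = ASSERT_CALL.findall(test_method_source)
    # Heuristic: treat a call as lacking a message when it has few top-level
    # arguments; simplified here to counting commas, which miscounts nested
    # calls but is adequate for a sketch.
    unexplained = [args for args in calls if args.count(",") < 2]
    return len(unexplained) >= threshold

example = """
@Test public void testUser() {
    assertEquals(expected.getName(), actual.getName());
    assertTrue(actual.isActive());
    assertNotNull(actual.getId());
}
"""
print(assertion_roulette(example))   # True: several assertions, none with a message
```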
