Toward static test flakiness prediction: a feasibility study
Proceedings of the 5th International Workshop on Machine Learning Techniques for Software Quality Evolution
https://doi.org/10.1145/3472674.3473981
Abstract
Flaky tests are tests that exhibit both passing and failing behavior when run against the same code. While researchers have attempted to define approaches for detecting and addressing test flakiness, most of them suffer from scalability issues. This limitation has recently been targeted with machine learning solutions that predict the flakiness of tests from a set of static and dynamic metrics, avoiding the re-execution of tests. Recognizing the effort spent so far, this paper takes the first steps toward an orthogonal view of the problem, namely the classification of flaky tests using only statically computable software metrics. We propose a feasibility study on 72 projects of the iDFlakies dataset and investigate the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells. First, we statistically assess those differences. Second, we build a logistic regression model to verify whether the observed differences remain significant when the metrics are considered together. The results show a relation between test flakiness and a number of test and production code factors, indicating that classification approaches exploiting those factors could be built to predict test flakiness.
CCS CONCEPTS • Software and its engineering → Software testing and debugging; Empirical software validation.
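As a hedged illustration of the two-step study design in the abstract, the sketch below first compares each static metric between flaky and non-flaky tests (Mann-Whitney U is one common non-parametric choice, assumed here), then fits a multivariate logistic regression to check whether the differences persist when the metrics are considered together. The CSV file and column names are hypothetical placeholders, not the paper's actual dataset or tooling.

```python
# Minimal sketch (assumptions: a CSV with one row per test, statically computed
# metric columns, and a binary "is_flaky" label; column names are hypothetical).
import pandas as pd
import statsmodels.api as sm
from scipy.stats import mannwhitneyu

data = pd.read_csv("static_test_metrics.csv")  # hypothetical input file
metrics = ["loc", "cyclomatic_complexity", "assertion_count", "halstead_volume"]

# Step 1: univariate comparison of each metric between flaky and non-flaky tests.
flaky, stable = data[data.is_flaky == 1], data[data.is_flaky == 0]
for m in metrics:
    stat, p = mannwhitneyu(flaky[m], stable[m], alternative="two-sided")
    print(f"{m}: U={stat:.1f}, p={p:.4f}")

# Step 2: multivariate logistic regression to check whether the differences
# remain significant when the metrics are considered together.
X = sm.add_constant(data[metrics])
model = sm.Logit(data["is_flaky"], X).fit(disp=False)
print(model.summary())  # coefficients and p-values; odds ratios via exp(coef)
```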
Related papers
Proceedings of the 17th International Conference on Mining Software Repositories, 2020
Flaky tests are tests whose outcomes are non-deterministic. Despite the recent research activity on this topic, no effort has been made to understand the vocabulary of flaky tests. This work proposes to automatically classify tests as flaky or not based on their vocabulary. Static classification of flaky tests is important, for example, to detect the introduction of flaky tests and to search for flaky tests after they are introduced in regression test suites. We evaluated the performance of various machine learning algorithms to solve this problem. We constructed a data set of flaky and non-flaky tests by running every test case, in a set of 64k tests, 100 times (6.4 million test executions). We then used machine learning techniques on the resulting data set to predict which tests are flaky from their source code. Based on features such as counts of stemmed tokens extracted from source code identifiers, we achieved an F-measure of 0.95 for the identification of flaky tests. The best prediction performance was obtained with Random Forest and Support Vector Machines. In terms of the code identifiers that are most strongly associated with test flakiness, we noted that job, action, and services are commonly associated with flaky tests. Overall, our results provide initial yet strong evidence that static detection of flaky tests is effective.
CCS CONCEPTS • Software and its engineering → Software testing and debugging.
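A rough sketch of the vocabulary-based pipeline described above: split identifiers into stemmed tokens, build count features, and train a Random Forest. The tokenization details and the toy data loader are simplified assumptions, not the authors' exact setup.

```python
# Sketch of a vocabulary-based flakiness classifier (simplified assumptions:
# each sample is the raw source of one test method; stemming via NLTK).
import re
from nltk.stem import PorterStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()

def tokenize(source: str):
    # Split camelCase/snake_case identifiers into stemmed word tokens.
    words = re.findall(r"[A-Za-z][a-z]+|[A-Z]+(?![a-z])", source)
    return [stemmer.stem(w.lower()) for w in words]

def load_dataset():
    # Hypothetical placeholder: replace with real test-method sources and labels.
    sources = ["void testJobQueue() { service.poll(); assertTrue(done); }",
               "void testSum() { assertEquals(4, add(2, 2)); }"] * 50
    labels = [1, 0] * 50   # 1 = flaky, 0 = non-flaky
    return sources, labels

test_sources, labels = load_dataset()
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(test_sources)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, stratify=labels)

clf = RandomForestClassifier(n_estimators=300).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```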
2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021
A test case that intermittently passes or fails when run against the same version of source code and test code is said to be flaky. The presence of flaky tests wastes testing time and effort. The most popular approach in industry to detect flakiness is ReRun. The idea behind ReRun is very simple: failing test cases are re-executed many times, looking for inconsistencies in the output. Despite its simplicity, the ReRun strategy is very expensive both in terms of time and computational resources. This is particularly true in contexts where thousands of test cases are executed on a daily basis. Reducing the rerunning overhead is, thus, of utmost importance. This paper presents SHAKER, an open-source tool for detecting flakiness in time-constrained tests by adding noise to the execution environment. The main idea behind SHAKER is to add stressing tasks that compete with the test execution for resources (CPU or memory). SHAKER is available as a GitHub Actions workflow that can be seamlessly integrated with any GitHub project. Alternatively, SHAKER can also be used via its Command Line Interface. In our evaluation, SHAKER was able to discover more flaky tests than ReRun and in a faster way (fewer re-executions); moreover, our approach revealed tens of new flaky tests that went undetected by ReRun even after 50 re-executions. Thanks to its flexibility and ease of use, we believe that SHAKER can be useful for both practitioners and researchers.
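The core idea is easy to picture: run the suite while competing workloads saturate the CPU, then compare outcomes against a quiet run. The sketch below is not SHAKER itself (which ships as a GitHub Actions workflow and CLI for Java projects); it is a hedged, simplified illustration of the stress-injection idea, with pytest assumed as a stand-in for any test runner.

```python
# Hedged sketch of stress-based flakiness detection: busy-loop workers compete
# with the test run for CPU, which can surface timing-dependent flaky tests.
import multiprocessing as mp
import subprocess

def burn_cpu():
    while True:           # busy loop; terminated by the parent process
        _ = sum(i * i for i in range(10_000))

def run_suite():
    # Assumed test command; replace with the project's actual runner.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True).returncode

if __name__ == "__main__":
    baseline = run_suite()                      # quiet environment
    workers = [mp.Process(target=burn_cpu) for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    try:
        stressed = run_suite()                  # noisy environment
    finally:
        for w in workers:
            w.terminate()
    if baseline != stressed:
        print("Suite outcome changed under load: likely flaky tests present.")
```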
Brazilian Symposium on Systematic and Automated Software Testing
Regression testing is an important phase for delivering software with quality. However, flaky tests hamper the evaluation of test results and can increase costs. This is because a flaky test may pass or fail non-deterministically, and properly identifying the flakiness of a test requires rerunning the test suite multiple times. To cope with this challenge, approaches have been proposed based on prediction models and machine learning. Existing approaches based on the test case vocabulary may be context-sensitive and prone to overfitting, presenting low performance when executed in a cross-project scenario. To overcome these limitations, we investigate the use of test smells as predictors of flaky tests. We conducted an empirical study to understand whether test smells perform well as a classifier for predicting flakiness in the cross-project context, and analyzed the information gain of each test smell. We also compared the test smell-based approach with the vocabulary-based one. As a result, we obtained a classifier with reasonable performance (Random Forest, 0.83) for predicting flakiness in the testing phase. This classifier performed better than the vocabulary-based model for cross-project prediction. The Assertion Roulette and Sleepy Test smell types are the ones associated with the best information gain values.
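To make the feature design concrete, the sketch below treats per-test smell indicators (e.g., Assertion Roulette, Sleepy Test) as binary features, trains a Random Forest, and ranks the smells by mutual information as a stand-in for information gain. The smell columns, data file, and cross-validation setup are hypothetical simplifications (the paper evaluates cross-project splits), not the study's artifacts.

```python
# Sketch: test smells as flakiness predictors, ranked by mutual information.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

data = pd.read_csv("test_smells.csv")          # hypothetical file: one row per test
smells = ["assertion_roulette", "sleepy_test", "eager_test",
          "mystery_guest", "resource_optimism"]  # binary smell indicators
X, y = data[smells], data["is_flaky"]

clf = RandomForestClassifier(n_estimators=300)
pred = cross_val_predict(clf, X, y, cv=10)      # 10-fold here; the paper uses cross-project splits
print("F1:", f1_score(y, pred))

# Rank smells by how much information each carries about flakiness.
for smell, mi in sorted(zip(smells, mutual_info_classif(X, y, discrete_features=True)),
                        key=lambda t: -t[1]):
    print(f"{smell}: {mi:.3f}")
```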
arXiv (Cornell University), 2023
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper we focus on predicting the type of fix that is required to remove flakiness and then repairing the test code on that basis. We do this for the subset of flaky tests whose root cause lies in the test code itself rather than in the production code. Our key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, in addition to informing testers, we augment a Large Language Model (LLM) like GPT with this extra knowledge to ask the LLM for repair suggestions. The results show that our suggested fix category labels significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests.
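The augmentation step can be pictured as simply injecting the predicted fix-category label into the repair prompt. The snippet below is a hedged sketch of that idea; predict_fix_category and the mentioned LLM client are hypothetical placeholders for the paper's trained classifier and for whatever API is in use, not interfaces taken from the paper.

```python
# Hedged sketch: guide an LLM repair request with a predicted fix category.
def predict_fix_category(test_code: str) -> str:
    # Hypothetical stand-in for the trained fix-category classifier.
    return "replace fixed sleep with explicit wait"

def build_repair_prompt(test_code: str) -> str:
    category = predict_fix_category(test_code)
    return (
        "The following JUnit test is flaky.\n"
        f"Predicted fix category: {category}\n"
        "Rewrite only the test code so that it is no longer flaky, "
        "applying a fix of that category.\n\n"
        f"{test_code}"
    )

# The prompt would then be sent to the project's LLM client (e.g., a chat
# completion call), which is deliberately left abstract here.
print(build_repair_prompt("@Test void fetchesUser() { Thread.sleep(500); /* ... */ }"))
```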
Proceedings of the 22nd International Conference on Software Engineering (ICSE '00), 2000
Applied Soft Computing, 2022
Unlike most other software quality attributes, testability cannot be evaluated solely based on the characteristics of the source code. The effectiveness of the test suite and the budget assigned to testing highly impact the testability of the code under test. The size of a test suite determines the test effort and cost, while the coverage measure indicates the test effectiveness. Therefore, testability can be measured based on the coverage and number of test cases provided by a test suite, considering the test budget. This paper offers a new equation to estimate testability from the size and coverage of a given test suite. The equation has been used to label 23,000 classes belonging to 110 Java projects with their testability measure. The labeled classes were vectorized using 262 metrics. The labeled vectors were fed into a family of supervised machine learning (regression) algorithms to predict testability in terms of the source code metrics. The regression models predicted testability with an R² of 0.68 and a mean squared error of 0.03, which is suitable in practice. Fifteen software metrics highly affecting testability prediction were identified using a feature importance analysis technique on the learned model. The proposed models improved the mean absolute error by 38%, compared with the related study on predicting branch coverage as a test criterion, thanks to the new criteria, metrics, and data used. As an application of testability prediction, it is demonstrated that automated refactoring of 42 smelly Java classes, targeted at improving the 15 influential software metrics, could elevate their testability by an average of 86.87%.
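A minimal sketch of the prediction stage, assuming a table of source-code metrics per class together with a testability label computed from coverage and test-suite size (the paper's exact labeling equation is not reproduced here): fit a regressor, report R² and MSE, and inspect feature importances.

```python
# Sketch: predict class testability from source-code metrics (labels assumed to
# come from a coverage/size-based equation, which is not reproduced here).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("class_metrics.csv")        # hypothetical: one row per Java class
X = data.drop(columns=["testability"])         # e.g., hundreds of source-code metrics
y = data["testability"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R2:", r2_score(y_te, pred), "MSE:", mean_squared_error(y_te, pred))

# Feature-importance analysis to surface the metrics that drive testability.
top = sorted(zip(X.columns, model.feature_importances_), key=lambda t: -t[1])[:15]
for name, score in top:
    print(f"{name}: {score:.3f}")
```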
2021 IEEE/ACM 43rd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)
Evaluating software testability can assist software managers in optimizing testing budgets and identifying opportunities for refactoring. In this paper, we abandon the traditional approach of pursuing testability measurements based on the correlation between software metrics and test characteristics observed on past projects, e.g., the size, the organization, or the code coverage of the test cases. We propose a radically new approach that exploits automatic test generation and mutation analysis to quantify the amount of evidence about the relative hardness of identifying effective test cases. We introduce two novel evidence-based testability metrics, describe a prototype to compute them, and discuss initial findings on whether our measurements can reflect actual testability issues.
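The underlying measurement can be sketched very simply: generate tests automatically, run them against mutants, and read the resulting kill matrix as evidence of how hard effective tests are to obtain. The snippet below is a hedged toy illustration of that intuition, not the authors' metrics or prototype.

```python
# Toy sketch: mutation-analysis evidence of testability. Rows = generated tests,
# columns = mutants; True means the test kills the mutant. (Illustrative data only.)
import numpy as np

kill_matrix = np.array([
    [True,  False, False, True ],
    [False, False, False, True ],
    [True,  False, False, False],
])

mutation_score = kill_matrix.any(axis=0).mean()      # fraction of mutants killed
kills_per_mutant = kill_matrix.sum(axis=0)           # how many tests kill each mutant

# A low mutation score and mutants killed by few (or no) generated tests are
# evidence that effective test cases are hard to obtain, i.e., low testability.
print(f"mutation score: {mutation_score:.2f}")
print("kills per mutant:", kills_per_mutant.tolist())
```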
arXiv (Cornell University), 2023
Estimating software testability can crucially assist software managers in optimizing test budgets and software quality. In this paper, we propose a new approach that radically differs from the traditional approach of pursuing testability measurements based on software metrics, e.g., the size of the code or the complexity of the designs. Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases. In the paper, we elaborate on the intuitions and the methodological choices that underlie our proposal for estimating testability, introduce a technique and a prototype for concretely estimating testability accordingly, and discuss our findings from a set of experiments in which we compare the performance of our estimates both against and in combination with traditional software metrics. The results show that our testability estimates capture a complementary dimension of testability that can be synergistically combined with approaches based on software metrics to improve the accuracy of predictions.
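In the simplest reading, the "synergistic combination" mentioned above amounts to concatenating the evidence-based estimates with traditional code metrics as joint regression features. The sketch below illustrates that idea with random placeholder arrays standing in for both feature sets; it is not the paper's experimental setup.

```python
# Sketch: combine traditional code metrics with evidence-based testability
# estimates as joint features (arrays below are hypothetical placeholders).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

code_metrics = np.random.rand(200, 10)     # e.g., size/complexity metrics per class
evidence = np.random.rand(200, 2)          # e.g., mutation-based testability estimates
target = np.random.rand(200)               # testability label to be predicted

for name, X in [("metrics only", code_metrics),
                ("combined", np.hstack([code_metrics, evidence]))]:
    score = cross_val_score(LinearRegression(), X, target, cv=5, scoring="r2").mean()
    print(f"{name}: mean R2 = {score:.3f}")
```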
Assessment of Quality Software …
Before a program can fail, a software fault must be executed, that execution must alter the data state, and the incorrect data state must propagate to a state that results directly in an incorrect output. This paper describes a tool called PISCES, developed by Reliable Software Technologies Corporation, for predicting the probability that faults in a particular program location will accomplish all three of these steps and cause program failure. PISCES is a tool that is used during software verification and validation to predict a program's testability.
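A hedged back-of-the-envelope version of the execution/infection/propagation chain described above is to estimate each of the three probabilities from instrumented runs and multiply them, as in the toy sketch below; the counts are invented for illustration and are not PISCES output.

```python
# Toy sketch of the execution/infection/propagation reasoning behind PISCES.
# Counts come from (hypothetical) instrumented runs over a sample of inputs.
runs = 1_000
executed = 400       # runs that reach the fault location
infected = 120       # of those, runs where the data state is corrupted
propagated = 30      # of those, runs where the corruption reaches the output

p_execute = executed / runs
p_infect = infected / executed
p_propagate = propagated / infected

# Estimated probability that a fault at this location causes a visible failure.
p_failure = p_execute * p_infect * p_propagate
print(f"P(failure) ~= {p_failure:.3f}")    # 0.4 * 0.3 * 0.25 = 0.030
```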
2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Test smells attempt to capture design issues in test code that reduce its maintainability. Previous work found such smells to be highly common in automatically generated test cases, but based this result on specific static detection rules; although these are based on the original definition of "test smells", a recent empirical study showed that developers perceive them as overly strict and not representative of the maintainability and quality of test suites. This leads us to investigate how effective such test smell detection tools are on automatically generated test suites. In this paper, we build a dataset of 2,340 test cases automatically generated by EVOSUITE for 100 Java classes. We performed a multi-stage, cross-validated manual analysis to identify six types of test smells and label their instances. We benchmark the performance of two test smell detection tools: one widely used in prior work, and one recently introduced with the express goal of matching developer perceptions of test smells. Our results show that these test smell detection strategies poorly characterize the issues in automatically generated test suites; the older tool's detection strategies, especially, misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives). We identify common patterns in these tests that can be used to improve the tools, refine and update the definition of certain test smells, and highlight as-yet-uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics to match development practice; and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
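To ground the discussion of detection strategies, the snippet below shows a crude static rule for one frequently cited smell, Assertion Roulette (multiple assertions without explanatory messages), written as a Python check over Java test source. It is a simplified assumption about how such rules work, not the logic of either benchmarked tool, which would parse the AST rather than use regular expressions.

```python
# Crude sketch of a static Assertion Roulette check over Java test source.
# Real detectors parse the AST; a regex over assert calls is a rough proxy.
import re

ASSERT_CALL = re.compile(r"\bassert(?:Equals|True|False|NotNull|Null)\s*\((.*?)\);", re.S)

def assertion_roulette(test_method_source: str, threshold: int = 2) -> bool:
    calls = ASSERT_CALL.findall(test_method_source)
    # Heuristic: treat a call as lacking a message when it has few top-level
    # arguments; simplified here to counting commas, which miscounts nested
    # calls but is adequate for a sketch.
    unexplained = [args for args in calls if args.count(",") < 2]
    return len(unexplained) >= threshold

example = """
@Test public void testUser() {
    assertEquals(expected.getName(), actual.getName());
    assertTrue(actual.isActive());
    assertNotNull(actual.getId());
}
"""
print(assertion_roulette(example))   # True: several assertions, none with a message
```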
