In the Marvel Universe, mutants known as the X-Men wield superhuman abilities ranging from shape-shifting to storm-summoning.
In the software universe, mutants may not bring the thunder, but they are no less marvelous. In 2014, Allen School professors René Just and Michael Ernst, along with their collaborators, demonstrated that mutants function as an effective substitute for real defects in software testing. Their work, which spawned a robust line of follow-on research over the ensuing decade, earned them the Most Influential Paper Award at the ACM International Conference on the Foundations of Software Engineering (FSE 2024) last month in Porto de Galinhas, Brazil.
Mutants are artificial defects (bugs) intentionally embedded throughout a program. If a test suite is good at detecting these artificial defects, it may be good at detecting real defects. Testing is an important element of the software development cycle; buggy code can be annoying, as when a video game glitches, or it can grind industries to a halt, as the world witnessed during the recent CrowdStrike incident. According to the Consortium for Information & Software Quality (CISQ), the costs associated with defective software surpassed $2 trillion in 2022 in the United States alone.
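For instance, a mutation tool might change a single relational operator in one statement. The snippet below is a hypothetical illustration of one such mutant, not code from the study:

    // Hypothetical illustration of a single mutant (not from the study).
    public class Account {
        // Original: a withdrawal may use up the entire balance.
        public boolean canWithdraw(double balance, double amount) {
            return amount <= balance;
        }

        // Mutant: the operator <= has been replaced with <, so withdrawing
        // the exact balance is wrongly rejected.
        public boolean canWithdrawMutated(double balance, double amount) {
            return amount < balance;
        }
    }

A test that withdraws the exact balance detects (“kills”) this mutant, because the original and mutated versions disagree on the result; a suite without such a test lets the mutant survive.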
Among the solutions CISQ emphasized in its report were “tools for understanding, finding and fixing deficiencies.” Mutants play an integral role in the development and evaluation of such tools. While the software community had historically assumed that such artificial defects were valid stand-ins for real ones, no one had empirically established that this was, indeed, the case.
“We can’t know what real errors might be in a program’s code, so researchers and practitioners relied on mutants as a proxy. But there was very little evidence to support that approach,” Ernst said. “So we decided to test the conventional wisdom and determine whether the practice held up under scrutiny.”
Ernst, Just and their colleagues applied this scrutiny through a series of experiments using 230,000 mutants and over 350 real defects contained in five open-source Java programs comprising 321,000 lines of code. To reconstruct the real defects, which had already been identified and fixed by developers, the researchers examined each program’s version history for bug-fixing commits. They then ran both developer-written and automatically generated test suites to ascertain how their ability to detect mutants in a program correlated with their ability to detect the real defects. Throughout, the researchers controlled for code coverage, the proportion of each program’s code executed during testing, which otherwise could confound the results.
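A test suite’s sensitivity to mutants is commonly summarized as a mutation score: the fraction of seeded mutants the suite detects. The sketch below shows that bookkeeping using hypothetical Mutant and TestSuite types, not the study’s actual tooling:

    import java.util.List;

    // Sketch: mutation score = detected ("killed") mutants / total mutants.
    // Mutant and TestSuite are hypothetical placeholder types.
    public class MutationScore {
        interface Mutant { }
        interface TestSuite { boolean detects(Mutant m); }

        static double score(List<Mutant> mutants, TestSuite suite) {
            long killed = mutants.stream().filter(suite::detects).count();
            return (double) killed / mutants.size();
        }
    }

The question the researchers asked was how well a score like this tracks a suite’s performance on the recovered real defects once code coverage is held constant.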
Those results revealed a statistically significant relationship between a test suite’s effectiveness at detecting mutants and its effectiveness at detecting real defects. But while the team’s findings confirmed the conventional wisdom in one respect, they upended it in another.
“Our findings validated the use of mutants in software test development,” said Just, who was first author of the paper and a postdoctoral researcher in the Allen School at the time of publication. “It also yielded a number of other new and practical insights — one being that a test suite’s ability to detect mutants is a better predictor of its performance on real defects than code coverage.”
Another of the paper’s insights was confirmation that a coupling effect exists between mutants and real defects. A complex defect is said to be coupled to a set of simpler defects when a test that detects the simpler ones also detects the complex one. While prior work had shown that such coupling exists between simple and complex mutants, it was unclear whether real defects were similarly coupled to simple mutants. The researchers found that this was, indeed, the case: 73% of the real defects were coupled to mutants. Based on an analysis of the 27% that were not, the team recommended a set of concrete approaches for improving mutation analysis and, by extension, the effectiveness of test suites.
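To make the coupling idea concrete, consider a hypothetical illustration (again, not code from the study) in which a test that kills a simple mutant also exposes a more involved real defect in the same expression:

    // Hypothetical illustration of coupling between a mutant and a real defect.
    public class Discount {
        // Correct behavior: orders of 100 or more items get a 10% discount.
        static double priceCorrect(int items, double unitPrice) {
            double total = items * unitPrice;
            return items >= 100 ? total * 0.9 : total;
        }

        // Simple mutant: the operator >= is replaced with >.
        static double priceMutant(int items, double unitPrice) {
            double total = items * unitPrice;
            return items > 100 ? total * 0.9 : total;
        }

        // More complex real defect: the threshold is mistakenly checked
        // against the total price rather than the item count.
        static double priceRealDefect(int items, double unitPrice) {
            double total = items * unitPrice;
            return total > 100 ? total * 0.9 : total;
        }
    }

A test asserting that 100 items at $0.50 each cost $45.00 kills the mutant, which skips the discount and returns $50.00, and it also exposes the real defect, which makes the same error; the two are coupled.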
In addition to Just and Ernst, co-authors of the paper include Allen School alum Darioush Jalali (M.S., ’14), now a software engineer at Ava Labs; then-Ph.D. student Laura Inozemtseva and professor Reid Holmes of the University of Waterloo, now a senior software engineer at Karius and a faculty member at the University of British Columbia, respectively; and University of Sheffield professor Gordon Fraser, now a faculty member at the University of Passau.