The State of Evidence Today

Looking back, we now realize how naive our definitions of convincing evidence were. Elegant, statistically strong, replicated evidence turns out to be harder to find than we ever thought, and moreover, often doesn’t satisfy the more relevant goals we have for our research.

Challenges to the Elegance of Studies

We’ve found that elegant studies may be persuasive to those with sufficient training in research, but can be hard to explain to practitioners because they generally introduce constraints and simplifications that make the context less representative of real development environments. For example, the Basili and Selby study applied the techniques under study only to “toy” problems, none more than 400 lines of code, in an artificial environment. This study is often cited, and although it has been the subject of many replications, it does not seem that any of them used substantially larger or more representative applications [Runeson et al. 2006]. Although this elegant study has made a strong contribution to our understanding of the strengths and weaknesses of different approaches for removing defects from code, it is perhaps not ideal for a substantial part of our thinking on this subject to come from relatively small code segments.

Challenges to Statistical Strength

There is surprisingly little consensus about what constitutes “strong” statistics for real-world problems. First, there is the issue of external validity: whether the measures that are being tested adequately reflect the real-world phenomena of interest. Showing statistical significance is of no help when the measures are meaningless. For example, Foss et al. show that commonly used assessment measures for effort estimation are fundamentally broken [Foss et al. 2003]. Worse, they argue that there is no single fix:

[I]t is futile to search for the Holy Grail: a single, simple-to-use, universal goodness-of-fit kind of metric, which can be applied with ease to compare (different methods).
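The measure Foss et al. examine is MMRE, the mean magnitude of relative error. To see how such a measure can mislead, consider the toy sketch below (in Python); the project counts and effort figures are our own invention, not data from their study. Because the relative error is taken against the actual value, an underestimate can never score worse than 1.0 while an overestimate’s penalty is unbounded, so MMRE can rank a model that is wrong by a factor of ten on every project above one that is only wrong by a factor of 2.5.

    # Toy illustration (hypothetical numbers): the asymmetry of MMRE.
    # MMRE = mean(|actual - predicted| / actual), so an underestimate's relative
    # error never exceeds 1.0, while an overestimate's error is unbounded.

    def mmre(actual, predicted):
        """Mean magnitude of relative error."""
        return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

    actual = [100.0, 100.0, 100.0, 100.0]  # "true" effort for four hypothetical projects
    under  = [10.0] * 4                    # underestimates: off by a factor of 10
    over   = [250.0] * 4                   # overestimates: off by a factor of 2.5

    print(mmre(actual, under))             # 0.9
    print(mmre(actual, over))              # 1.5 -- MMRE prefers the model that is off by 10x

The exact numbers are unimportant; the point is that a single summary statistic of this kind can invert any intuitive ranking of which model is less wrong.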

Moreover, different authors apply very different statistical analyses, and no one method is generally accepted as “strongest” by a broad community:

  • Demsar documents the enormous range of statistical methods seen at one prominent international conference focusing on learning lessons from data [Demsar 2006].

  • Cohen discusses the use of standard statistical hypothesis testing for making scientific conclusions. He scathingly describes such testing as a “potent but sterile intellectual rake who leaves...no viable scientific offspring” [Cohen 1988].

In support of Cohen’s thesis, we offer the following salutary lesson. Writing in the field of marketing, Armstrong [Armstrong 2007] reviews one study that, using significance testing, concludes that estimates generated from multiple sources do no better than those generated from a single source. He then demolishes this conclusion by listing 31 studies in which multiple-source prediction consistently outperforms single-source prediction, by 3.4% to 23.4% (average = 12.5%). In every study Armstrong surveyed, the observed improvements contradict the conclusion drawn from the significance test.
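The mechanism behind such a demolition is easy to reproduce. The toy simulation below (in Python) uses sample sizes, noise levels, and a 12.5% improvement that are our own assumptions rather than Armstrong’s data; it shows how a real, consistent improvement can fail most per-study significance tests simply because each study is small, inviting the mistaken reading of “not significant” as “no effect.”

    # Toy simulation (our own assumptions, not Armstrong's data): many small
    # studies observe a real improvement, yet most per-study t-tests come back
    # "not significant" even though nearly every study points the same way.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_studies, n_per_group = 31, 10        # many studies, each with small samples
    improvement, noise = 0.125, 0.25       # assumed true effect and per-estimate noise

    favourable = significant = 0
    for _ in range(n_studies):
        single   = rng.normal(1.0, noise, n_per_group)                # single-source error
        combined = rng.normal(1.0 - improvement, noise, n_per_group)  # combined-source error
        t_stat, p_value = ttest_ind(single, combined)
        favourable  += combined.mean() < single.mean()
        significant += p_value < 0.05

    # Typically most studies favour combining, but only a handful reach p < 0.05.
    print(f"{favourable}/{n_studies} studies favour combining; "
          f"{significant}/{n_studies} reach p < 0.05")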

Based on these discoveries, we have changed our views on statistical analysis. Now, as far as possible, we use succinct visualizations to make our points, and demote statistical significance tests to the role of a “reasonableness test” for the conclusions drawn from those visualizations.[2]
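A minimal sketch of that workflow, again in Python and with invented method names and numbers, might look like the following: the box plot carries the argument, and the nonparametric test recommended in the footnote is run only as a sanity check on what the picture already shows.

    # Minimal sketch: visualization first, significance test as a reasonableness check.
    # The method names and performance numbers are invented for illustration.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    method_a = rng.normal(0.72, 0.05, 20)  # e.g., recall of predictor A over 20 runs
    method_b = rng.normal(0.65, 0.08, 20)  # e.g., recall of predictor B over 20 runs

    # 1. Let the picture make the point: medians and spread, not just means.
    plt.boxplot([method_a, method_b])
    plt.xticks([1, 2], ["method A", "method B"])
    plt.ylabel("recall")
    plt.savefig("comparison.png")

    # 2. Demote the statistics to a reasonableness test (unpaired data here, so
    #    Mann-Whitney; paired results would call for a Wilcoxon signed-rank test).
    u_stat, p_value = mannwhitneyu(method_a, method_b)
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")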

Challenges to Replicability of Results

Replicability has proved to be a very elusive goal. On certain topics, there is little evidence that results from one project have been, or can be, generalized to others:

  • Zimmermann studied 629 pairs of software development projects [Zimmermann 2009]. In only 4% of cases was a defect prediction model learned from one project useful on its pair.

  • A survey by Kitchenham et al. asked whether one project’s data is useful for effort estimation on a second one [Kitchenham et al. 2007]. They found the existing evidence inconclusive and even contradictory.

On other topics, we can find evidence that a certain effect holds across a number of different contexts: for example, a mature technique such as software inspections can find a significant proportion of the extant defects in a software work product [Shull 2002]. However, if a new study reports evidence to the contrary, it is still difficult to determine whether the new (or old) study was somehow flawed or whether the study was in fact run in a unique environment. Given the wide variation in the contexts in which software development is done, both conclusions are often equally plausible.

Indeed, it would seem that despite the intuitive appeal of replicable results, for any specific software engineering question there are either very few such studies or the studies that do exist are incomplete.

As an example of a lack of studies, Menzies studied 100 software quality assurance methods proposed by various groups (such as IEEE1017 and the internal NASA IV&V standards) and found no experiments showing that any method is more cost-effective than any other [Menzies et al. 2008].

As examples in which available studies were incomplete:

  • Zannier et al. studied a randomly selected subset of 5% of the papers ever published at ICSE, the self-described premier software engineering conference [Zannier et al. 2006]. They found that, of the papers claiming to be “empirical,” very few (2%) compare methods from multiple researchers.

  • Neto et al. reported a survey of the literature on Model-Based Testing (MBT) approaches [Neto et al. 2008]. They found 85 papers that described 71 distinct MBT approaches, and a very small minority of studies with any experimental component, indicating an overall tendency for researchers to continue innovating and reporting on new approaches rather than understanding the comparable practical benefits of existing ones.

To check whether these papers were isolated reports or part of a more general pattern, we reviewed all presentations made at the PROMISE[3] conference on repeatable software engineering experiments. Since 2005, at the PROMISE conference:

  • There have been 68 presentations, 48 of which either tried a new analysis on old data or made reports in the style of [Zannier et al. 2006]: i.e., that a new method worked for one particular project.

  • Nine papers raised questions about the validity of prior results (e.g., [Menzies 2009a]).

  • Four papers argued that the generality of software engineering models was unlikely or impossible (e.g., [Briand 2006]).

  • Only rarely (7 out of 68 presentations) did researchers report generalizations from one project to other projects:

    • Four papers reported that a software quality predictor learned from one project was usefully applied to a new project (e.g., [Weyuker et al. 2008] and [Tosun et al. 2009]).

    • Three papers made a partial case that such generality was possible (e.g., [Boehm 2009]).

Somewhat alarmed at these findings, we discussed them with leading empirical software engineering researchers in the United States and Europe. Our conversations can be summarized as follows:

  • Vic Basili has been a pioneer of empirical software engineering (SE) for over 30 years. After asserting that empirical software engineering is healthier now than in the 1980s, he acknowledged that (a) results thus far are incomplete, and (b) there are few examples of methods that are demonstrably useful on multiple projects [Basili 2009].

  • David Budgen, along with Barbara Kitchenham, is a leading European advocate of “evidence-based software engineering” (EBSE). In EBSE, the practices of software engineers should be based on methods with well-founded support in the literature. Budgen and Kitchenham ask, “Is evidence-based software engineering mature enough for practice and policy?” Their answer is “no, not yet”: the software engineering field needs to significantly restructure itself before we can show that results can be replicated in different projects [Budgen et al. 2009]. They argue for different reporting standards in software engineering, specifically, the use of “structured abstracts” to simplify the large-scale analysis of the SE literature.



[2] But we still reject about two journal submissions per month because they report only mean-value results, with no statistical or visual representation of the variance around that mean. At the very least, we recommend a Mann-Whitney or Wilcoxon test (for unpaired and paired results, respectively) to check that apparently different results really are different.
