Despite decades of software engineering research, we have so far seen relatively few examples of convincing evidence that have actually led to changes in how people run software projects. We speculate that this is due to the context problem: researchers have been generating evidence about A, while the audience cares about B, C, D, and so on. We recommend a little more humility among researchers exploring evidence about software engineering, at least as things stand now, and a willingness to mingle with and listen to software practitioners, who can help us figure out what B, C, and D actually are. We think our field may need to retire, at least for a time, the goal of seeking evidence about results that hold for all projects in all cases; finding local results that make a difference tends to be challenging enough.
Endres and Rombach proposed a view of how knowledge gets built about software and systems engineering [Endres and Rombach 2003]:
Observations of what actually happens during development in a specific context can happen all the time. (“Observation” in this case is defined to comprise both hard facts and subjective impressions.)
Recurring observations lead to laws that help us understand how things are likely to occur in the future.
Laws are accounted for by theories that explain why those events happen.
Given the complexity of the issues currently being tackled by empirical research, we’d like to slow the rush to theory-building. A more productive model of knowledge-building based on all the data and studies we’re currently seeing would be a two-tiered approach of “observations” and “laws” (to use Endres’s terminology), supported by the repositories that we described earlier.
For the first tier, a researcher would model the goals of the local audience and then collect evidence focused on those goals. The ready availability of modern data mining tools can help engineers learn such local lessons.
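To make the idea of a "local lesson" concrete, here is a minimal sketch of mining a project's own defect records to find recurring trouble spots. The data, column meanings, and threshold are entirely hypothetical; the point is only that a first-tier observation can be as simple as counting what happens in one context.

```python
# A toy "local lesson" miner over a hypothetical defect log.
# The records and the >= 2 threshold are illustrative assumptions,
# not taken from any particular project or tool.
from collections import Counter

# Hypothetical records: (module, severity)
defect_log = [
    ("parser", "high"), ("parser", "low"), ("ui", "low"),
    ("parser", "high"), ("network", "high"), ("ui", "low"),
]

# Local observation: which modules account for recurring defects here?
by_module = Counter(module for module, _ in defect_log)
hotspots = [m for m, n in by_module.most_common() if n >= 2]

print(hotspots)  # modules with repeated defects in this context
```

A result like this says nothing about other projects; it is an observation tied to one context, which is exactly the scope the first tier is meant to have.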
On the second tier, where we abstract conclusions across projects and/or contexts, we may have to be content for now to abstract the important factors or basic principles in an area, not to provide “the” solution that will work across some subset of contexts.
For example, Hall et al. tried to answer the question of what motivates software developers [Hall et al. 2008]. They looked across 92 studies that had examined this question, each in a different context. In wrestling with all of those studies and their various results, the researchers focused on identifying which factors seem to motivate developers, even though quantifying how much each factor contributes wasn't feasible. Thus, factors that were found to contribute to motivation in multiple studies could be included in this model with some confidence.
The end result was not a predictive model that said factor X was twice as important as factor Y; rather, it was a checklist of important factors that managers could use in their own context to make sure they hadn’t neglected or forgotten something. Maybe the best we can do on many questions is to arm practitioners who have questions with the factors they need to consider in finding their own solutions.
To do this, we need to broaden the definition of what “evidence” is acceptable for the first tier of observations. Some researchers have long argued that software engineering research shows a bias toward quantitative data and analysis, but that qualitative work can be just as rigorous and can provide useful answers to relevant questions [Seaman 2007]. A start toward building more robust collections of evidence would be to truly broaden the definition of acceptable evidence to incorporate qualitative as well as quantitative sources—that is, more textual or graphical data about why technologies do or don’t work, in addition to quantitative data that measures what the technologies’ effects are.
However, truly convincing bodies of evidence would go even further and accept different types of evidence entirely—not just research studies that try to find statistical significance, but also reports of investigators’ experience that can provide more information about the practical application of technologies. Such experience reports are currently underrated because they suffer from being less rigorous than much of the existing literature. For example, it is not always possible to have confidence that aspects of interest have been measured precisely, that confounding factors have been excluded, or that process conformance issues have been avoided. However, such reports should be an explicit part of the “evidence trail” of any software development technology.
As Endres indicated by defining “observations” broadly enough to include subjective impressions, even less rigorous forms of input can help point us to valid conclusions—if they are tagged as such so that we don’t create laws with an unwarranted sense of confidence. Applying a confidence rating to these sources of evidence is important so that those methodological issues can be highlighted. (Of course, even the most rigorous research study is unlikely to be completely free of any methodological issues.)
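The idea of tagging evidence with a confidence rating can be sketched in a few lines. Everything here is an illustrative assumption: the three source kinds, their ordering, and the rule that a claim needs at least one higher-confidence source before it is treated as more than anecdote.

```python
# A toy evidence trail where each item is tagged by source kind.
# The kinds, their numeric ordering, and the support rule are
# assumptions for illustration, not a proposed standard.
CONFIDENCE = {"controlled_study": 3, "case_study": 2, "experience_report": 1}

evidence = [
    {"claim": "inspections find defects early", "kind": "controlled_study"},
    {"claim": "inspections find defects early", "kind": "experience_report"},
    {"claim": "tool X halves build time", "kind": "experience_report"},
]

def well_supported(claim):
    # Keep experience reports in the trail, but only treat a claim as
    # well supported once a higher-confidence source corroborates them.
    kinds = {e["kind"] for e in evidence if e["claim"] == claim}
    return any(CONFIDENCE[k] >= 2 for k in kinds)

print(well_supported("inspections find defects early"))  # True
print(well_supported("tool X halves build time"))        # False
```

The design choice worth noting is that the lower-confidence reports are never discarded; they stay in the trail, tagged, so they can corroborate rather than substitute for rigorous studies.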
Experience reports can keep research grounded by demonstrating that what is achieved under practical constraints may not always match our expectations for a given technology. Just as importantly, they provide insights about how technologies need to be tailored or adapted to meet the practical constraints of day-to-day software development. In the area of tech transfer, we often find that a single good case study that recounts positive experiences with a technology in practice can be more valuable to practitioners than a multitude of additional research reports.
Such convincing bodies of evidence can also help drive change by reaching different types of users. Rogers proposed an oft-cited model that characterized consumers of some innovation (say, research results) along a bell curve: the leftmost tail comprises the innovators, the “bulge” of the bell represents the majority who adopt somewhere along the middle of the curve as the idea increasingly catches on, and the rightmost tail contains the “laggards” who resist change [Rogers 1962]. Researchers who know their audience can select appropriate subsets of data from a truly robust set in order to make their case:
Early adopters may find a small set of relatively low-confidence feasibility studies, or one or two “elegant” research studies, sufficient for them to adopt a change for themselves—especially if the context of those studies shows that the evidence was gathered in an environment somewhat like their own.
The majority of adopters may need to see a mix of studies from different contexts to convince themselves that a research idea has merit, has been proven feasible in more than just a niche environment, and has started to become part of the accepted way of doing things.
Laggards or late adopters may require overwhelming evidence: more than a handful of high-confidence studies, spanning a wide range of contexts in which beneficial results have been obtained.
Some ways that such different types of evidence can be combined have been proposed [Shull et al. 2001]. But what matters most is the interplay of having both types: qualitative reports as well as quantitative data. We have often seen that practitioners do respond better to data sets that have mixes of different types of data. For example, having a rich set of both hard data and positive experiences from real teams helped software inspection technologies diffuse across multiple centers of NASA [Shull and Seaman 2008].
Evidence from well-designed empirical studies can help a finding reach statistical significance but can still be impractical to implement or too slow at delivering timely answers (especially in an area of rapid technological change, such as software engineering). Evidence from application in practice can be more convincing of the benefits but is typically less rigorous. Often it is a combination of evidence from both sources—where one corroborates the other—that is ultimately convincing.