Overview of the Prediction Model

Fault prediction for a given release N of a system is based on a model built by analyzing the code properties, process properties, and fault counts from earlier releases. The fundamental assumption is that properties associated with faulty files in earlier releases will also be associated with faulty files in the next release. A model consists of an equation whose variables are the code and process properties for a file of release N, and whose value is a predicted number of faults for the file. Creation of the model requires data from at least two prior releases, but can make use of data from as many prior releases as are available. The data used to construct a model are referred to as training data.
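The model-building step described above can be sketched in code. The feature names, the sample data, and the use of a plain least-squares linear model are illustrative assumptions for this sketch; the chapter does not specify the exact regression form here, only that the model maps a file's code and process properties to a predicted fault count.

```python
import numpy as np

# Hypothetical training data from releases 1..N-1.  Each row describes one
# file: [lines_of_code, faults_in_prior_release, releases_since_creation].
# All names and values are illustrative, not taken from the study.
X_train = np.array([
    [1200, 3, 1],
    [ 400, 0, 4],
    [2500, 5, 2],
    [ 800, 1, 3],
    [ 150, 0, 5],
], dtype=float)
# Observed fault counts for those files (the training targets).
y_train = np.array([4, 0, 7, 1, 0], dtype=float)

# Fit a simple linear model with an intercept by least squares.  This stands
# in for whatever regression the authors actually use to relate file
# properties to fault counts.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def predict_faults(props):
    """Predicted fault count for a release-N file with the given properties."""
    return float(np.append(np.asarray(props, dtype=float), 1.0) @ coef)
```

Sorting the files of release N by `predict_faults` in descending order yields the prediction list that the evaluation below is based on.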

Because we have fault data available for the entire history of these systems, we can generate predictions for release N using information from releases 1 through N – 1, and then check the prediction quality by looking at the actual numbers of faults that occur in files of release N. Our fundamental way of evaluating the quality of the predictions for release N is to measure how many of the faults actually detected in N occurred in the files that appear in the first 20% of the prediction list. Table 9-2 shows that the top 20% always includes at least 75% of the faults, and usually more than 80%. The table provides compelling evidence that the prediction method will be applicable to other large systems.
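The evaluation criterion in the paragraph above is simple to state precisely: rank the files of release N by predicted fault count, take the top 20%, and compute what fraction of the release's actual faults those files contain. A minimal sketch, with made-up prediction and fault-count vectors:

```python
def top20_fault_coverage(predicted, actual):
    """Fraction of actual faults that fall in the top 20% of files
    when files are ranked by predicted fault count, highest first."""
    order = sorted(range(len(predicted)),
                   key=lambda i: predicted[i], reverse=True)
    cutoff = max(1, round(0.2 * len(predicted)))   # top 20% of the list
    top = order[:cutoff]
    total = sum(actual)
    return sum(actual[i] for i in top) / total if total else 0.0

# Illustrative release with 10 files: the two files predicted to be
# faultiest account for 14 of the 16 faults actually detected.
pred = [9.1, 0.2, 7.5, 0.1, 0.3, 0.0, 1.2, 0.4, 0.2, 0.1]
act  = [8,   0,   6,   0,   1,   0,   1,   0,   0,   0]
```

For this made-up data the top 20% (2 of 10 files) captures 14/16 = 87.5% of the faults, in the same range as the results the chapter reports in Table 9-2.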

You might be wondering why we have chosen to use the ...
