Let us now describe how to retrieve actual historic data. Each software repository is different, and so are the mining steps necessary to extract the relevant data points. Still, most mining tools and setups share a common procedure. Here we give a step-by-step guide to mining a software repository using a real-world example: the IBM Eclipse project. Eclipse is open source software and has been the subject of many empirical software engineering research projects.
Thus, Eclipse is an ideal candidate for a hands-on example. But even though it is an open source project that exposes all the data about bugs and their fixes, the unstructured ways in which this information was collected make it an interesting and challenging case study.
Historic data for a software project is preserved through many different activities in many different systems (e.g., version control systems, bug tracking systems, and email messages). To extract and learn from the history of a software project, you have to access these resources. For many open source systems such as Eclipse, most of these resources are publicly available and can be accessed easily.
By working through the steps detailed in the following sections, we will extract and link Eclipse history, process data, and bug data that can be used for various kinds of defect prediction or process analysis. Normally, the results are stored in persistent data storage systems (e.g., relational databases) that allow further analysis steps and manual inspection. ...
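To make the idea of a persistent, linked data store more concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is an illustrative assumption rather than the setup used for Eclipse: the table names, the sample records, and the heuristic that commit messages mention bug reports as "bug <number>" are all hypothetical.

```python
import re
import sqlite3

# Hypothetical sample records; real data would come from the version
# control system and the bug tracker.
COMMITS = [
    ("c1", "alice", "Fix for bug 4711: NPE in editor"),
    ("c2", "bob", "Refactor preference page"),
    ("c3", "alice", "Bug 4712 and bug 4711 follow-up"),
]
BUGS = [(4711, "major"), (4712, "minor")]

# Assumed convention: commit messages refer to reports as "bug <number>".
BUG_REF = re.compile(r"\bbug\s+#?(\d+)", re.IGNORECASE)


def build_history_db(commits, bugs):
    """Store commits and bugs, then link them via bug IDs in messages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE commits (
            commit_id TEXT PRIMARY KEY, author TEXT, message TEXT);
        CREATE TABLE bugs (
            bug_id INTEGER PRIMARY KEY, severity TEXT);
        CREATE TABLE commit_bug_links (
            commit_id TEXT REFERENCES commits(commit_id),
            bug_id INTEGER REFERENCES bugs(bug_id),
            PRIMARY KEY (commit_id, bug_id));
    """)
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", commits)
    conn.executemany("INSERT INTO bugs VALUES (?, ?)", bugs)

    known_bugs = {bug_id for bug_id, _severity in bugs}
    for commit_id, _author, message in commits:
        # Deduplicate references and keep only IDs that exist in the tracker.
        referenced = {int(ref) for ref in BUG_REF.findall(message)}
        for bug_id in sorted(referenced & known_bugs):
            conn.execute(
                "INSERT INTO commit_bug_links VALUES (?, ?)",
                (commit_id, bug_id),
            )
    conn.commit()
    return conn


def fixes_per_bug(conn):
    """Return (bug_id, number of linked commits) pairs, ordered by bug ID."""
    rows = conn.execute("""
        SELECT b.bug_id, COUNT(l.commit_id)
        FROM bugs b
        LEFT JOIN commit_bug_links l ON l.bug_id = b.bug_id
        GROUP BY b.bug_id
        ORDER BY b.bug_id
    """)
    return rows.fetchall()


if __name__ == "__main__":
    connection = build_history_db(COMMITS, BUGS)
    for bug_id, commit_count in fixes_per_bug(connection):
        print(bug_id, commit_count)
```

Keeping the links in their own table lets one commit reference several bugs and one bug be touched by several commits, and it leaves the raw commit and bug records untouched for manual inspection.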