The Art and Science of Analyzing Software Data

Book Description

The Art and Science of Analyzing Software Data describes the analysis techniques most often used to derive insight from software data. The book shares best practices from leading data scientists, drawn from their experience training software engineering students and practitioners in data science.

The book covers topics such as the analysis of security data, code reviews, app stores, log files, and user telemetry, among others. It presents a wide variety of techniques, including co-change analysis, text analysis, topic analysis, and concept analysis, as well as advanced topics such as release planning and the generation of source code comments. It also includes stories from the trenches by expert data scientists, illustrating how to apply data analysis in industry and open source, present results to stakeholders, and drive decisions.

  • Presents best practices, hints, and tips to analyze data and apply tools in data science projects
  • Presents research methods and case studies that have emerged over the past few years to further the understanding of software data
  • Shares stories from the trenches of successful data science initiatives in industry

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. List of Contributors
  6. Chapter 1: Past, Present, and Future of Analyzing Software Data
    1. Abstract
    2. Acknowledgments
    3. 1.1 Definitions
    4. 1.2 The Past: Origins
    5. 1.3 Present Day
    6. 1.4 Conclusion
  7. Part 1: Tutorial-Techniques
    1. Chapter 2: Mining Patterns and Violations Using Concept Analysis
      1. Abstract
      2. Acknowledgments
      3. 2.1 Introduction
      4. 2.2 Patterns and Blocks
      5. 2.3 Computing All Blocks
      6. 2.4 Mining Shopping Carts with Colibri
      7. 2.5 Violations
      8. 2.6 Finding Violations
      9. 2.7 Two Patterns or One Violation?
      10. 2.8 Performance
      11. 2.9 Encoding Order
      12. 2.10 Inlining
      13. 2.11 Related Work
      14. 2.12 Conclusions
    2. Chapter 3: Analyzing Text in Software Projects
      1. Abstract
      2. 3.1 Introduction
      3. 3.2 Textual Software Project Data and Retrieval
      4. 3.3 Manual Coding
      5. 3.4 Automated Analysis
      6. 3.5 Two Industrial Studies
      7. 3.6 Summary
    3. Chapter 4: Synthesizing Knowledge from Software Development Artifacts
      1. Abstract
      2. 4.1 Problem Statement
      3. 4.2 Artifact Lifecycle Models
      4. 4.3 Code Review
      5. 4.4 Lifecycle Analysis
      6. 4.5 Other Applications
      7. 4.6 Conclusion
    4. Chapter 5: A Practical Guide to Analyzing IDE Usage Data
      1. Abstract
      2. Acknowledgments
      3. 5.1 Introduction
      4. 5.2 Usage Data Research Concepts
      5. 5.3 How to Collect Data
      6. 5.4 How to Analyze Usage Data
      7. 5.5 Limits of What You Can Learn from Usage Data
      8. 5.6 Conclusion
      9. 5.7 Code Listings
    5. Chapter 6: Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data
      1. Abstract
      2. 6.1 Introduction
      3. 6.2 Applications of LDA in Software Analysis
      4. 6.3 How LDA Works
      5. 6.4 LDA Tutorial
      6. 6.5 Pitfalls and Threats to Validity
      7. 6.6 Conclusions
    6. Chapter 7: Tools and Techniques for Analyzing Product and Process Data
      1. Abstract
      2. 7.1 Introduction
      3. 7.2 A Rational Analysis Pipeline
      4. 7.3 Source Code Analysis
      5. 7.4 Compiled Code Analysis
      6. 7.5 Analysis of Configuration Management Data
      7. 7.6 Data Visualization
      8. 7.7 Concluding Remarks
  8. Part 2: Data/Problem Focussed
    1. Chapter 8: Analyzing Security Data
      1. Abstract
      2. 8.1 Vulnerability
      3. 8.2 Security Data “Gotchas”
      4. 8.3 Measuring Vulnerability Severity
      5. 8.4 Method of Collecting and Analyzing Vulnerability Data
      6. 8.5 What Security Data has Told Us Thus Far
      7. 8.6 Summary
    2. Chapter 9: A Mixed Methods Approach to Mining Code Review Data: Examples and a Study of Multicommit Reviews and Pull Requests
      1. Abstract
      2. 9.1 Introduction
      3. 9.2 Motivation for a Mixed Methods Approach
      4. 9.3 Review Process and Data
      5. 9.4 Quantitative Replication Study: Code Review on Branches
      6. 9.5 Qualitative Approaches
      7. 9.6 Triangulation
      8. 9.7 Conclusion
    3. Chapter 10: Mining Android Apps for Anomalies
      1. Abstract
      2. Acknowledgments
      3. 10.1 Introduction
      4. 10.2 Clustering Apps by Description
      5. 10.3 Identifying Anomalies by APIs
      6. 10.4 Evaluation
      7. 10.5 Related Work
      8. 10.6 Conclusion and Future Work
    4. Chapter 11: Change Coupling Between Software Artifacts: Learning from Past Changes
      1. Abstract
      2. 11.1 Introduction
      3. 11.2 Change Coupling
      4. 11.3 Change Coupling Identification Approaches
      5. 11.4 Challenges in Change Coupling Identification
      6. 11.5 Change Coupling Applications
      7. 11.6 Conclusion
  9. Part 3: Stories from the Trenches
    1. Chapter 12: Applying Software Data Analysis in Industry Contexts: When Research Meets Reality
      1. Abstract
      2. 12.1 Introduction
      3. 12.2 Background
      4. 12.3 Six Key Issues when Implementing a Measurement Program in Industry
      5. 12.4 Conclusions
    2. Chapter 13: Using Data to Make Decisions in Software Engineering: Providing a Method to our Madness
      1. Abstract
      2. 13.1 Introduction
      3. 13.2 Short History of Software Engineering Metrics
      4. 13.3 Establishing Clear Goals
      5. 13.4 Review of Metrics
      6. 13.5 Challenges with Data Analysis on Software Projects
      7. 13.6 Example of Changing Product Development Through the Use of Data
      8. 13.7 Driving Software Engineering Processes with Data
    3. Chapter 14: Community Data for OSS Adoption Risk Management
      1. Abstract
      2. Acknowledgments
      3. 14.1 Introduction
      4. 14.2 Background
      5. 14.3 An Approach to OSS Risk Adoption Management
      6. 14.4 OSS Communities Structure and Behavior Analysis: The XWiki Case
      7. 14.5 A Risk Assessment Example: The Moodbile Case
      8. 14.6 Related Work
      9. 14.7 Conclusions
    4. Chapter 15: Assessing the State of Software in a Large Enterprise: A 12-Year Retrospective
      1. Abstract
      2. Acknowledgments
      3. 15.1 Introduction
      4. 15.2 Evolution of the Process and the Assessment
      5. 15.3 Impact Summary of the State of Avaya Software Report
      6. 15.4 Assessment Approach and Mechanisms
      7. 15.5 Data Sources
      8. 15.6 Examples of Analyses
      9. 15.7 Software Practices
      10. 15.8 Assessment Follow-up: Recommendations and Impact
      11. 15.9 Impact of the Assessments
      12. 15.10 Conclusions
      13. 15.11 Appendix
      14. Author Biographies
    5. Chapter 16: Lessons Learned from Software Analytics in Practice
      1. Abstract
      2. 16.1 Introduction
      3. 16.2 Problem Selection
      4. 16.3 Data Collection
      5. 16.4 Descriptive Analytics
      6. 16.5 Predictive Analytics
      7. 16.6 Road Ahead
  10. Part 4: Advanced Topics
    1. Chapter 17: Code Comment Analysis for Improving Software Quality
      1. Abstract
      2. 17.1 Introduction
      3. 17.2 Text Analytics: Techniques, Tools, and Measures
      4. 17.3 Studies of Code Comments
      5. 17.4 Automated Code Comment Analysis for Specification Mining and Bug Detection
      6. 17.5 Studies and Analysis of API Documentation
      7. 17.6 Future Directions and Challenges
    2. Chapter 18: Mining Software Logs for Goal-Driven Root Cause Analysis
      1. Abstract
      2. 18.1 Introduction
      3. 18.2 Approaches to Root Cause Analysis
      4. 18.3 Root Cause Analysis Framework Overview
      5. 18.4 Modeling Diagnostics for Root Cause Analysis
      6. 18.5 Log Reduction
      7. 18.6 Reasoning Techniques
      8. 18.7 Root Cause Analysis for Failures Induced by Internal Faults
      9. 18.8 Root Cause Analysis for Failures due to External Threats
      10. 18.9 Experimental Evaluations
      11. 18.10 Conclusions
    3. Chapter 19: Analytical Product Release Planning
      1. Abstract
      2. Acknowledgments
      3. 19.1 Introduction and Motivation
      4. 19.2 Taxonomy of Data-intensive Release Planning Problems
      5. 19.3 Information Needs for Software Release Planning
      6. 19.4 The Paradigm of Analytical Open Innovation
      7. Analysis phase
      8. Synthesize phase
      9. 19.5 Analytical Release Planning—A Case Study
      10. 19.6 Summary and Future Research
      11. 19.7 Appendix: Feature Dependency Constraints
  11. Part 5: Data Analysis at Scale (Big Data)
    1. Chapter 20: Boa: An Enabling Language and Infrastructure for Ultra-Large-Scale MSR Studies
      1. Abstract
      2. 20.1 Objectives
      3. 20.2 Getting Started with Boa
      4. 20.3 Boa’s Syntax and Semantics
      5. 20.4 Mining Project and Repository Metadata
      6. 20.5 Mining Source Code with Visitors
      7. 20.6 Guidelines for Replicable Research
      8. 20.7 Conclusions
      9. 20.8 Practice Problems
      10. Project and Repository Metadata Problems
      11. Source Code Problems
    2. Chapter 21: Scalable Parallelization of Specification Mining Using Distributed Computing
      1. Abstract
      2. 21.1 Introduction
      3. 21.2 Background
      4. 21.3 Distributed Specification Mining
      5. 21.4 Implementation and Empirical Evaluation
      6. 21.5 Related Work
      7. 21.6 Conclusion and Future Work