Perspectives on Data Science for Software Engineering

Book Description

Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book was created during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics.

At the 2014 conference, the question of how to transfer knowledge from seasoned software engineers and data scientists to newcomers in the field came up in many discussions. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community’s leaders, gathered to share hard-won lessons from the trenches.

Ideas are presented in digestible chapters designed to be applicable across many domains. Topics include data collection, data sharing, data mining, and how to use these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.

  • Presents the wisdom of community experts, derived from a summit on software analytics
  • Provides contributed chapters that share discrete ideas and techniques from the trenches
  • Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
  • Presents ideas in clear chapters designed to be applicable across many domains

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Contributors
  6. Acknowledgments
  7. Introduction
    1. Perspectives on data science for software engineering
      1. Abstract
      2. Why This Book?
      3. About This Book
      4. The Future
    2. Software analytics and its application in practice
      1. Abstract
      2. Six Perspectives of Software Analytics
      3. Experiences in Putting Software Analytics into Practice
    3. Seven principles of inductive software engineering: What we do is different
      1. Abstract
      2. Different and Important
      3. Principle #1: Humans Before Algorithms
      4. Principle #2: Plan for Scale
      5. Principle #3: Get Early Feedback
      6. Principle #4: Be Open Minded
      7. Principle #5: Be Smart With Your Learning
      8. Principle #6: Live With the Data You Have
      9. Principle #7: Develop a Broad Skill Set That Uses a Big Toolkit
    4. The need for data analysis patterns (in software engineering)
      1. Abstract
      2. The Remedy Metaphor
      3. Software Engineering Data
      4. Needs of Data Analysis Patterns
      5. Building Remedies for Data Analysis in Software Engineering Research
    5. From software data to software theory: The path less traveled
      1. Abstract
      2. Pathways of Software Repository Research
      3. From Observation, to Theory, to Practice
    6. Why theory matters
      1. Abstract
      2. Introduction
      3. How to Use Theory
      4. How to Build Theory
      5. In Summary: Find a Theory or Build One Yourself
  8. Success Stories/Applications
    1. Mining apps for anomalies
      1. Abstract
      2. The Million-Dollar Question
      3. App Mining
      4. Detecting Abnormal Behavior
      5. A Treasure Trove of Data …
      6. … but Also Obstacles
      7. Executive Summary
    2. Embrace dynamic artifacts
      1. Abstract
      2. Acknowledgments
      3. Can We Minimize the USB Driver Test Suite?
      4. Still Not Convinced? Here’s More
      5. Dynamic Artifacts Are Here to Stay
    3. Mobile app store analytics
      1. Abstract
      2. Introduction
      3. Understanding End Users
      4. Conclusion
    4. The naturalness of software
      1. Abstract
      2. Introduction
      3. Transforming Software Practice
      4. Conclusion
    5. Advances in release readiness
      1. Abstract
      2. Predictive Test Metrics
      3. Universal Release Criteria Model
      4. Best Estimation Technique
      5. Resource/Schedule/Content Model
      6. Using Models in Release Management
      7. Research to Implementation: A Difficult (but Rewarding) Journey
    6. How to tame your online services
      1. Abstract
      2. Background
      3. Service Analysis Studio
      4. Success Story
    7. Measuring individual productivity
      1. Abstract
      2. No Single and Simple Best Metric for Success/Productivity
      3. Measure the Process, Not Just the Outcome
      4. Allow for Measures to Evolve
      5. Goodhart’s Law and the Effect of Measuring
      6. How to Measure Individual Productivity?
    8. Stack traces reveal attack surfaces
      1. Abstract
      2. Another Use of Stack Traces?
      3. Attack Surface Approximation
    9. Visual analytics for software engineering data
      1. Abstract
    10. Gameplay data plays nicer when divided into cohorts
      1. Abstract
      2. Cohort Analysis as a Tool for Gameplay Data
      3. Play to Lose
      4. Forming Cohorts
      5. Case Studies of Gameplay Data
      6. Challenges of Using Cohorts
      7. Summary
    11. A success story in applying data science in practice
      1. Abstract
      2. Overview
      3. Analytics Process
      4. Communication Process—Best Practices
      5. Summary
    12. There's never enough time to do all the testing you want
      1. Abstract
      2. The Impact of Short Release Cycles (There's Not Enough Time)
      3. Learn From Your Test Execution History
      4. The Art of Testing Less
      5. Tests Evolve Over Time
      6. In Summary
    13. The perils of energy mining: measure a bunch, compare just once
      1. Abstract
      2. A Tale of Two HTTPs
      3. Let's ENERGISE Your Software Energy Experiments
      4. Summary
    14. Identifying fault-prone files in large industrial software systems
      1. Abstract
      2. Acknowledgment
    15. A tailored suit: The big opportunity in personalizing issue tracking
      1. Abstract
      2. Many Choices, Nothing Great
      3. The Need for Personalization
      4. Developer Dashboards or “A Tailored Suit”
      5. Room for Improvement
    16. What counts is decisions, not numbers—Toward an analytics design sheet
      1. Abstract
      2. Decisions Everywhere
      3. The Decision-Making Process
      4. The Analytics Design Sheet
      5. Example: App Store Release Analysis
    17. A large ecosystem study to understand the effect of programming languages on code quality
      1. Abstract
      2. Comparing Languages
      3. Study Design and Analysis
      4. Results
      5. Summary
    18. Code reviews are not for finding defects—Even established tools need occasional evaluation
      1. Abstract
      2. Results
      3. Effects
      4. Conclusions
  9. Techniques
    1. Interviews
      1. Abstract
      2. Why Interview?
      3. The Interview Guide
      4. Selecting Interviewees
      5. Recruitment
      6. Collecting Background Data
      7. Conducting the Interview
      8. Post-Interview Discussion and Notes
      9. Transcription
      10. Analysis
      11. Reporting
      12. Now Go Interview!
    2. Look for state transitions in temporal data
      1. Abstract
      2. Bikeshedding in Software Engineering
      3. Summarizing Temporal Data
      4. Recommendations
    3. Card-sorting: From text to themes
      1. Abstract
      2. Preparation Phase
      3. Execution Phase
      4. Analysis Phase
    4. Tools! Tools! We need tools!
      1. Abstract
      2. Tools in Science
      3. The Tools We Need
      4. Recommendations for Tool Building
    5. Evidence-based software engineering
      1. Abstract
      2. Introduction
      3. The Aim and Methodology of EBSE
      4. Contextualizing Evidence
      5. Strength of Evidence
      6. Evidence and Theory
    6. Which machine learning method do you need?
      1. Abstract
      2. Learning Styles
      3. Do Additional Data Arrive Over Time?
      4. Are Changes Likely to Happen Over Time?
      5. If You Have a Prediction Problem, What Do You Really Need to Predict?
      6. Do You Have a Prediction Problem Where Unlabeled Data are Abundant and Labeled Data are Expensive?
      7. Are Your Data Imbalanced?
      8. Do You Need to Use Data From Different Sources?
      9. Do You Have Big Data?
      10. Do You Have Little Data?
      11. In Summary…
    7. Structure your unstructured data first!: The case of summarizing unstructured data with tag clouds
      1. Abstract
      2. Unstructured Data in Software Engineering
      3. Summarizing Unstructured Software Data
      4. Conclusion
    8. Parse that data! Practical tips for preparing your raw data for analysis
      1. Abstract
      2. Use Assertions Everywhere
      3. Print Information About Broken Records
      4. Use Sets or Counters to Store Occurrences of Categorical Variables
      5. Restart Parsing in the Middle of the Data Set
      6. Test on a Small Subset of Your Data
      7. Redirect Stdout and Stderr to Log Files
      8. Store Raw Data Alongside Cleaned Data
      9. Finally, Write a Verifier Program to Check the Integrity of Your Cleaned Data
    9. Natural language processing is no free lunch
      1. Abstract
      2. Natural Language Data in Software Projects
      3. Natural Language Processing
      4. How to Apply NLP to Software Projects
      5. Summary
    10. Aggregating empirical evidence for more trustworthy decisions
      1. Abstract
      2. What's Evidence?
      3. What Does Data From Empirical Studies Look Like?
      4. The Evidence-Based Paradigm and Systematic Reviews
      5. How Far Can We Use the Outcomes From Systematic Review to Make Decisions?
    11. If it is software engineering, it is (probably) a Bayesian factor
      1. Abstract
      2. Causing the Future With Bayesian Networks
      3. The Need for a Hybrid Approach in Software Analytics
      4. Use the Methodology, Not the Model
    12. Becoming Goldilocks: Privacy and data sharing in “just right” conditions
      1. Abstract
      2. Acknowledgments
      3. The “Data Drought”
      4. Change is Good
      5. Don’t Share Everything
      6. Share Your Leaders
      7. Summary
    13. The wisdom of the crowds in predictive modeling for software engineering
      1. Abstract
      2. The Wisdom of the Crowds
      3. So… How is That Related to Predictive Modeling for Software Engineering?
      4. Examples of Ensembles and Factors Affecting Their Accuracy
      5. Crowds for Transferring Knowledge and Dealing With Changes
      6. Crowds for Multiple Goals
      7. A Crowd of Insights
      8. Ensembles as Versatile Tools
    14. Combining quantitative and qualitative methods (when mining software data)
      1. Abstract
      2. Prologue: We Have Solid Empirical Evidence!
      3. Correlation is Not Causation and, Even If We Can Claim Causation…
      4. Collect Your Data: People and Artifacts
      5. Build a Theory Upon Your Data
      6. Conclusion: The Truth is Out There!
      7. Suggested Readings
    15. A process for surviving survey design and sailing through survey deployment
      1. Abstract
      2. Acknowledgments
      3. The Lure of the Sirens: The Attraction of Surveys
      4. Navigating the Open Seas: A Successful Survey Process in Software Engineering
      5. In Summary
  10. Wisdom
    1. Log it all?
      1. Abstract
      2. A Parable: The Blind Woman and an Elephant
      3. Misinterpreting Phenomenon in Software Engineering
      4. Using Data to Expand Perspectives
      5. Recommendations
    2. Why provenance matters
      1. Abstract
      2. What’s Provenance?
      3. What are the Key Entities?
      4. What are the Key Tasks?
      5. Another Example
      6. Looking Ahead
    3. Open from the beginning
      1. Abstract
      2. Alitheia Core
      3. GHTorrent
      4. Why the Difference?
      5. Be Open or Be Irrelevant
    4. Reducing time to insight
      1. Abstract
      2. What is Insight Anyway?
      3. Time to Insight
      4. The Insight Value Chain
      5. What To Do
      6. A Warning on Waste
    5. Five steps for success: How to deploy data science in your organizations
      1. Abstract
      2. Step 1. Choose the Right Questions for the Right Team
      3. Step 2. Work Closely With Your Consumers
      4. Step 3. Validate and Calibrate Your Data
      5. Step 4. Speak Plainly to Give Results Business Value
      6. Step 5. Go the Last Mile—Operationalizing Predictive Models
    6. How the release process impacts your software analytics
      1. Abstract
      2. Linking Defect Reports and Code Changes to a Release
      3. How the Version Control System Can Help
    7. Security cannot be measured
      1. Abstract
      2. Gotcha #1: Security is Negatively Defined
      3. Gotcha #2: Having Vulnerabilities is Actually Normal
      4. Gotcha #3: “More Vulnerabilities” Does not Always Mean “Less Secure”
      5. Gotcha #4: Design Flaws are not Usually Tracked
      6. Gotcha #5: Hackers are Innovative Too
      7. An Unfair Question
    8. Gotchas from mining bug reports
      1. Abstract
      2. Do Bug Reports Describe Code Defects?
      3. It's the User That Defines the Work Item Type
      4. Do Developers Apply Atomic Changes?
      5. In Summary
    9. Make visualization part of your analysis process
      1. Abstract
      2. Leveraging Visualizations: An Example With Software Repository Histories
      3. How to Jump the Pitfalls
    10. Don't forget the developers! (and be careful with your assumptions)
      1. Abstract
      2. Acknowledgments
      3. Disclaimer
      4. Background
      5. Are We Actually Helping Developers?
      6. Some Observations and Recommendations
    11. Limitations and context of research
      1. Abstract
      2. Small Research Projects
      3. Data Quality of Open Source Repositories
      4. Lack of Industrial Representatives at Conferences
      5. Research From Industry
      6. Summary
    12. Actionable metrics are better metrics
      1. Abstract
      2. What Would You Say… I Should DO?
      3. The Offenders
      4. Actionable Heroes
      5. Cyclomatic Complexity: An Interesting Case
      6. Are Unactionable Metrics Useless?
    13. Replicated results are more trustworthy
      1. Abstract
      2. The Replication Crisis
      3. Reproducible Studies
      4. Reliability and Validity in Studies
      5. So What Should Researchers Do?
      6. So What Should Practitioners Do?
    14. Diversity in software engineering research
      1. Abstract
      2. Introduction
      3. What Is Diversity and Representativeness?
      4. What Can We Do About It?
      5. Evaluation
      6. Recommendations
      7. Future Work
    15. Once is not enough: Why we need replication
      1. Abstract
      2. Motivating Example and Tips
      3. Exploring the Unknown
      4. Types of Empirical Results
      5. Do's and Don't's
    16. Mere numbers aren't enough: A plea for visualization
      1. Abstract
      2. Numbers Are Good, but…
      3. Case Studies on Visualization
      4. What to Do
    17. Don’t embarrass yourself: Beware of bias in your data
      1. Abstract
      2. Dewey Defeats Truman
      3. Impact of Bias in Software Engineering
      4. Identifying Bias
      5. Assessing Impact
      6. Which Features Should I Look At?
    18. Operational data are missing, incorrect, and decontextualized
      1. Abstract
      2. Background
      3. Examples
      4. A Life of a Defect
      5. What to Do?
    19. Data science revolution in process improvement and assessment?
      1. Abstract
    20. Correlation is not causation (or, when not to scream “Eureka!”)
      1. Abstract
      2. What Not to Do
      3. Example
      4. Examples from Software Engineering
      5. What to Do
      6. In Summary: Wait and Reflect Before You Report
    21. Software analytics for small software companies: More questions than answers
      1. Abstract
      2. The Reality for Small Software Companies
      3. Small Software Companies Projects: Smaller and Shorter
      4. Different Goals and Needs
      5. What to Do About the Dearth of Data?
      6. What to Do on a Tight Budget?
    22. Software analytics under the lamp post (or what Star Trek teaches us about the importance of asking the right questions)
      1. Abstract
      2. Prologue
      3. Learning from Data
      4. Which Bin is Mine?
      5. Epilogue
    23. What can go wrong in software engineering experiments?
      1. Abstract
      2. Operationalize Constructs
      3. Evaluate Different Design Alternatives
      4. Match Data Analysis and Experimental Design
      5. Do Not Rely on Statistical Significance Alone
      6. Do a Power Analysis
      7. Find Explanations for Results
      8. Follow Guidelines for Reporting Experiments
      9. Improving the Reliability of Experimental Results
    24. One size does not fit all
      1. Abstract
    25. While models are good, simple explanations are better
      1. Abstract
      2. Acknowledgments
      3. How Do We Compare a USB2 Driver to a USB3 Driver?
      4. The Issue With Our Initial Approach
      5. “Just Tell us What Is Different and Nothing More”
      6. Looking Back
      7. Users Prefer Simple Explanations
    26. The white-shirt effect: Learning from failed expectations
      1. Abstract
      2. A Story
      3. The Right Reaction
      4. Practical Advice
    27. Simpler questions can lead to better insights
      1. Abstract
      2. Introduction
      3. Context of the Software Analytics Project
      4. Providing Predictions on Buggy Changes
      5. How to Read the Graph?
      6. (Anti-)Patterns in the Error-Handling Graph
      7. How to Act on (Anti-)Patterns?
      8. Summary
    28. Continuously experiment to assess values early on
      1. Abstract
      2. Most Ideas Fail to Show Value
      3. Every Idea Can Be Tested With an Experiment
      4. How Do We Find Good Hypotheses and Conduct the Right Experiments?
      5. Key Takeaways
    29. Lies, damned lies, and analytics: Why big data needs thick data
      1. Abstract
      2. How Great It Is, to Have Data Like You
      3. Looking for Answers in All the Wrong Places
      4. Beware the Reality Distortion Field
      5. Build It and They Will Come, but Should We?
      6. To Classify Is Human, but Analytics Relies on Algorithms
      7. Lean in: How Ethnography Can Improve Software Analytics and Vice Versa
      8. Finding the Ethnographer Within
    30. The world is your test suite
      1. Abstract
      2. Watch the World and Learn
      3. Crashes, Hangs, and Bluescreens
      4. The Need for Speed
      5. Protecting Data and Identity
      6. Discovering Confusion and Missing Requirements
      7. Monitoring Is Mandatory