Guerrilla Analytics

Book Description

Doing data science is difficult. Projects are typically very dynamic, with requirements that change as understanding of the data grows. The data itself arrives piecemeal, is added to and replaced, contains undiscovered flaws, and comes from a variety of sources. Teams also have mixed skill sets, and tooling is often limited. Despite these disruptions, a data science team must get off the ground fast and begin demonstrating value with traceable, tested work products. This is when you need Guerrilla Analytics.

In this book, you will learn about:

  • The Guerrilla Analytics Principles: simple rules of thumb for maintaining data provenance across the entire analytics life cycle from data extraction, through analysis to reporting. 
  • Reproducible, traceable analytics: how to design and implement work products that are reproducible, testable and stand up to external scrutiny.
  • Practice tips and war stories: 90 practice tips and 16 war stories based on real-world project challenges encountered in consulting, pre-sales and research.
  • Preparing for battle: how to set up your team's analytics environment in terms of tooling, skill sets, workflows and conventions.
  • Data gymnastics: over a dozen analytics patterns that your team will encounter again and again in projects.

Table of Contents

  1. Cover
  2. Title page
  3. Table of Contents
  4. Copyright Page
  5. Preface
  6. Part 1: Principles
    1. Chapter 1: Introducing Guerrilla Analytics
      1. Summary
      2. 1.1. What is data analytics?
      3. 1.2. Types of data analytics projects
      4. 1.3. Introducing Guerrilla Analytics projects
      5. 1.4. Guerrilla Analytics definition
      6. 1.5. Example Guerrilla Analytics projects
      7. 1.6. Some terminology
      8. 1.7. Wrap up
    2. Chapter 2: Guerrilla Analytics: Challenges and Risks
      1. Summary
      2. 2.1. The Guerrilla Analytics workflow
      3. 2.2. Challenges of managing analytics projects
      4. 2.3. Risks
      5. 2.4. Impact of failure to address analytics risks
      6. 2.5. Wrap up
    3. Chapter 3: Guerrilla Analytics Principles
      1. Summary
      2. 3.1. Maintain data provenance despite disruptions
      3. 3.2. The principles
      4. 3.3. Applying the principles
      5. 3.4. Wrap up
  7. Part 2: Practice
    1. Chapter 4: Stage 1: Data Extraction
      1. Summary
      2. 4.1. Guerrilla Analytics workflow
      3. 4.2. Pitfalls and risks
      4. 4.3. Practice tip 1: freeze the source system during data extraction
      5. 4.4. Practice tip 2: extract data into an agreed file format
      6. 4.5. Practice tip 3: calculate checksums before data extraction
      7. 4.6. Practice tip 4: capture front-end reports
      8. 4.7. Practice tip 5: save raw copies of web pages
      9. 4.8. Practice tip 6: consistency check OCR data
      10. 4.9. Wrap up
    2. Chapter 5: Stage 2: Data Receipt
      1. Summary
      2. 5.1. Guerrilla Analytics workflow
      3. 5.2. Pitfalls and risks
      4. 5.3. Practice tip 7: have a single location for all data received
      5. 5.4. Practice tip 8: create unique identifiers for received data
      6. 5.5. Practice tip 9: store data tracking information in a data log
      7. 5.6. Practice tip 10: never modify raw data files
      8. 5.7. Practice tip 11: keep supporting material near the data
      9. 5.8. Practice tip 12: version-control data received
      10. 5.9. Bringing it all together
      11. 5.10. Wrap up
    3. Chapter 6: Stage 3: Data Load
      1. Summary
      2. 6.1. Guerrilla Analytics Workflow
      3. 6.2. Pitfalls and risks
      4. 6.3. Practice tip 13: minimize modifications to data before load
      5. 6.4. Practice tip 14: do data load preparations on a copy of raw data files
      6. 6.5. Practice tip 15: add identifiers to raw data before loading
      7. 6.6. Practice tip 16: prefer one-to-one Data Loads
      8. 6.7. Practice tip 17: preserve the raw file name and data UID
      9. 6.8. Practice tip 18: load data as plain text
      10. 6.9. Common challenges
      11. 6.10. Wrap up
    4. Chapter 7: Stage 4: Analytics Coding for Ease of Review
      1. Summary
      2. 7.1. Guerrilla Analytics workflow
      3. 7.2. Pitfalls and risks
      4. 7.3. Practice tip 19: use one code file per data output
      5. 7.4. Practice tip 20: produce clearly identifiable data outputs
      6. 7.5. Practice tip 21: write code that runs from start to finish
      7. 7.6. Practice tip 22: favor code that is not embedded in proprietary file formats
      8. 7.7. Practice tip 23: clearly label the running order of code files
      9. 7.8. Practice tip 24: drop all datasets at the start of code execution
      10. 7.9. Practice tip 25: break up data flows into “data steps”
      11. 7.10. Practice tip 26: don’t jump in and out of a code file
      12. 7.11. Practice tip 27: log code execution
      13. 7.12. Common Challenges
      14. 7.13. Wrap up
    5. Chapter 8: Stage 4: Analytics Coding to Maintain Data Provenance
      1. Summary
      2. 8.1. Guerrilla Analytics workflow
      3. 8.2. Examples
      4. 8.3. Pitfalls and risks
      5. 8.4. Practice tip 28: clean data at a minimum of locations in a data flow
      6. 8.5. Practice tip 29: when cleaning a data field, keep the original raw field
      7. 8.6. Practice tip 30: filter data with flags, not deletions
      8. 8.7. Practice tip 31: identify fields with metadata
      9. 8.8. Practice tip 32: create a unique identifier for data records
      10. 8.9. Practice tip 33: rename data fields with a field mapping
      11. 8.10. Wrap up
    6. Chapter 9: Stage 6: Creating Work Products
      1. Summary
      2. 9.1. Guerrilla Analytics workflow
      3. 9.2. Examples
      4. 9.3. The essence of a work product
      5. 9.4. Pitfalls and risks
      6. 9.5. Practice tip 34: track work products with a Unique Identifier (UID)
      7. 9.6. Practice tip 35: keep work product generators and outputs close together
      8. 9.7. Practice tip 36: avoid clutter in the file system
      9. 9.8. Practice tip 37: avoid clutter in the DME
      10. 9.9. Practice tip 38: give output data records a UID
      11. 9.10. Practice tip 39: version control work products
      12. 9.11. Practice tip 40: use a convention to name complex outputs
      13. 9.12. Practice tip 41: log all Work Products
      14. 9.13. Wrap up
    7. Chapter 10: Stage 7: Reporting
      1. Summary
      2. 10.1. Guerrilla Analytics workflow
      3. 10.2. What is a report?
      4. 10.3. Why reports are complicated
      5. 10.4. Report components
      6. 10.5. Pitfalls and risks
      7. 10.6. Practice tip 42: liaise with report writers
      8. 10.7. Practice tip 43: create one work product per report component
      9. 10.8. Practice tip 44: make presentation quality work products
      10. 10.9. Extreme reporting
      11. 10.10. Wrap up
    8. Chapter 11: Stage 5: Consolidating Knowledge in Builds
      1. Summary
      2. 11.1. Introduction
      3. 11.2. Pitfalls and risks
      4. 11.3. Example: the customer address problem
      5. 11.4. Sources of variation
      6. 11.5. Definition of a build
      7. 11.6. The customer address example using a Build
      8. 11.7. Data Builds
      9. 11.8. Service Builds
      10. 11.9. When to start a build
      11. 11.10. Wrap up
  8. Part 3: Testing
    1. Chapter 12: Introduction to Testing
      1. Summary
      2. 12.1. Guerrilla Analytics workflow
      3. 12.2. What is testing?
      4. 12.3. Why do testing?
      5. 12.4. Areas of testing
      6. 12.5. Comparing expected and actual
      7. 12.6. The challenge of testing Guerrilla Analytics
      8. 12.7. Practice Tip 61: establish a testing culture
      9. 12.8. Practice Tip 62: test early
      10. 12.9. Practice Tip 63: test often
      11. 12.10. Practice Tip 64: give tests unique identifiers
      12. 12.11. Practice Tip 65: organize test data by test UID
      13. 12.12. Next chapters on testing
      14. 12.13. Wrap up
    2. Chapter 13: Testing Data
      1. Summary
      2. 13.1. Guerrilla Analytics workflow
      3. 13.2. The five C’s of testing data
      4. 13.3. Testing data completeness
      5. 13.4. Testing data correctness
      6. 13.5. Testing consistency
      7. 13.6. Testing data coherence
      8. 13.7. Testing accountability
      9. 13.8. Implementing data testing
      10. 13.9. Wrap up
    3. Chapter 14: Testing Builds
      1. Summary
      2. 14.1. Structure of a data build
      3. 14.2. An illustrative example
      4. 14.3. Types of build tests
      5. 14.4. Test code development
      6. 14.5. Organizing build test code
      7. 14.6. Organizing test data
      8. 14.7. Wrap up
    4. Chapter 15: Testing Work Products
      1. Summary
      2. 15.1. Types of testable work products
      3. 15.2. Ordinary work products
      4. 15.3. General tips on testing ordinary work products
      5. 15.4. Testing statistical models
      6. 15.5. General tips on testing models
      7. 15.6. Wrap up
  9. Part 4: Building Guerrilla Analytics Capability
    1. Introduction
    2. Chapter 16: People
      1. Summary
      2. 16.1. That question again – what is data analytics?
      3. 16.2. Guerrilla Analytics skills
      4. 16.3. Programming
      5. 16.4. Substantive expertise
      6. 16.5. Communication
      7. 16.6. “Maths and stats”
      8. 16.7. Visualization
      9. 16.8. Software engineering
      10. 16.9. Mindset
      11. 16.10. Wrap up
    3. Chapter 17: Process
      1. Summary
      2. 17.1. What is workflow management?
      3. 17.2. Workflows in Analytics
      4. 17.3. Levels of review
      5. 17.4. Linking work products
      6. 17.5. Classifying work products
      7. 17.6. Granularity
      8. 17.7. When to use workflow management
      9. 17.8. Wrap up
    4. Chapter 18: Technology
      1. Summary
      2. 18.1. Analytics capabilities
      3. 18.2. Data manipulation environment
      4. 18.3. Source code control
      5. 18.4. Access to the command line
      6. 18.5. High-level scripting language
      7. 18.6. Visualization
      8. 18.7. Build tool
      9. 18.8. Access to the internet
      10. 18.9. Encryption
      11. 18.10. Code libraries for data wrangling
      12. 18.11. Machine learning and statistics libraries
      13. 18.12. Centralized and controlled file system
      14. 18.13. Additional technology capabilities
      15. 18.14. Wrap up
    5. Chapter 19: Closing Remarks
      1. 19.1. What was this book about?
      2. 19.2. Next steps for Guerrilla Analytics
      3. 19.3. Keep in touch
      4. Acknowledgments
    6. Appendix: Data Gymnastics
    7. References
    8. Index