Book description
Doing data science is difficult. Projects are typically very dynamic, with requirements that change as understanding of the data grows. The data itself arrives piecemeal, is added to and replaced, contains undiscovered flaws, and comes from a variety of sources. Teams also have mixed skill sets, and tooling is often limited. Despite these disruptions, a data science team must get off the ground fast and begin demonstrating value with traceable, tested work products. This is when you need Guerrilla Analytics.
In this book, you will learn about:
- The Guerrilla Analytics Principles: simple rules of thumb for maintaining data provenance across the entire analytics life cycle from data extraction, through analysis to reporting
- Reproducible, traceable analytics: how to design and implement work products that are reproducible, testable and stand up to external scrutiny
- Practice tips and war stories: 90 practice tips and 16 war stories based on real-world project challenges encountered in consulting, pre-sales and research
- Preparing for battle: how to set up your team's analytics environment in terms of tooling, skill sets, workflows and conventions
- Data gymnastics: over a dozen analytics patterns that your team will encounter again and again in projects
Table of contents
- Cover
- Title page
- Table of Contents
- Copyright Page
- Preface
- Part 1: Principles
- Part 2: Practice
- Chapter 4: Stage 1: Data Extraction
- Summary
- 4.1. Guerrilla Analytics workflow
- 4.2. Pitfalls and risks
- 4.3. Practice tip 1: freeze the source system during data extraction
- 4.4. Practice tip 2: extract data into an agreed file format
- 4.5. Practice tip 3: calculate checksums before data extraction
- 4.6. Practice tip 4: capture front-end reports
- 4.7. Practice tip 5: save raw copies of web pages
- 4.8. Practice tip 6: consistency check OCR data
- 4.9. Wrap up
- Chapter 5: Stage 2: Data Receipt
- Summary
- 5.1. Guerrilla Analytics workflow
- 5.2. Pitfalls and risks
- 5.3. Practice tip 7: have a single location for all data received
- 5.4. Practice tip 8: create unique identifiers for received data
- 5.5. Practice tip 9: store data tracking information in a data log
- 5.6. Practice tip 10: never modify raw data files
- 5.7. Practice tip 11: keep supporting material near the data
- 5.8. Practice tip 12: version-control data received
- 5.9. Bringing it all together
- 5.10. Wrap up
- Chapter 6: Stage 3: Data Load
- Summary
- 6.1. Guerrilla Analytics Workflow
- 6.2. Pitfalls and risks
- 6.3. Practice tip 13: minimize modifications to data before load
- 6.4. Practice tip 14: do data load preparations on a copy of raw data files
- 6.5. Practice tip 15: add identifiers to raw data before loading
- 6.6. Practice tip 16: prefer one-to-one Data Loads
- 6.7. Practice tip 17: preserve the raw file name and data UID
- 6.8. Practice tip 18: load data as plain text
- 6.9. Common challenges
- 6.10. Wrap up
- Chapter 7: Stage 4: Analytics Coding for Ease of Review
- Summary
- 7.1. Guerrilla Analytics workflow
- 7.2. Pitfalls and risks
- 7.3. Practice tip 19: use one code file per data output
- 7.4. Practice tip 20: produce clearly identifiable data outputs
- 7.5. Practice tip 21: write code that runs from start to finish
- 7.6. Practice tip 22: favor code that is not embedded in proprietary file formats
- 7.7. Practice tip 23: clearly label the running order of code files
- 7.8. Practice tip 24: drop all datasets at the start of code execution
- 7.9. Practice tip 25: break up data flows into “data steps”
- 7.10. Practice tip 26: don’t jump in and out of a code file
- 7.11. Practice tip 27: log code execution
- 7.12. Common Challenges
- 7.13. Wrap up
- Chapter 8: Stage 4: Analytics Coding to Maintain Data Provenance
- Summary
- 8.1. Guerrilla Analytics workflow
- 8.2. Examples
- 8.3. Pitfalls and risks
- 8.4. Practice tip 28: clean data at a minimum of locations in a data flow
- 8.5. Practice tip 29: when cleaning a data field, keep the original raw field
- 8.6. Practice tip 30: filter data with flags, not deletions
- 8.7. Practice tip 31: identify fields with metadata
- 8.8. Practice tip 32: create a unique identifier for DATA records
- 8.9. Practice tip 33: rename data fields with a field mapping
- 8.10. Wrap up
- Chapter 9: Stage 6: Creating Work Products
- Summary
- 9.1. Guerrilla Analytics workflow
- 9.2. Examples
- 9.3. The essence of a work product
- 9.4. Pitfalls and risks
- 9.5. Practice tip 34: track work products with a Unique Identifier (UID)
- 9.6. Practice tip 35: keep work product generators and outputs close together
- 9.7. Practice tip 36: avoid clutter in the file system
- 9.8. Practice tip 37: avoid clutter in the DME
- 9.9. Practice tip 38: give output data records a UID
- 9.10. Practice tip 39: version control work products
- 9.11. Practice tip 40: use a convention to name complex outputs
- 9.12. Practice tip 41: log all Work Products
- 9.13. Wrap up
- Chapter 10: Stage 7: Reporting
- Summary
- 10.1. Guerrilla Analytics workflow
- 10.2. What is a report?
- 10.3. Why reports are complicated
- 10.4. Report components
- 10.5. Pitfalls and risks
- 10.6. Practice tip 42: liaise with report writers
- 10.7. Practice tip 43: create one work product per report component
- 10.8. Practice tip 44: make presentation quality work products
- 10.9. Extreme reporting
- 10.10. Wrap up
- Chapter 11: Stage 5: Consolidating Knowledge in Builds
- Part 3: Testing
- Chapter 12: Introduction to Testing
- Summary
- 12.1. Guerrilla Analytics workflow
- 12.2. What is testing?
- 12.3. Why do testing?
- 12.4. Areas of testing
- 12.5. Comparing expected and actual
- 12.6. The challenge of testing Guerrilla Analytics
- 12.7. Practice Tip 61: establish a testing culture
- 12.8. Practice Tip 62: test early
- 12.9. Practice Tip 63: test often
- 12.10. Practice Tip 64: give tests unique identifiers
- 12.11. Practice Tip 65: organize test data by test UID
- 12.12. Next chapters on testing
- 12.13. Wrap up
- Chapter 13: Testing Data
- Chapter 14: Testing Builds
- Chapter 15: Testing Work Products
- Part 4: Building Guerrilla Analytics Capability
- Introduction
- Chapter 16: People
- Chapter 17: Process
- Chapter 18: Technology
- Summary
- 18.1. Analytics capabilities
- 18.2. Data manipulation environment
- 18.3. Source code control
- 18.4. Access to the command line
- 18.5. High-level scripting language
- 18.6. Visualization
- 18.7. Build tool
- 18.8. Access to the internet
- 18.9. Encryption
- 18.10. Code libraries for data wrangling
- 18.11. Machine learning and statistics libraries
- 18.12. Centralized and controlled file system
- 18.13. Additional technology capabilities
- 18.14. Wrap up
- Chapter 19: Closing Remarks
- Appendix: Data Gymnastics
- References
- Index
Product information
- Title: Guerrilla Analytics
- Author(s):
- Release date: September 2014
- Publisher(s): Morgan Kaufmann
- ISBN: 9780128005033