Data Analysis with Open Source Tools

Book description

Collecting data is relatively easy, but turning raw information into something useful requires that you know how to extract precisely what you need. With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You'll learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and then feed your understanding back into the organization through business plans, metrics dashboards, and other applications.

Along the way, you'll experiment with concepts through hands-on workshops at the end of each chapter. Above all, you'll learn how to think about the results you want to achieve -- rather than rely on tools to think for you.

  • Use graphics to describe data with one, two, or dozens of variables
  • Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
  • Mine data with computationally intensive methods such as simulation and clustering
  • Make your conclusions understandable through reports, dashboards, and other metrics programs
  • Understand financial calculations, including the time value of money
  • Use dimensionality reduction techniques or predictive analytics to conquer challenging data analysis situations
  • Become familiar with different open source programming environments for data analysis

"Finally, a concise reference for understanding how to conquer piles of data."--Austin King, Senior Web Developer, Mozilla

"An indispensable text for aspiring data scientists."--Michael E. Driscoll, CEO/Founder, Dataspora

Table of contents

  1. Dedication
  2. A Note Regarding Supplemental Files
  3. Preface
    1. Before We Begin
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  4. 1. Introduction
    1. Data Analysis
    2. What’s in This Book
    3. What’s with the Workshops?
    4. What’s with the Math?
    5. What You’ll Need
    6. What’s Missing
  5. I. Graphics: Looking at Data
    1. 2. A Single Variable: Shape and Distribution
      1. Dot and Jitter Plots
      2. Histograms and Kernel Density Estimates
        1. Histograms
        2. Kernel Density Estimates
        3. Optional: Optimal Bandwidth Selection
      3. The Cumulative Distribution Function
        1. Optional: Comparing Distributions with Probability Plots and QQ Plots
      4. Rank-Order Plots and Lift Charts
      5. Only When Appropriate: Summary Statistics and Box Plots
        1. Summary Statistics
        2. Box-and-Whisker Plots
      6. Workshop: NumPy
        1. NumPy in Action
        2. NumPy in Detail
      7. Further Reading
    2. 3. Two Variables: Establishing Relationships
      1. Scatter Plots
      2. Conquering Noise: Smoothing
        1. Splines
        2. LOESS
        3. Examples
        4. Residuals
        5. Additional Ideas and Warnings
      3. Logarithmic Plots
      4. Banking
      5. Linear Regression and All That
      6. Showing What’s Important
      7. Graphical Analysis and Presentation Graphics
      8. Workshop: matplotlib
        1. Using matplotlib Interactively
        2. Case Study: LOESS with matplotlib
        3. Managing Properties
        4. The matplotlib Object Model and Architecture
        5. Odds and Ends
      9. Further Reading
    3. 4. Time As a Variable: Time-Series Analysis
      1. Examples
      2. The Task
        1. Requirements and the Real World
      3. Smoothing
        1. Running Averages
        2. Exponential Smoothing
      4. Don’t Overlook the Obvious!
      5. The Correlation Function
        1. Examples
        2. Implementation Issues
      6. Optional: Filters and Convolutions
      7. Workshop: scipy.signal
      8. Further Reading
    4. 5. More Than Two Variables: Graphical Multivariate Analysis
      1. False-Color Plots
      2. A Lot at a Glance: Multiplots
        1. The Scatter-Plot Matrix
        2. The Co-Plot
        3. Variations
      3. Composition Problems
        1. Changes in Composition
        2. Multidimensional Composition: Tree and Mosaic Plots
      4. Novel Plot Types
        1. Glyphs
        2. Parallel Coordinate Plots
      5. Interactive Explorations
        1. Querying and Zooming
        2. Linking and Brushing
        3. Grand Tours and Projection Pursuits
        4. Tools
      6. Workshop: Tools for Multivariate Graphics
        1. R
        2. Experimental Tools
        3. Python Chaco Library
      7. Further Reading
    5. 6. Intermezzo: A Data Analysis Session
      1. A Data Analysis Session
      2. Workshop: gnuplot
      3. Further Reading
  6. II. Analytics: Modeling Data
    1. 7. Guesstimation and the Back of the Envelope
      1. Principles of Guesstimation
        1. Estimating Sizes
        2. Establishing Relationships
        3. Working with Numbers
          1. Powers of ten
          2. Small perturbations
          3. Logarithms
        4. More Examples
        5. Things I Know
      2. How Good Are Those Numbers?
        1. Before You Get Started: Feasibility and Cost
        2. After You Finish: Quoting and Displaying Numbers
      3. Optional: A Closer Look at Perturbation Theory and Error Propagation
        1. Error Propagation
      4. Workshop: The Gnu Scientific Library (GSL)
      5. Further Reading
    2. 8. Models from Scaling Arguments
      1. Models
        1. Modeling
        2. Using and Misusing Models
      2. Arguments from Scale
        1. Scaling Arguments
        2. Example: A Dimensional Argument
        3. Example: An Optimization Problem
        4. Example: A Cost Model
        5. Optional: Scaling Arguments Versus Dimensional Analysis
        6. Other Arguments
      3. Mean-Field Approximations
        1. Background and Further Examples
      4. Common Time-Evolution Scenarios
        1. Unconstrained Growth and Decay Phenomena
        2. Constrained Growth: The Logistic Equation
        3. Oscillations
      5. Case Study: How Many Servers Are Best?
      6. Why Modeling?
      7. Workshop: Sage
      8. Further Reading
    3. 9. Arguments from Probability Models
      1. The Binomial Distribution and Bernoulli Trials
        1. Exact Results
        2. Using Bernoulli Trials to Develop Mean-Field Models
      2. The Gaussian Distribution and the Central Limit Theorem
        1. The Central Limit Theorem
        2. The Central Term and the Tails
        3. Why Is the Gaussian so Useful?
        4. Optional: Gaussian Integrals
        5. Beware: The World Is Not Normal!
      3. Power-Law Distributions and Non-Normal Statistics
        1. Working with Power-Law Distributions
        2. Optional: Distributions with Infinite Expectation Values
        3. Where to Go from Here
      4. Other Distributions
        1. Geometric Distribution
        2. Poisson Distribution
        3. Log-Normal Distribution
        4. Special-Purpose Distributions
      5. Optional: Case Study—Unique Visitors over Time
      6. Workshop: Power-Law Distributions
      7. Further Reading
    4. 10. What You Really Need to Know About Classical Statistics
      1. Genesis
      2. Statistics Defined
      3. Statistics Explained
        1. Example: Formal Tests Versus Graphical Methods
      4. Controlled Experiments Versus Observational Studies
        1. Design of Experiments
        2. Perspective
      5. Optional: Bayesian Statistics—The Other Point of View
        1. The Frequentist Interpretation of Probability
        2. The Bayesian Interpretation of Probability
        3. Bayesian Data Analysis: A Worked Example
        4. Bayesian Inference: Summary and Discussion
      6. Workshop: R
      7. Further Reading
    5. 11. Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
      1. How to Average Averages
        1. Simpson’s Paradox
      2. The Standard Deviation
        1. How to Calculate
        2. Optional: One over What?
        3. Optional: The Standard Error
      3. Least Squares
        1. Statistical Parameter Estimation
        2. Function Approximation
      4. Further Reading
  7. III. Computation: Mining Data
    1. 12. Simulations
      1. A Warm-Up Question
      2. Monte Carlo Simulations
        1. Combinatorial Problems
        2. Obtaining Outcome Distributions
        3. Pro and Con
      3. Resampling Methods
        1. The Bootstrap
        2. When Does Bootstrapping Work?
        3. Bootstrap Variants
      4. Workshop: Discrete Event Simulations with SimPy
        1. Introducing SimPy
        2. The Simplest Queueing Process
        3. Optional: Queueing Theory
        4. Running SimPy Simulations
        5. Summary
      5. Further Reading
    2. 13. Finding Clusters
      1. What Constitutes a Cluster?
        1. A Different Point of View
      2. Distance and Similarity Measures
        1. Common Distance and Similarity Measures
          1. Numerical data
          2. Categorical data
          3. String data
          4. Special-purpose metrics
      3. Clustering Methods
        1. Center Seekers
        2. Tree Builders
        3. Neighborhood Growers
      4. Pre- and Postprocessing
        1. Scale Normalization
        2. Cluster Properties and Evaluation
      5. Other Thoughts
      6. A Special Case: Market Basket Analysis
      7. A Word of Warning
      8. Workshop: Pycluster and the C Clustering Library
      9. Further Reading
    3. 14. Seeing the Forest for the Trees: Finding Important Attributes
      1. Principal Component Analysis
        1. Motivation
        2. Optional: Theory
        3. Interpretation
        4. Computation
        5. Practical Points
          1. Biplots
      2. Visual Techniques
        1. Multidimensional Scaling
        2. Network Graphs
      3. Kohonen Maps
      4. Workshop: PCA with R
      5. Further Reading
        1. Linear Algebra
    4. 15. Intermezzo: When More Is Different
      1. A Horror Story
      2. Some Suggestions
      3. What About Map/Reduce?
      4. Workshop: Generating Permutations
      5. Further Reading
  8. IV. Applications: Using Data
    1. 16. Reporting, Business Intelligence, and Dashboards
      1. Business Intelligence
        1. Reporting
      2. Corporate Metrics and Dashboards
        1. Recommendations for a Metrics Program
      3. Data Quality Issues
        1. Data Availability
        2. Data Consistency
      4. Workshop: Berkeley DB and SQLite
        1. Berkeley DB
        2. SQLite
      5. Further Reading
    2. 17. Financial Calculations and Modeling
      1. The Time Value of Money
        1. A Single Payment: Future and Present Value
        2. Multiple Payments: Compounding
        3. Calculational Tricks with Compounding
        4. The Whole Picture: Cash-Flow Analysis and Net Present Value
      2. Uncertainty in Planning and Opportunity Costs
        1. Using Expectation Values to Account for Uncertainty
        2. Opportunity Costs
      3. Cost Concepts and Depreciation
        1. Direct and Indirect Costs
        2. Fixed and Variable Costs
        3. Capital Expenditure and Operating Cost
      4. Should You Care?
      5. Is This All That Matters?
      6. Workshop: The Newsvendor Problem
        1. Optional: Exact Solution
      7. Further Reading
        1. The Newsvendor Problem
    3. 18. Predictive Analytics
      1. Topics in Predictive Analytics
      2. Some Classification Terminology
      3. Algorithms for Classification
        1. Instance-Based Classifiers and Nearest-Neighbor Methods
        2. Bayesian Classifiers
        3. Regression
        4. Support Vector Machines
        5. Decision Trees and Rule-Based Classifiers
        6. Other Classifiers
      4. The Process
        1. Ensemble Methods: Bagging and Boosting
        2. Estimating Prediction Error
        3. Class Imbalance Problems
      5. The Secret Sauce
      6. The Nature of Statistical Learning
      7. Workshop: Two Do-It-Yourself Classifiers
      8. Further Reading
    4. 19. Epilogue: Facts Are Not Reality
  9. A. Programming Environments for Scientific Computation and Data Analysis
    1. Software Tools
      1. Scientific Software Is Different
    2. A Catalog of Scientific Software
      1. Matlab
      2. R
      3. Python
        1. NumPy/SciPy
      4. What About Java?
      5. Other Players
      6. Recommendations
    3. Writing Your Own
    4. Further Reading
      1. Matlab
      2. R
      3. NumPy/SciPy
  10. B. Results from Calculus
    1. Common Functions
      1. Powers
      2. Polynomials and Rational Functions
      3. Exponential Function and Logarithm
      4. Trigonometric Functions
      5. Gaussian Function and the Normal Distribution
      6. Other Functions
      7. The Inverse of a Function
    2. Calculus
      1. Derivatives
      2. Finding Minima and Maxima
      3. Integrals
      4. Limits, Sequences, and Series
      5. Power Series and Taylor Expansion
    3. Useful Tricks
      1. The Binomial Theorem
      2. The Linear Transformation
      3. Dividing by Zero
    4. Notation and Basic Math
      1. On Reading Formulas
      2. Elementary Algebra
      3. Working with Fractions
      4. Sets, Sequences, and Series
      5. Special Symbols
        1. Binary relationships
        2. Parentheses and other delimiters
        3. Miscellaneous symbols
      6. The Greek Alphabet
    5. Where to Go from Here
      1. On Math
    6. Further Reading
      1. Calculus
      2. Linear Algebra
      3. Complex Analysis
      4. Mindbenders
  11. C. Working with Data
    1. Sources for Data
    2. Cleaning and Conditioning
    3. Sampling
    4. Data File Formats
    5. The Care and Feeding of Your Data Zoo
    6. Skills
    7. Terminology
      1. Types of Data
      2. The Data Type Depends on the Semantics
      3. Types of Data Sets
    8. Further Reading
      1. Data Set Repositories
  12. D. About the Author
  13. Index
  14. About the Author
  15. Colophon
  16. Copyright

Product information

  • Title: Data Analysis with Open Source Tools
  • Author(s): Philipp K. Janert
  • Release date: November 2010
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9780596802356