Data Analysis with Open Source Tools

Book description

Real World Data Analysis shows you how to think about data and the results you want to achieve with it. Author Philipp Janert teaches you how to approach data analysis problems and extract all the information available from your data. But this book isn't just about academic topics: it's the only book on data that stresses the seat-of-the-pants knowledge that leads you to the right approach in the first place. There are lots of people who can apply a formula. Janert shows you how to look at a result and know whether it's meaningful.

Table of Contents

  1. Data Analysis with Open Source Tools
  2. Dedication
  3. SPECIAL OFFER: Upgrade this ebook with O’Reilly
  4. A Note Regarding Supplemental Files
  5. Preface
    1. Before We Begin
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  6. 1. Introduction
    1. Data Analysis
    2. What’s in This Book
    3. What’s with the Workshops?
    4. What’s with the Math?
    5. What You’ll Need
    6. What’s Missing
  7. I. Graphics: Looking at Data
    1. 2. A Single Variable: Shape and Distribution
      1. Dot and Jitter Plots
      2. Histograms and Kernel Density Estimates
        1. Histograms
        2. Kernel Density Estimates
        3. Optional: Optimal Bandwidth Selection
      3. The Cumulative Distribution Function
        1. Optional: Comparing Distributions with Probability Plots and QQ Plots
      4. Rank-Order Plots and Lift Charts
      5. Only When Appropriate: Summary Statistics and Box Plots
        1. Summary Statistics
        2. Box-and-Whisker Plots
      6. Workshop: NumPy
        1. NumPy in Action
        2. NumPy in Detail
      7. Further Reading
    2. 3. Two Variables: Establishing Relationships
      1. Scatter Plots
      2. Conquering Noise: Smoothing
        1. Splines
        2. LOESS
        3. Examples
        4. Residuals
        5. Additional Ideas and Warnings
      3. Logarithmic Plots
      4. Banking
      5. Linear Regression and All That
      6. Showing What’s Important
      7. Graphical Analysis and Presentation Graphics
      8. Workshop: matplotlib
        1. Using matplotlib Interactively
        2. Case Study: LOESS with matplotlib
        3. Managing Properties
        4. The matplotlib Object Model and Architecture
        5. Odds and Ends
      9. Further Reading
    3. 4. Time As a Variable: Time-Series Analysis
      1. Examples
      2. The Task
        1. Requirements and the Real World
      3. Smoothing
        1. Running Averages
        2. Exponential Smoothing
      4. Don’t Overlook the Obvious!
      5. The Correlation Function
        1. Examples
        2. Implementation Issues
      6. Optional: Filters and Convolutions
      7. Workshop: scipy.signal
      8. Further Reading
    4. 5. More Than Two Variables: Graphical Multivariate Analysis
      1. False-Color Plots
      2. A Lot at a Glance: Multiplots
        1. The Scatter-Plot Matrix
        2. The Co-Plot
        3. Variations
      3. Composition Problems
        1. Changes in Composition
        2. Multidimensional Composition: Tree and Mosaic Plots
      4. Novel Plot Types
        1. Glyphs
        2. Parallel Coordinate Plots
      5. Interactive Explorations
        1. Querying and Zooming
        2. Linking and Brushing
        3. Grand Tours and Projection Pursuits
        4. Tools
      6. Workshop: Tools for Multivariate Graphics
        1. R
        2. Experimental Tools
        3. Python Chaco Library
      7. Further Reading
    5. 6. Intermezzo: A Data Analysis Session
      1. A Data Analysis Session
      2. Workshop: gnuplot
      3. Further Reading
  8. II. Analytics: Modeling Data
    1. 7. Guesstimation and the Back of the Envelope
      1. Principles of Guesstimation
        1. Estimating Sizes
        2. Establishing Relationships
        3. Working with Numbers
          1. Powers of ten
          2. Small perturbations
          3. Logarithms
        4. More Examples
        5. Things I Know
      2. How Good Are Those Numbers?
        1. Before You Get Started: Feasibility and Cost
        2. After You Finish: Quoting and Displaying Numbers
      3. Optional: A Closer Look at Perturbation Theory and Error Propagation
        1. Error Propagation
      4. Workshop: The Gnu Scientific Library (GSL)
      5. Further Reading
    2. 8. Models from Scaling Arguments
      1. Models
        1. Modeling
        2. Using and Misusing Models
      2. Arguments from Scale
        1. Scaling Arguments
        2. Example: A Dimensional Argument
        3. Example: An Optimization Problem
        4. Example: A Cost Model
        5. Optional: Scaling Arguments Versus Dimensional Analysis
        6. Other Arguments
      3. Mean-Field Approximations
        1. Background and Further Examples
      4. Common Time-Evolution Scenarios
        1. Unconstrained Growth and Decay Phenomena
        2. Constrained Growth: The Logistic Equation
        3. Oscillations
      5. Case Study: How Many Servers Are Best?
      6. Why Modeling?
      7. Workshop: Sage
      8. Further Reading
    3. 9. Arguments from Probability Models
      1. The Binomial Distribution and Bernoulli Trials
        1. Exact Results
        2. Using Bernoulli Trials to Develop Mean-Field Models
      2. The Gaussian Distribution and the Central Limit Theorem
        1. The Central Limit Theorem
        2. The Central Term and the Tails
        3. Why Is the Gaussian so Useful?
        4. Optional: Gaussian Integrals
        5. Beware: The World Is Not Normal!
      3. Power-Law Distributions and Non-Normal Statistics
        1. Working with Power-Law Distributions
        2. Optional: Distributions with Infinite Expectation Values
        3. Where to Go from Here
      4. Other Distributions
        1. Geometric Distribution
        2. Poisson Distribution
        3. Log-Normal Distribution
        4. Special-Purpose Distributions
      5. Optional: Case Study—Unique Visitors over Time
      6. Workshop: Power-Law Distributions
      7. Further Reading
    4. 10. What You Really Need to Know About Classical Statistics
      1. Genesis
      2. Statistics Defined
      3. Statistics Explained
        1. Example: Formal Tests Versus Graphical Methods
      4. Controlled Experiments Versus Observational Studies
        1. Design of Experiments
        2. Perspective
      5. Optional: Bayesian Statistics—The Other Point of View
        1. The Frequentist Interpretation of Probability
        2. The Bayesian Interpretation of Probability
        3. Bayesian Data Analysis: A Worked Example
        4. Bayesian Inference: Summary and Discussion
      6. Workshop: R
      7. Further Reading
    5. 11. Intermezzo: Mythbusting—Bigfoot, Least Squares, and All That
      1. How to Average Averages
        1. Simpson’s Paradox
      2. The Standard Deviation
        1. How to Calculate
        2. Optional: One over What?
        3. Optional: The Standard Error
      3. Least Squares
        1. Statistical Parameter Estimation
        2. Function Approximation
      4. Further Reading
  9. III. Computation: Mining Data
    1. 12. Simulations
      1. A Warm-Up Question
      2. Monte Carlo Simulations
        1. Combinatorial Problems
        2. Obtaining Outcome Distributions
        3. Pro and Con
      3. Resampling Methods
        1. The Bootstrap
        2. When Does Bootstrapping Work?
        3. Bootstrap Variants
      4. Workshop: Discrete Event Simulations with SimPy
        1. Introducing SimPy
        2. The Simplest Queueing Process
        3. Optional: Queueing Theory
        4. Running SimPy Simulations
        5. Summary
      5. Further Reading
    2. 13. Finding Clusters
      1. What Constitutes a Cluster?
        1. A Different Point of View
      2. Distance and Similarity Measures
        1. Common Distance and Similarity Measures
          1. Numerical data
          2. Categorical data
          3. String data
          4. Special-purpose metrics
      3. Clustering Methods
        1. Center Seekers
        2. Tree Builders
        3. Neighborhood Growers
      4. Pre- and Postprocessing
        1. Scale Normalization
        2. Cluster Properties and Evaluation
      5. Other Thoughts
      6. A Special Case: Market Basket Analysis
      7. A Word of Warning
      8. Workshop: Pycluster and the C Clustering Library
      9. Further Reading
    3. 14. Seeing the Forest for the Trees: Finding Important Attributes
      1. Principal Component Analysis
        1. Motivation
        2. Optional: Theory
        3. Interpretation
        4. Computation
        5. Practical Points
          1. Biplots
      2. Visual Techniques
        1. Multidimensional Scaling
        2. Network Graphs
      3. Kohonen Maps
      4. Workshop: PCA with R
      5. Further Reading
        1. Linear Algebra
    4. 15. Intermezzo: When More Is Different
      1. A Horror Story
      2. Some Suggestions
      3. What About Map/Reduce?
      4. Workshop: Generating Permutations
      5. Further Reading
  10. IV. Applications: Using Data
    1. 16. Reporting, Business Intelligence, and Dashboards
      1. Business Intelligence
        1. Reporting
      2. Corporate Metrics and Dashboards
        1. Recommendations for a Metrics Program
      3. Data Quality Issues
        1. Data Availability
        2. Data Consistency
      4. Workshop: Berkeley DB and SQLite
        1. Berkeley DB
        2. SQLite
      5. Further Reading
    2. 17. Financial Calculations and Modeling
      1. The Time Value of Money
        1. A Single Payment: Future and Present Value
        2. Multiple Payments: Compounding
        3. Calculational Tricks with Compounding
        4. The Whole Picture: Cash-Flow Analysis and Net Present Value
      2. Uncertainty in Planning and Opportunity Costs
        1. Using Expectation Values to Account for Uncertainty
        2. Opportunity Costs
      3. Cost Concepts and Depreciation
        1. Direct and Indirect Costs
        2. Fixed and Variable Costs
        3. Capital Expenditure and Operating Cost
      4. Should You Care?
      5. Is This All That Matters?
      6. Workshop: The Newsvendor Problem
        1. Optional: Exact Solution
      7. Further Reading
        1. The Newsvendor Problem
    3. 18. Predictive Analytics
      1. Topics in Predictive Analytics
      2. Some Classification Terminology
      3. Algorithms for Classification
        1. Instance-Based Classifiers and Nearest-Neighbor Methods
        2. Bayesian Classifiers
        3. Regression
        4. Support Vector Machines
        5. Decision Trees and Rule-Based Classifiers
        6. Other Classifiers
      4. The Process
        1. Ensemble Methods: Bagging and Boosting
        2. Estimating Prediction Error
        3. Class Imbalance Problems
      5. The Secret Sauce
      6. The Nature of Statistical Learning
      7. Workshop: Two Do-It-Yourself Classifiers
      8. Further Reading
    4. 19. Epilogue: Facts Are Not Reality
  11. A. Programming Environments for Scientific Computation and Data Analysis
    1. Software Tools
      1. Scientific Software Is Different
    2. A Catalog of Scientific Software
      1. Matlab
      2. R
      3. Python
        1. NumPy/SciPy
      4. What About Java?
      5. Other Players
      6. Recommendations
    3. Writing Your Own
    4. Further Reading
      1. Matlab
      2. R
      3. NumPy/SciPy
  12. B. Results from Calculus
    1. Common Functions
      1. Powers
      2. Polynomials and Rational Functions
      3. Exponential Function and Logarithm
      4. Trigonometric Functions
      5. Gaussian Function and the Normal Distribution
      6. Other Functions
      7. The Inverse of a Function
    2. Calculus
      1. Derivatives
      2. Finding Minima and Maxima
      3. Integrals
      4. Limits, Sequences, and Series
      5. Power Series and Taylor Expansion
    3. Useful Tricks
      1. The Binomial Theorem
      2. The Linear Transformation
      3. Dividing by Zero
    4. Notation and Basic Math
      1. On Reading Formulas
      2. Elementary Algebra
      3. Working with Fractions
      4. Sets, Sequences, and Series
      5. Special Symbols
        1. Binary relationships
        2. Parentheses and other delimiters
        3. Miscellaneous symbols
      6. The Greek Alphabet
    5. Where to Go from Here
      1. On Math
    6. Further Reading
      1. Calculus
      2. Linear Algebra
      3. Complex Analysis
      4. Mindbenders
  13. C. Working with Data
    1. Sources for Data
    2. Cleaning and Conditioning
    3. Sampling
    4. Data File Formats
    5. The Care and Feeding of Your Data Zoo
    6. Skills
    7. Terminology
      1. Types of Data
      2. The Data Type Depends on the Semantics
      3. Types of Data Sets
    8. Further Reading
      1. Data Set Repositories
  14. D. About the Author
  15. Index
  16. About the Author
  17. Colophon
  18. Copyright