You are previewing Data Analysis with Open Source Tools.

# Chapter 15. Intermezzo: When More Is Different

When dealing with some of the more computationally intensive data analysis or mining algorithms, you may encounter an unexpected obstacle: the brick wall. Programs or algorithms that seemed to work just fine turn out not to work once in production. And I don’t mean that they work slower than expected. I mean they do not work at all!

Of course, performance and scalability problems are familiar to most enterprise developers. However, the kinds of problems that arise in data-centric or computationally intensive applications are different, and most enterprise programmers (and, in fact, most computer science graduates) are badly prepared for them.

Let’s try an example: Table 15-1 shows the time required to perform 10 matrix multiplications for square matrices of various sizes. (The details of matrix multiplication don’t concern us here; suffice it to say that it’s the basic operation in almost all problems involving matrices and is at the heart of operator decomposition problems, including the principal component analysis introduced in Chapter 14.)

Table 15-1. Time required to perform 10 matrix multiplications for square matrices of different sizes

| Size n | Time [seconds] |
| --- | --- |
| 100 | 0.00 |
| 200 | 0.06 |
| 500 | 2.12 |
| 1,000 | 22.44 |
| 2,000 | 176.22 |
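
A measurement of the kind reported in Table 15-1 could be set up roughly as in the following sketch. The text does not say which program or library produced the numbers in the table, so everything here is an assumption made for illustration: the helper names (`random_matrix`, `matmul`, `time_multiplications`) are hypothetical, the matrices are filled with random values, and the product is the plain textbook row-times-column method in pure Python. Absolute timings will therefore differ from the table (pure Python is slow, so the demonstration loop uses much smaller sizes); only the way the time grows with n is of interest.

```python
import random
import time


def random_matrix(n, rng):
    """Return an n x n matrix (a list of rows) of random floats in [0, 1)."""
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Textbook product of two square matrices: each entry is a row-times-column sum."""
    # Transpose b once so that every column is available as a tuple.
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def time_multiplications(n, repeats=10, seed=0):
    """Return the wall-clock seconds needed for `repeats` products of n x n matrices."""
    rng = random.Random(seed)
    a = random_matrix(n, rng)
    b = random_matrix(n, rng)
    start = time.perf_counter()
    for _ in range(repeats):
        matmul(a, b)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small on purpose; this is an illustrative assumption,
    # not the set of sizes used for Table 15-1.
    for n in (20, 40, 80):
        print(f"n = {n:4d}: {time_multiplications(n):.2f} s")
```

Timing only the loop over the products, and not the construction of the matrices, keeps the measurement focused on the multiplication itself.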

Would you agree that the data in Table 15-1 does not look too threatening? For a 2,000 × 2,000 matrix, the time required is a shade under three minutes. How long might it take to perform the same operation for a 10,000 × 10,000 matrix? Five, ...