Chapter 15. Intermezzo: When More Is Different

When dealing with some of the more computationally intensive data analysis or mining algorithms, you may encounter an unexpected obstacle: the brick wall. Programs or algorithms that seemed to work just fine turn out not to work once in production. And I don’t mean that they run more slowly than expected. I mean they do not work at all!

Of course, performance and scalability problems are familiar to most enterprise developers. However, the kinds of problems that arise in data-centric or computationally intensive applications are different, and most enterprise programmers (and, in fact, most computer science graduates) are badly prepared for them.

Let’s try an example: Table 15-1 shows the time required to perform 10 matrix multiplications for square matrices of various sizes. (The details of matrix multiplication don’t concern us here; suffice it to say that it’s the basic operation in almost all problems involving matrices and is at the heart of operator decomposition problems, including the principal component analysis introduced in Chapter 14.)

Table 15-1. Time required to perform 10 matrix multiplications for square matrices of different sizes

Size n    Time [seconds]
100       0.00
200       0.06
500       2.12
1,000     22.44
2,000     176.22
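For readers who want to collect a similar table on their own machine, here is a minimal sketch of a timing harness. It is not the code behind Table 15-1: the text does not say which language or library was used, so the pure-Python schoolbook multiplication, the function names `matmul` and `time_multiplications`, and the small test sizes are all assumptions made for illustration. Absolute timings will differ from the table.

```python
import time


def matmul(a, b):
    """Schoolbook product of two square matrices given as lists of rows."""
    # Transpose b once so each inner product walks two plain sequences.
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def time_multiplications(n, repeats=10):
    """Return the seconds needed for `repeats` products of n-by-n matrices."""
    # Arbitrary, deterministic test data (an assumption; any values would do).
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    for _ in range(repeats):
        matmul(a, b)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small because pure Python is slow; raise them with care.
    for n in (20, 40, 80):
        print(f"{n:6d} {time_multiplications(n):10.2f}")
```

Running the script prints one row per matrix size in the same two-column layout as the table, which makes it easy to compare how the measured time grows as n increases.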

Would you agree that the data in Table 15-1 does not look too threatening? For a 2,000 × 2,000 matrix, the time required is a shade under three minutes. How long might it take to perform the same operation for a 10,000 × 10,000 matrix? Five, ...
