Impediments to Connecting Data

Hopefully you're starting to be convinced that there are huge advantages to being able to easily integrate data from many different sources. But there are a few reasons people aren't doing it already.

The Representation Problem

Perhaps the most basic problem with attempting to connect data sets is that most data is stored in very inflexible structures. First of all, a surprising amount of important data in science and business is kept in Excel spreadsheets, which are stored locally on people's computers, inaccessible to others, and not designed for integration anyway.

Even in companies where databases are made accessible, data is classically stored in relational databases, most of which have predefined schemas to fit the data that was initially believed to be important. Figure 20-2 shows a simple example of a relational schema for restaurant data. A fixed schema works well for large, predictable data sets, because relational databases perform very well when properly configured, but it presents problems when the application requires new kinds of data, new fields, or new relationships to be added frequently.


Figure 20-2. A relational schema for restaurant data.
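To make the rigidity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the column names (including has_patio), and the sample rows are all hypothetical, invented for illustration; they are not the schema shown in Figure 20-2. The sketch shows that a row carrying a field the schema did not anticipate is rejected until the schema itself is altered.

```python
import sqlite3

# Hypothetical one-table schema; not the schema from Figure 20-2.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurant ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " cuisine TEXT,"
    " city TEXT)"
)
# Made-up sample row that fits the original schema.
conn.execute(
    "INSERT INTO restaurant (name, cuisine, city) VALUES (?, ?, ?)",
    ("Example Bistro", "French", "Springfield"),
)


def insert_with_patio(connection):
    """Try to store a row with a field the original schema never planned for.

    Returns True if the insert succeeds, False if the database rejects it.
    """
    try:
        connection.execute(
            "INSERT INTO restaurant (name, cuisine, city, has_patio)"
            " VALUES (?, ?, ?, ?)",
            ("Sample Cantina", "Mexican", "Springfield", 1),
        )
        return True
    except sqlite3.OperationalError:
        # SQLite reports that the table has no such column.
        return False


# The new field is rejected because the schema does not define it.
before = insert_with_patio(conn)

# The schema has to be changed before the new kind of data can be stored.
conn.execute("ALTER TABLE restaurant ADD COLUMN has_patio INTEGER")
after = insert_with_patio(conn)

# Rows written before the change have no value for the new column.
rows = conn.execute(
    "SELECT name, has_patio FROM restaurant ORDER BY id"
).fetchall()
```

A single ALTER TABLE is cheap in a toy example like this one. The cost the text describes comes from having to repeat that kind of change, along with every query and application that depends on the table, each time a new field or relationship turns up.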

I've seen people solve this problem in a number of ways, but two really stand out, mostly because they're opposite ends of a spectrum. The traditional approach is to continually ...
