You are previewing Big Data Glossary.

Big Data Glossary

Cover of Big Data Glossary by Pete Warden Published by O'Reilly Media, Inc.
O'Reilly logo

Chapter 1. Terms

These new tools need some shorthand labels to describe their properties, and since they’re likely to be unfamiliar to traditional database users, I’ll start off with a few definitions.

Document-Oriented

In a traditional relational database, the user begins by specifying a series of column types and names for a table. Information is then added as rows of values, with each of those named columns as a cell of each row. You can’t have additional values that weren’t specified when you created the table, and every value must be present, even if it’s as a NULL value.

Document stores instead let you enter each record as a series of names with associated values, which you can picture being like a JavaScript object, a Python dictionary, or a Ruby hash. You don’t specify ahead of time what names will be in each table using a schema. In theory, each record could contain a completely different set of named values, though in practice, the application layer often relies on an informal schema, with the client code expecting certain named values to be present.

The key advantage of this document-oriented approach is its flexibility. You can add or remove the equivalent of columns with no penalty, as long as the application layer doesn’t rely on the values that were removed. A good analogy is the difference between languages where you declare the types of variables ahead of time, and those where the type is inferred by the compiler or interpreter. You lose information that can be used to automatically check correctness and optimize for performance, but it becomes a lot easier to prototype and experiment.

Key/Value Stores

The memcached system introduced a lot of web programmers to the power of treating a data store like a giant associative array, reading and writing values based purely on a unique key. It leads to a very simple interface, with three primitive operations to get the data associated with a particular key, to store some data against a key, and to delete a key and its data. Unlike relational databases, with a pure key/value store, it’s impossible to run queries, though some may offer extensions, like the ability to find all the keys that match a wild-carded expression. This means that the application code has to handle building any complex operations out of the primitive calls it can make to the store.

Why would any developer want to do that extra work? With more complex databases, you’re often paying a penalty in complexity or performance for features you may not care about, like full ACID compliance. With key/value stores, you’re given very basic building blocks that have very predictable performance characteristics, and you can create the more complex operations using the same language as the rest of your application.

A lot of the databases listed here try to retain the simplicity of a pure key/value store interface, but with some extra features added to meet common requirements. It seems likely that there’s a sweet spot of functionality that retains some of the advantages of minimal key/value stores without requiring quite as much duplicated effort from the application developer.

Horizontal or Vertical Scaling

Traditional database architectures are designed to run well on a single machine, and the simplest way to handle larger volumes of operations is to upgrade the machine with a faster processor or more memory. That approach to increasing speed is known as vertical scaling. More recent data processing systems, such as Hadoop and Cassandra, are designed to run on clusters of comparatively low-specification servers, and so the easiest way to handle more data is to add more of those machines to the cluster. This horizontal scaling approach tends to be cheaper as the number of operations and the size of the data increases, and the very largest data processing pipelines are all built on a horizontal model. There is a cost to this approach, though. Writing distributed data handling code is tricky and involves tradeoffs between speed, scalability, fault tolerance, and traditional database goals like atomicity and consistency.

MapReduce

MapReduce is an algorithm design pattern that originated in the functional programming world. It consists of three steps. First, you write a mapper function or script that goes through your input data and outputs a series of keys and values to use in calculating the results. The keys are used to cluster together bits of data that will be needed to calculate a single output result. The unordered list of keys and values is then put through a sort step that ensures that all the fragments that have the same key are next to one another in the file. The reducer stage then goes through the sorted output and receives all of the values that have the same key in a contiguous block.

That may sound like a very roundabout way of building your algorithms, but its prime virtue is that it removes unplanned random accesses, with all scattering and gathering handled in the sorting phase. Even on single machines, this boosts performance, thanks to the increased locality of memory accesses, but it also allows the process to be split across a large number of machines easily, by dealing with the input in many independent chunks and partitioning the data based on the key.

Hadoop is the best-known public system for running MapReduce algorithms, but many modern databases, such as MongoDB, also support it as an option. It’s worthwhile even in a fairly traditional system, since if you can write your query in a MapReduce form, you’ll be able to run it efficiently on as many machines as you have available.

Sharding

Any database that’s spread across multiple machines needs some scheme to decide which machines a given piece of data should be stored on. A sharding system makes this decision for each row in a table, using its key. In the simplest case, the application programmer will specify an explicit rule to use for sharding. For example, if you had a ten machine cluster and a numerical key, you might use the last decimal digit of the key to decide which machine to store data on. Since both the storing and retrieval code knows about this rule, when you need to get the row it’s possible to go directly to the machine that holds it.

The biggest problems with sharding are splitting the data evenly across machines and dealing with changes in the size of the cluster. Using the same example, imagine that the numerical keys often end in zero; that will lead to an extremely unbalanced distribution where a single machine is overused and becomes a bottleneck. If the cluster size is expanded from ten to fifteen machines, we could switch to a modulo fifteen scheme for assigning data, but it would require a wholesale shuffling of all the data on the cluster.

To ease the pain of these problems, more complex schemes are used to split up the data. Some of these rely on a central directory that holds the locations of particular keys. This level of indirection allows data to be moved between machines when a particular shard grows too large (to rebalance the distribution), at the cost of requiring an extra lookup in the directory for each operation. The directory’s information is usually fairly small and reasonably static, though, so it’s a good candidate for local caching, as long as the infrequent changes are spotted.

Another popular approach is the use of consistent hashing for the sharding. This technique uses a small table splitting the possible range of hash values into ranges, with one assigned to each shard. The lookup data needed by clients is extremely lightweight, with just a couple of numerical values per node, so it can be shared and cached efficiently, but it has enough flexibility to allow fast rebalancing of the value distributions when nodes are added and removed, or even just when one node becomes overloaded, unlike fixed modulo functions.

The best content for your career. Discover unlimited learning on demand for around $1/day.