
Big Data Glossary by Pete Warden

Chapter 11. Serialization

As you work on turning your data into something useful, it will have to pass between various systems and probably be stored in files at various points. These operations all require some kind of serialization, especially since different stages of your processing are likely to require different languages and APIs. When you’re dealing with very large numbers of records, the choices you make about how to represent and store them can have a massive impact on your storage requirements and performance.

Though it’s well known to most web developers, JSON (JavaScript Object Notation) has only recently emerged as a popular format for data processing. Its biggest advantages are that it maps trivially to existing data structures in most languages and it has a layout that’s restrictive enough to keep the parsing code and schema design simple, but with enough flexibility to express most data in a fairly natural way. Its simplicity does come with some costs, though, especially in storage size. If you’re representing a list of objects mapping keys to values, the most intuitive way would be to use an indexed array of associative arrays. This means that the string for each key is stored inside each object, which involves a large number of duplicated strings when the number of unique keys is small compared to the number of values. There are manual ways around this, of course, especially as the textual representations usually compress well, but many of the other serialization ...
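The key-duplication cost described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`user_id`, `country`, `score`), and the `keys`/`rows` layout are hypothetical, and the column-style layout is just one possible manual workaround, not a method prescribed by the text. The sketch compares the serialized size of an array of associative arrays against a layout that stores each key string once, both before and after compression.

```python
import json
import zlib

# Hypothetical records: a small set of keys repeated across many objects.
records = [{"user_id": i, "country": "US", "score": i % 10} for i in range(1000)]

# Row-oriented layout: every object carries its own copy of each key string.
row_json = json.dumps(records)

# Manual workaround (an assumption for illustration): store the key names
# once, then store each record as a positional list of values.
keys = list(records[0].keys())
columnar = {"keys": keys, "rows": [[r[k] for k in keys] for r in records]}
col_json = json.dumps(columnar)


def decode(payload):
    """Rebuild the list of dictionaries from the keys/rows layout."""
    data = json.loads(payload)
    return [dict(zip(data["keys"], row)) for row in data["rows"]]


# The compact layout round-trips to the same records.
assert decode(col_json) == records

# Compare raw and compressed sizes in bytes.
row_compressed = zlib.compress(row_json.encode("utf-8"))
col_compressed = zlib.compress(col_json.encode("utf-8"))
print("raw:       ", len(row_json), "vs", len(col_json))
print("compressed:", len(row_compressed), "vs", len(col_compressed))
```

The trade-off in this sketch is that the compact layout is no longer self-describing: a reader needs to know the `keys`/`rows` convention to interpret it, whereas the plain array of objects maps directly onto native data structures.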
