Summary

Big data processing depends on how data is represented, both in storage and in transit over the network. The desirable properties of a data representation are compactness, fast transformation, extensibility, and backward compatibility. The key takeaways from this chapter on data representation are as follows:

  • Hadoop provides built-in serialization and deserialization through the Writable interface. Writable classes serialize to a more compact form than Java's native serialization produces.
  • Avro is a flexible and extensible data serialization framework. It serializes data in a binary format and is supported by Hadoop MapReduce, Pig, and Hive.
  • Avro provides dynamic typing, eliminating the need for code generation. The schema can be stored with the data ...
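The compactness claim in the first takeaway can be illustrated with a minimal sketch. It is not Hadoop code: the WritableLike interface below is a local stand-in that declares the same two methods as org.apache.hadoop.io.Writable, so that the example runs without Hadoop on the classpath, and the PageView record with its fields is hypothetical. The sketch writes the same object once through the Writable-style methods and once through Java's ObjectOutputStream, then prints both sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeSketch {

    // Assumption: local stand-in mirroring the two methods of
    // org.apache.hadoop.io.Writable, so no Hadoop dependency is needed.
    interface WritableLike {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical record type, used only for illustration.
    static class PageView implements WritableLike, Serializable {
        private static final long serialVersionUID = 1L;
        int userId;
        long timestamp;

        PageView() {}

        PageView(int userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }

        // Writes only the raw field values: 4 bytes + 8 bytes.
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(userId);
            out.writeLong(timestamp);
        }

        // Reads the fields back in the same order they were written.
        @Override
        public void readFields(DataInput in) throws IOException {
            userId = in.readInt();
            timestamp = in.readLong();
        }
    }

    // Serializes through the Writable-style write() method.
    static byte[] toWritableBytes(WritableLike w) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        }
        return buf.toByteArray();
    }

    // Serializes through standard Java serialization, which also
    // writes a stream header and a class descriptor.
    static byte[] toJavaSerializedBytes(Serializable s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(s);
        }
        return buf.toByteArray();
    }

    // Rebuilds a PageView from the Writable-style bytes.
    static PageView fromWritableBytes(byte[] bytes) throws IOException {
        PageView view = new PageView();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            view.readFields(in);
        }
        return view;
    }

    public static void main(String[] args) throws IOException {
        PageView view = new PageView(42, 1_700_000_000_000L);
        System.out.println("Writable-style bytes:     " + toWritableBytes(view).length);
        System.out.println("Java serialization bytes: " + toJavaSerializedBytes(view).length);
    }
}
```

The Writable-style output holds only the field values, whereas Java serialization also records class metadata in the stream, which is where the size difference comes from. The trade-off is that the reader must already know the type and the field order.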
