Data, data everywhere...

In discussions about integrating Hadoop with other systems, it is easy to think of the integration as a one-to-one pattern: data comes out of one system, is processed in Hadoop, and is then passed on to a third.

Things may be like that on day one, but the reality is more often a series of collaborating components with data flows passing back and forth between them. How we build this complex network in a maintainable fashion is the focus of this chapter.

Types of data

For the sake of this discussion, we will divide data into two broad categories:

  • Network traffic, where data is generated by a system and sent across a network connection
  • File data, where data is generated by a system and written to files on a filesystem somewhere ...
