Considerations for multiple data centers

If you run your business out of multiple data centers and collect a large volume of data, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. This makes analyzing the data more difficult, because you can't just run one MapReduce job against all of it. Instead, you would have to run parallel jobs, one per cluster, and then combine the results in a second pass. This works for searching and counting problems, but not for things such as averages, because an average of averages isn't the same as the average of the underlying data.
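To see why averages resist a simple second pass, consider the minimal Python sketch below. It is only an illustration, not code from Flume or Hadoop: the data center names, the sample values, and the helper functions `mean`, `partial_stats`, and `combine_means` are all hypothetical. It also shows one common workaround, under the assumption that each cluster can report a sum and a count instead of a finished average.

```python
def mean(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def partial_stats(values):
    """What each data center would report in the first pass: (sum, count)."""
    return (sum(values), len(values))


def combine_means(partials):
    """Second pass: merge the (sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical per-data-center measurements of very different sizes.
dc_east = [10, 20, 30, 40]
dc_west = [100]

# Averaging the per-cluster averages weights both clusters equally.
naive = mean([mean(dc_east), mean(dc_west)])

# Combining sums and counts weights every record equally.
exact = combine_means([partial_stats(dc_east), partial_stats(dc_west)])

# Reference value: the mean over all records in one place.
true_mean = mean(dc_east + dc_west)

print(naive, exact, true_mean)
```

With these sample values the naive result is 62.5, while the combined result and the single-pass mean are both 40.0. The gap appears whenever the clusters hold different numbers of records, which is the usual case across data centers.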

Pulling all your data into a single cluster may also be more than your network can handle. Depending on how your data centers ...


Apache Flume: Distributed Log Collection for Hadoop by Steve Hoffman
