Summary

In this chapter, we explored briefly how to understand a data processing architecture. First, we explored how to interact with a distributed file system. Then, we provided installation instructions for a Cloudera VM and how to get started in a data environment. Finally, we described the main features of Apache Spark and ran a practical example of how to implement a word count algorithm.

Apache Spark is highly recommended for all the data community, because it provides a robust and fast data processing tool. It also provides a library for Sql-like querying, graph processing, and machine learning algorithms.

Get Practical Data Analysis - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.