What is Hadoop?

Hadoop is an open-source framework for working with large quantities of data spread across a single computer to thousands of computers. Hadoop is composed of four modules:

  • Hadoop Core
  • Hadoop Distributed File System (HDFS)
  • Yet Another Resource Negotiator (YARN)
  • MapReduce

The Hadoop Core makes up the components needed to run the other three modules. HDFS is a Java-based file system that has been designed to be distributed and is capable of storing large files across many machines. By large files, we are talking terabytes. YARN manages the resources and scheduling in your Hadoop framework. The MapReduce engine allows you to process data in parallel.

There are several other projects that can be installed to work with the Hadoop ...

Get Mastering Geospatial Analysis with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.