Data ingestion

Data management in big data is an important and critical aspect. We have to import and export large scale data to do processing, which becomes unmanageable in the production environment. In Hadoop, we deal with different set of sources such as batch, streaming, real time, and also sources that are complex in data formats, as some are semi-structured and unstructured too. Managing such data is very difficult, therefore we have some tools for data management such as Flume, Sqoop, and Storm, which are mentioned as follows:

  • Apache Flume: Apache Flume is a widely used tool for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Flume is a distributed, reliable, ...

Get Hadoop Essentials now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.