Organizing your Data Lake in HDFS

As you load files to your Data Lake, it is important to have this process managed for data consumers in order to find the right data. Organization of data requires planning, coordination, and governance. One proposed model that I have seen used by several clients is to have three main directories:

  • Staging: This directory will host all the original source files, as they get ingested to the Data Lake. Each source should have its own directory. For example, let's consider that an organization has two financial databases, findb01 and findb02. A proposed directory structure in Data Lake can be /data/stage/findb01 and /data/stage/findb02.
  • Cleansed: The data in staging should go through basic audit and data quality checks ...

Get HDInsight Essentials - Second Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.