Chapter 8. Data Engineering with Drill

Drill is a SQL engine that reads large data files stored in a distributed filesystem such as HDFS, MapR-FS, or Amazon S3. Drill works best with data stored in Parquet, but data seldom arrives in Parquet, and it is often handy to work with data in its original format. In this chapter, you will see that with Drill you can read data in many formats, and use specialized tricks to overcome schema-related issues. However, for production, Parquet is the preferred file format.

Although some of the material in this chapter has been covered in previous chapters, this chapter will go into much greater detail on how Drill actually processes data, which is vital to understand if you are developing extensions for Drill or if you encounter files with an ambiguous schema.

Schema-on-Read

Apache Drill is designed for the modern data lake, which consists of a very large number of files, organized into directories and stored in a wide variety of file formats. Although Drill is optimized for Parquet files, it can read data from many different file formats using extensible storage plug-ins.

Unlike Hive, which requires a predefined schema to describe a file, Drill uses the structure within the file itself. This strategy, known as schema-on-read, works very well for file formats such as Parquet, which carry a clear, unambiguous schema. But as you will see in this chapter, you must provide Drill a bit of help to read other file formats, such as CSV or ...
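To make the idea of schema-on-read concrete, here is a minimal Python sketch. It is not Drill's implementation and does not use any Drill API; it only illustrates why a self-describing format yields a usable schema at read time while a plain-text format does not. The sample data, the function names (`infer_schema`, `read_json_lines`, `read_csv`), and the use of JSON as a stand-in for a self-describing format are all assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this illustration.
JSON_LINES = (
    '{"id": 1, "price": 9.5, "name": "widget"}\n'
    '{"id": 2, "price": 12.0, "name": "gadget"}\n'
)
CSV_TEXT = "id,price,name\n1,9.5,widget\n2,12.0,gadget\n"


def infer_schema(records):
    """Map each column name to the first Python type name seen for it."""
    schema = {}
    for record in records:
        for column, value in record.items():
            schema.setdefault(column, type(value).__name__)
    return schema


def read_json_lines(text):
    """Parse one JSON object per line; values keep their JSON types."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_csv(text):
    """Parse CSV with a header row; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


json_schema = infer_schema(read_json_lines(JSON_LINES))
csv_schema = infer_schema(read_csv(CSV_TEXT))

print(json_schema)  # numeric columns are recognized as numbers
print(csv_schema)   # every column is a string until told otherwise
```

In this sketch the JSON records carry enough type information for the reader to tell integers, floats, and strings apart, whereas the CSV reader can report only strings. That gap is the kind of ambiguity a schema-on-read engine needs help resolving, for example through explicit casts in the query.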
