Chapter 12. Writing a Format Plug-in

As described in Chapter 8, Apache Drill uses storage and format plug-ins to read data. The storage plug-in connects to a storage system such as Kafka, a database, or a distributed filesystem. The DFS interface is based on the HDFS client libraries and can obtain data from HDFS, Amazon S3, MapR, and so on.

A distributed filesystem contains a wide variety of files (Parquet, CSV, JSON, and so on). The dfs storage plug-in uses format plug-ins to read data from these files. In this chapter, we explore how to create custom format plug-ins for file formats that Drill does not yet support.

Format plug-ins integrate tightly with Drill’s internal mechanisms for configuration, memory allocation, column projection, filename resolution, and data representation. Writing plug-ins is therefore an “advanced” task that requires Java experience, patience, frequent consultation of existing code, and a willingness to post questions on the “dev” mailing list.

Drill provides two ways to structure your plug-in. Here we focus on the “Easy” format plug-in, useful for most file formats, that handles much of the boilerplate for you. It is also possible to write a plug-in without the Easy framework, but it is unlikely you will need to do so.

The Example Regex Format Plug-in

As an example, we’re going to create a format plug-in for any text file format that can be described as a regular expression, or regex. The regex defines how to parse columns from an input record and is defined ...
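To make the idea concrete, here is a minimal sketch of regex-based column parsing that uses only the standard java.util.regex classes. It is not Drill code and does not use any Drill API: the class name RegexColumnSketch, the method parseColumns, and the sample log pattern are all hypothetical, chosen only to illustrate how each capture group in a regex can supply one column of a record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Drill or of the plug-in built in this chapter.
public class RegexColumnSketch {

    // Returns one value per capture group, or an empty list if the line does not match.
    static List<String> parseColumns(Pattern pattern, String line) {
        List<String> columns = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            return columns; // the whole line must match the pattern
        }
        // Group 0 is the entire match, so columns start at group 1.
        for (int i = 1; i <= m.groupCount(); i++) {
            columns.add(m.group(i));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Assumed sample format: "yyyy-mm-dd LEVEL message"
        Pattern p = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2}) (\\w+) (.*)");
        System.out.println(parseColumns(p, "2017-12-19 INFO Drill started"));
        // prints [2017, 12, 19, INFO, Drill started]
    }
}
```

In this sketch a line that does not match simply yields no columns; a real reader would have to decide whether to skip such lines, report them, or return them in a catch-all column.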
