Chapter 2. About Kudu

Apache Kudu is often summarized in a single phrase: a Hadoop storage layer to enable fast analytics on fast data. Although that is a short, simple-to-understand statement, until now achieving those goals has not been easy.

We can already achieve fast analytics with today's big data technology. Storing data in highly efficient columnar storage formats, such as Parquet and ORC, allows compute engines to sequentially read data across the entire distributed filesystem, HDFS, at a very high rate. Analytical queries perform large aggregations over a subset of columns, which effectively translates to projecting the columns the query requests and then performing a mathematical operation on a large number of values in each of those columns. Columnar storage formats are terrific for this workload for two reasons: a) projecting a column limits the I/O to solely the pages on disk containing data for that column, which is possible because the format is already split by column; and b) numbers in particular can use various encoding and packing mechanisms to stuff values representing many rows onto a single page on disk. I/O is therefore extremely efficient, and compute operations on the values in a column can be performed quickly. In short, HDFS, coupled with columnar file formats, yields highly performant I/O and compute capability, so analytics queries are processed quickly.
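To make the column-projection idea concrete, here is a minimal sketch using PyArrow. The library choice, file path, and column names are illustrative assumptions, not anything the chapter prescribes; the point is simply that reading one column of a Parquet file touches only that column's pages on disk.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical three-column table standing in for a wide fact table.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "IN"],
    "revenue": [10.5, 3.2, 7.7, 1.1],
})
pq.write_table(table, "/tmp/example.parquet")

# Projecting a single column: only the Parquet pages holding "revenue"
# are read back, not the whole file.
projected = pq.read_table("/tmp/example.parquet", columns=["revenue"])

# An aggregation over that one column -- the shape of a typical
# analytical query.
print(pc.sum(projected["revenue"]))
```

The same pattern scales to the distributed case: an engine such as Impala or Spark running over HDFS performs the identical projection, just across many files and machines at once.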

On the other hand, we also ...
