Hive buckets

Besides partition, bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from partition, the bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses the year and month as the top-level partition. If there is a further request to use the employee_id as the third level of partition, it leads to many deep and small partitions and directories. For instance, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same employee_id will always be stored in the same bucket (segment of ...

Get Apache Hive Essentials now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.