Buckets

Besides partition, the bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partition. If there is a further request to use employee_id as the third level of partition, it creates many partition directories. For instance, we can bucket the employee_partitioned table using employee_id as a bucket column. The value of this column will be hashed by a user-defined number of buckets. The records with the same employee_id will always be stored in the same bucket (segment of files). The bucket columns are ...

Get Apache Hive Essentials now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.