Bucket table sampling

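Bucket table sampling is a sampling method optimized for bucketed tables. Its syntax is TABLESAMPLE(BUCKET x OUT OF y ON colname), as shown in the following example. The SELECT clause specifies the columns to return from the sampled rows, while the ON clause names the column whose hash decides which of the y buckets a row falls into. To sample on entire rows rather than on one column, use the rand() function in the ON clause. If the sample column is also the CLUSTERED BY column of the table, the sampling is more efficient, because Hive can read only the matching bucket files instead of scanning the whole table: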

-- Sampling based on the whole row
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
+--------+
| name   |
+--------+
| Steven |
+--------+
1 row selected (0.129 seconds)

-- Sampling based on the bucket column, which is efficient
> SELECT name FROM employee_trans
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON emp_id) a;
+---------+
| name    |
+---------+
| Lucy    |
| Steven  |
| Michael |
+---------+
3 rows selected ...
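
To make the selection rule behind BUCKET x OUT OF y more concrete, the following Python sketch simulates it outside Hive. It is only an illustration under stated assumptions: a row is kept when the non-negative hash of the ON column, modulo y, equals x - 1, and an integer column is assumed to hash to its own value. The function name bucket_sample and the emp_id values are hypothetical, so the output is not expected to match the query results above.

```python
# Hypothetical rows: the emp_id values are invented for illustration
# and are not taken from the employee_trans table.
rows = [
    {"emp_id": 100, "name": "Michael"},
    {"emp_id": 101, "name": "Will"},
    {"emp_id": 102, "name": "Steven"},
    {"emp_id": 103, "name": "Lucy"},
]


def bucket_sample(rows, column, x, y):
    """Keep the rows that fall into bucket x (1-based) out of y buckets."""
    selected = []
    for row in rows:
        # Assumption: an integer hashes to itself; the mask keeps it non-negative.
        hashed = row[column] & 0x7FFFFFFF
        if hashed % y == x - 1:
            selected.append(row)
    return selected


# Roughly mirrors TABLESAMPLE(BUCKET 1 OUT OF 2 ON emp_id).
for row in bucket_sample(rows, "emp_id", 1, 2):
    print(row["name"])
```

Because the rule depends only on the column value, the same rows come back on every run. Sampling ON rand() instead assigns each row a fresh random value, which is why that form can return a different sample each time it is executed.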
