Chapter 4. HiveQL: Data Definition

HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s dialect, but with significant differences. Hive offers no support for row-level inserts, updates, and deletes. Hive doesn’t support transactions. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs.

Still, much of HiveQL will be familiar. This chapter and the ones that follow discuss the features of HiveQL using representative examples. In some cases, we will briefly mention details for completeness, then explore them more fully in later chapters.

This chapter starts with the so-called data definition language parts of HiveQL, which are used for creating, altering, and dropping databases, tables, views, functions, and indexes. We’ll discuss databases and tables in this chapter, deferring the discussion of views until Chapter 7, indexes until Chapter 8, and functions until Chapter 13.

We’ll also discuss the SHOW and DESCRIBE commands for listing and describing items as we go.

Subsequent chapters explore the data manipulation language parts of HiveQL that are used to put data into Hive tables and to extract data to the filesystem, and how to explore and manipulate data with queries, grouping, filtering, joining, etc.

Databases in Hive

The Hive concept of a database is essentially just a catalog or namespace of tables. However, databases are very useful for larger clusters with multiple teams and users, as a way of avoiding table name collisions. It’s also common to use databases to organize production tables into logical groups.

If you don’t specify a database, the default database is used.

The simplest syntax for creating a database is shown in the following example:

hive> CREATE DATABASE financials;

Hive will throw an error if financials already exists. You can suppress this error with the following variation:

hive> CREATE DATABASE IF NOT EXISTS financials;

While normally you might like to be warned if a database of the same name already exists, the IF NOT EXISTS clause is useful for scripts that should create a database on-the-fly, if necessary, before proceeding.

You can also use the keyword SCHEMA instead of DATABASE in all the database-related commands.
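
For instance, the IF NOT EXISTS example above could equally be written as follows:

hive> CREATE SCHEMA IF NOT EXISTS financials;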

At any time, you can see the databases that already exist as follows:

hive> SHOW DATABASES;
default
financials

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;
default
financials
human_resources

If you have a lot of databases, you can restrict the ones listed using a regular expression, a concept we’ll explain in LIKE and RLIKE, if it is new to you. The following example lists only those databases that start with the letter h and end with any other characters (the .* part):

hive> SHOW DATABASES LIKE 'h.*';
human_resources
hive> ...

Hive will create a directory for each database. Tables in that database will be stored in subdirectories of the database directory. The exception is tables in the default database, which doesn’t have its own directory.

The database directory is created under a top-level directory specified by the property hive.metastore.warehouse.dir, which we discussed in Local Mode Configuration and Distributed and Pseudodistributed Mode Configuration. Assuming you are using the default value for this property, /user/hive/warehouse, when the financials database is created, Hive will create the directory /user/hive/warehouse/financials.db. Note the .db extension.

You can override this default location for the new directory as shown in this example:

hive> CREATE DATABASE financials
    > LOCATION '/my/preferred/directory';

You can add a descriptive comment to the database, which will be shown by the DESCRIBE DATABASE <database> command.

hive> CREATE DATABASE financials
    > COMMENT 'Holds all financial tables';

hive> DESCRIBE DATABASE financials;
financials   Holds all financial tables
  hdfs://master-server/user/hive/warehouse/financials.db

Note that DESCRIBE DATABASE also shows the directory location for the database. In this example, the URI scheme is hdfs. For a MapR installation, it would be maprfs. For an Amazon Elastic MapReduce (EMR) cluster, it would also be hdfs, but you could set hive.metastore.warehouse.dir to use Amazon S3 explicitly (i.e., by specifying s3n://bucketname/… as the property value). You could use s3 as the scheme, but the newer s3n is preferred.

In the output of DESCRIBE DATABASE, we’re showing master-server to indicate the URI authority, in this case a DNS name and optional port number (i.e., server:port) for the “master node” of the filesystem (i.e., where the NameNode service is running for HDFS). If you are running in pseudo-distributed mode, then the master server will be localhost. For local mode, the path will be a local path, file:///user/hive/warehouse/financials.db.

If the authority is omitted, Hive uses the master-server name and port defined by the property fs.default.name in the Hadoop configuration files, found in the $HADOOP_HOME/conf directory.

To be clear, hdfs:///user/hive/warehouse/financials.db is equivalent to hdfs://master-server/user/hive/warehouse/financials.db, where master-server is your master node’s DNS name and optional port.

For completeness, when you specify a relative path (e.g., some/relative/path), Hive will put this under your home directory in the distributed filesystem (e.g., hdfs:///user/<user-name>) for HDFS. However, if you are running in local mode, your current working directory is used as the parent of some/relative/path.

For script portability, it’s typical to omit the authority, only specifying it when referring to another distributed filesystem instance (including S3 buckets).
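
As an illustrative sketch (the sandbox database name here is hypothetical, and some/relative/path is just the placeholder used above), a relative LOCATION resolves under your home directory in HDFS:

hive> CREATE DATABASE sandbox
    > LOCATION 'some/relative/path';

On HDFS the database directory ends up under hdfs:///user/<user-name>/some/relative/path, while in local mode it is created under your current working directory.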

Lastly, you can associate key-value properties with the database, although their only function currently is to provide a way of adding information to the output of DESCRIBE DATABASE EXTENDED <database>:

hive> CREATE DATABASE financials
    > WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date' = '2012-01-02');

hive> DESCRIBE DATABASE financials;
financials   hdfs://master-server/user/hive/warehouse/financials.db

hive> DESCRIBE DATABASE EXTENDED financials;
financials   hdfs://master-server/user/hive/warehouse/financials.db
 {date=2012-01-02, creator=Mark Moneybags}

The USE command sets a database as your working database, analogous to changing working directories in a filesystem:

hive> USE financials;

Now, commands such as SHOW TABLES; will list the tables in this database.

Unfortunately, there is no command to show you which database is your current working database! Fortunately, it’s always safe to repeat the USE … command; there is no concept in Hive of nesting of databases.

Recall that we pointed out a useful trick in Variables and Properties for setting a property to print the current database as part of the prompt (Hive v0.8.0 and later):

hive> set hive.cli.print.current.db=true;

hive (financials)> USE default;

hive (default)> set hive.cli.print.current.db=false;

hive> ...

Finally, you can drop a database:

hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.

By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which causes Hive to drop the tables in the database first:

hive> DROP DATABASE IF EXISTS financials CASCADE;

Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database.
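
For example, the following statement behaves exactly like the default (it simply makes the RESTRICT behavior explicit):

hive> DROP DATABASE IF EXISTS financials RESTRICT;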

When a database is dropped, its directory is also deleted.

Alter Database

You can set key-value pairs in the DBPROPERTIES associated with a database using the ALTER DATABASE command. No other metadata about the database can be changed, including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

There is no way to delete or “unset” a DBPROPERTY.

Creating Tables

The CREATE TABLE statement follows SQL conventions, but Hive’s version offers significant extensions to support a wide range of flexibility where the data files for tables are stored, the formats used, etc. We discussed many of these options in Text File Encoding of Data Values and we’ll return to more advanced options later in Chapter 15. In this section, we describe the other options available for the CREATE TABLE statement, adapting the employees table declaration we used previously in Collection Data Types:

CREATE TABLE IF NOT EXISTS mydb.employees (
  name         STRING COMMENT 'Employee name',
  salary       FLOAT  COMMENT 'Employee salary',
  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions   MAP<STRING, FLOAT>
               COMMENT 'Keys are deductions names, values are percentages',
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
               COMMENT 'Home address')
COMMENT 'Description of the table'
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...);

First, note that you can prefix a database name, mydb in this case, if you’re not currently working in the target database.

If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table already exists. This is useful in scripts that should create a table the first time they run.

However, this clause has a gotcha you should know about. If the schema you specify differs from the schema of the table that already exists, Hive won’t warn you. If your intention is for this table to have the new schema, you’ll have to drop the old table, losing your data, and then re-create it. Consider whether you should instead use one or more ALTER TABLE statements to change the existing table schema. See Alter Table for details.

Warning

If you use IF NOT EXISTS and the existing table has a different schema than the schema in the CREATE TABLE statement, Hive will ignore the discrepancy.

You can add a comment to any column, after the type. Like databases, you can attach a comment to the table itself and you can define one or more table properties. In most cases, the primary benefit of TBLPROPERTIES is to add additional documentation in a key-value format. However, when we examine Hive’s integration with databases such as DynamoDB (see DynamoDB), we’ll see that the TBLPROPERTIES can be used to express essential metadata about the database connection.

Hive automatically adds two table properties: last_modified_by holds the username of the last user to modify the table, and last_modified_time holds the epoch time in seconds of that modification.

Note

A planned enhancement for Hive v0.10.0 is to add a SHOW TBLPROPERTIES table_name command that will list just the TBLPROPERTIES for a table.

Finally, you can optionally specify a location for the table data (as opposed to metadata, which the metastore will always hold). In this example, we are showing the default location that Hive would use, /user/hive/warehouse/mydb.db/employees, where /user/hive/warehouse is the default “warehouse” location (as discussed previously), mydb.db is the database directory, and employees is the table directory.

By default, Hive always creates the table’s directory under the directory for the enclosing database. The exception is the default database. It doesn’t have a directory under /user/hive/warehouse, so a table in the default database will have its directory created directly in /user/hive/warehouse (unless explicitly overridden).

Note

To avoid potential confusion, it’s usually better to use an external table if you don’t want to use the default location for the table. See External Tables for details.

You can also copy the schema (but not the data) of an existing table:

CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;

This version also accepts the optional LOCATION clause, but note that no other properties, including the schema, can be defined; they are determined from the original table.

The SHOW TABLES command lists the tables. With no additional arguments, it shows the tables in the current working database. Let’s assume we have already created a few other tables, table1 and table2, and we did so in the mydb database:

hive> USE mydb;

hive> SHOW TABLES;
employees
table1
table2

If we aren’t in the same database, we can still list the tables in that database:

hive> USE default;

hive> SHOW TABLES IN mydb;
employees
table1
table2

If we have a lot of tables, we can limit the ones listed using a regular expression, a concept we’ll discuss in detail in LIKE and RLIKE:

hive> USE mydb;

hive> SHOW TABLES 'empl.*';
employees

Not all regular expression features are supported. If you know regular expressions, it’s better to test a candidate regular expression to make sure it actually works!

The regular expression in the single quote looks for all tables with names starting with empl and ending with any other characters (the .* part).

Note

Using the IN database_name clause and a regular expression for the table names together is not supported.

We can also use the DESCRIBE EXTENDED mydb.employees command to show details about the table. (We can drop the mydb. prefix if we’re currently using the mydb database.) We have reformatted the output for easier reading and we have suppressed many details to focus on the items that interest us now:

hive> DESCRIBE EXTENDED mydb.employees;
name    string  Employee name
salary  float   Employee salary
subordinates    array<string>   Names of subordinates
deductions      map<string,float> Keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int>  Home address

Detailed Table Information      Table(tableName:employees, dbName:mydb, owner:me,
...
location:hdfs://master-server/user/hive/warehouse/mydb.db/employees,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
            last_modified_user=me, last_modified_time=1337544510,
            comment:Description of the table, ...}, ...)

Replacing EXTENDED with FORMATTED provides more readable but also more verbose output.

The first section shows the output of DESCRIBE without EXTENDED or FORMATTED (i.e., the schema including the comments for each column).

If you only want to see the schema for a particular column, append the column name to the table name. Here, EXTENDED adds no additional output:

hive> DESCRIBE mydb.employees.salary;
salary  float   Employee salary

Returning to the extended output, note the line in the description that starts with location:. It shows the full URI path in HDFS to the directory where Hive will keep all the data for this table, as we discussed above.

Warning

We said that the last_modified_by and last_modified_time table properties are automatically created. However, they are only shown in the Detailed Table Information if a user-specified table property has also been defined!

Managed Tables

The tables we have created so far are called managed tables or, sometimes, internal tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

When we drop a managed table (see Dropping Tables), Hive deletes the data in the table.

However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, and we want to run some queries against it without giving Hive ownership of the data. We can define an external table that points to that data, but doesn’t take ownership of it.

External Tables

Suppose we are analyzing data from the stock markets. Periodically, we ingest the data for NASDAQ and the NYSE from a source like Infochimps (http://infochimps.com/datasets) and we want to study this data with many tools. (See the data sets named infochimps_dataset_4777_download_16185 and infochimps_dataset_4778_download_16677, respectively, which are actually sourced from Yahoo! Finance.) The schema we’ll use next matches the schemas of both these data sources. Let’s assume the data files are in the distributed filesystem directory /data/stocks.

The following table declaration creates an external table that can read all the data files for this comma-delimited data in /data/stocks:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

The EXTERNAL keyword tells Hive this table is external and the LOCATION … clause is required to tell Hive where it’s located.

Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.

There are a few other small differences between managed and external tables, where some HiveQL constructs are not permitted for external tables. We’ll discuss those when we come to them.

However, it’s important to note that the differences between managed and external tables are smaller than they appear at first. Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables. Hive may technically own these directories and files, but it doesn’t have full control over them! Recall, in Schema on Read, we said that Hive really has no control over the integrity of the files used for storage and whether or not their contents are consistent with the table schema. Even managed tables don’t give us this control.

Still, a general principle of good software design is to express intent. If the data is shared between tools, then creating an external table makes this ownership explicit.

You can tell whether or not a table is managed or external using the output of DESCRIBE EXTENDED tablename. Near the end of the Detailed Table Information output, you will see the following for managed tables:

... tableType:MANAGED_TABLE)

For external tables, you will see the following:

... tableType:EXTERNAL_TABLE)

As for managed tables, you can also copy the schema (but not the data) of an existing table:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';

Note

If you omit the EXTERNAL keyword and the original table is external, the new table will also be external. If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.

Partitioned, Managed Tables

The general notion of partitioning data is an old one. It can take many forms, but often it’s used for distributing load horizontally, moving data physically closer to its most frequent users, and other purposes.

Hive has the notion of partitioned tables. We’ll see that they have important performance benefits, and they can help organize data in a logical fashion, such as hierarchically.

We’ll discuss partitioned managed tables first. Let’s return to our employees table and imagine that we work for a very large multinational corporation. Our HR people often run queries with WHERE clauses that restrict the results to a particular country or to a particular first-level subdivision (e.g., state in the United States or province in Canada). (First-level subdivision is an actual term, used here, for example: http://www.commondatahub.com/state_source.jsp.) We’ll just use the word state for simplicity. This leaves us with redundant state information: the state element inside the address field and the state partition. We could remove the state element from address, but keeping it causes no ambiguity in queries, since we have to write address.state to project the value inside the address struct. So, let’s partition the data first by country and then by state:

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Partitioning tables changes how Hive structures the data storage. If we create this table in the mydb database, there will still be an employees directory for the table:

hdfs://master_server/user/hive/warehouse/mydb.db/employees

However, Hive will now create subdirectories reflecting the partitioning structure. For example:

...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
...

Yes, those are the actual directory names. The state directories will contain zero or more files for the employees in those states.

Once created, the partition keys (country and state, in this case) behave like regular columns. There is one known exception, due to a bug (see Aggregate functions). In fact, users of the table don’t need to care if these “columns” are partitions or not, except when they want to optimize query performance.

For example, the following query selects all employees in the state of Illinois in the United States:

SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';

Note that because the country and state values are encoded in directory names, there is no reason to have this data in the data files themselves. In fact, the data just gets in the way in the files, since you have to account for it in the table schema, and this data wastes space.

Perhaps the most important reason to partition data is for faster queries. In the previous query, which limits the results to employees in Illinois, it is only necessary to scan the contents of one directory. Even if we have thousands of country and state directories, all but one can be ignored. For very large data sets, partitioning can dramatically improve query performance, but only if the partitioning scheme reflects common range filtering (e.g., by locations, timestamp ranges).

When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters.

Even if you do a query across the entire US, Hive only reads the 65 directories covering the 50 states, 9 territories, the District of Columbia, and 6 military “states” used by the armed services. You can see the full list here: http://www.50states.com/abbreviations.htm.
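
As an illustration (this query is not part of the original example set), a predicate on country alone still prunes the scan to the directories under country=US:

SELECT * FROM employees
WHERE country = 'US';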

Of course, if you need to do a query for all employees around the globe, you can still do it. Hive will have to read every directory, but hopefully these broader disk scans will be relatively rare.

However, a query across all partitions could trigger an enormous MapReduce job if the table data and number of partitions are large. A highly suggested safety measure is putting Hive into “strict” mode, which prohibits queries of partitioned tables without a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the following session:

hive> set hive.mapred.mode=strict;

hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
 Alias "e" Table "employees"

hive> set hive.mapred.mode=nonstrict;

hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
John Doe  100000.0
...

You can see the partitions that exist with the SHOW PARTITIONS command:

hive> SHOW PARTITIONS employees;
...
country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
...

If you have a lot of partitions and you want to see if partitions have been defined for particular partition keys, you can further restrict the command with an optional PARTITION clause that specifies one or more of the partitions with specific values:

hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...

hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK

The DESCRIBE EXTENDED employees command shows the partition keys:

hive> DESCRIBE EXTENDED employees;
name         string,
salary       float,
...
address      struct<...>,
country      string,
state        string

Detailed Table Information...
partitionKeys:[FieldSchema(name:country, type:string, comment:null),
FieldSchema(name:state, type:string, comment:null)],
...

The schema part of the output lists the country and state with the other columns, because they are columns as far as queries are concerned. The Detailed Table Information includes the country and state as partition keys. The comments for both of these keys are null; we could have added comments just as for regular columns.

You create partitions in managed tables by loading data into them. The following example creates a country=US, state=CA (California) partition while loading data into it from a local directory, $HOME/california-employees. You must specify a value for each partition column. Notice how we reference the HOME environment variable in HiveQL:

LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

The directory for this partition, …/employees/country=US/state=CA, will be created by Hive and all data files in $HOME/california-employees will be copied into it. See Loading Data into Managed Tables for more information on populating tables.

External Partitioned Tables

You can use partitioning with external tables. In fact, you may find that this is your most common scenario for managing large production data sets. The combination gives you a way to “share” data with other tools, while still optimizing query performance.

You also have more flexibility in the directory structure used, as you define it yourself. We’ll see a particularly useful example in a moment.

Let’s consider a new example that fits this scenario well: logfile analysis. Most organizations use a standard format for log messages, recording a timestamp, severity (e.g., ERROR, WARNING, INFO), perhaps a server name and process ID, and then an arbitrary text message. Suppose our Extract, Transform, and Load (ETL) process ingests and aggregates logfiles in our environment, converting each log message to a tab-delimited record and also decomposing the timestamp into separate year, month, and day fields, and a combined hms field for the remaining hour, minute, and second parts of the timestamp, for reasons that will become clear in a moment. You could do this parsing of log messages using the string parsing functions built into Hive or Pig, for example. Alternatively, we could use smaller integer types for some of the timestamp-related fields to conserve space. Here, we are ignoring subsecond resolution.

Here’s how we might define the corresponding Hive table:

CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
  hms             INT,
  severity        STRING,
  server          STRING,
  process_id      INT,
  message         STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

We’re assuming that a day’s worth of log data is about the correct size for a useful partition and that finer-grained queries over a day’s data will be fast enough.

Recall that when we created the nonpartitioned external stocks table, a LOCATION … clause was required. It isn’t used for external partitioned tables. Instead, an ALTER TABLE statement is used to add each partition separately. It must specify a value for each partition key, the year, month, and day, in this case (see Alter Table for more details on this feature). Here is an example, where we add a partition for January 2nd, 2012:

ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

The directory convention we use is completely up to us. Here, we follow a hierarchical directory structure, because it’s a logical way to organize our data, but there is no requirement to do so. We could follow Hive’s directory naming convention (e.g., …/exchange=NASDAQ/symbol=AAPL), but it is not required.

An interesting benefit of this flexibility is that we can archive old data on inexpensive storage, like Amazon’s S3, while keeping newer, more “interesting” data in HDFS. For example, each day we might use the following procedure to move data older than a month to S3:

  • Copy the data for the partition being moved to S3. For example, you can use the hadoop distcp command:

    hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
  • Alter the table to point the partition to the S3 location:

    ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
    SET LOCATION 's3n://ourbucket/logs/2011/12/02';
  • Remove the HDFS copy of the partition using the hadoop fs -rmr command:

    hadoop fs -rmr /data/log_messages/2011/12/02

You don’t have to be an Amazon Elastic MapReduce user to use S3 this way. S3 support is part of the Apache Hadoop distribution. You can still query this data, even queries that cross the month-old “boundary,” where some data is read from HDFS and some data is read from S3!

By the way, Hive doesn’t care if a partition directory doesn’t exist for a partition or if it has no files. In both cases, you’ll just get no results for a query that filters for the partition. This is convenient when you want to set up partitions before a separate process starts writing data to them. As soon as data is there, queries will return results from that data.
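
For example, here is a minimal sketch (the date and directory are hypothetical) that sets up a partition before the ETL process has written any files to it:

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2012, month = 1, day = 3)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/03';

Queries that filter on this partition simply return no results until data arrives in that directory.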

This feature illustrates another benefit: new data can be written to a dedicated directory with a clear distinction from older data in other directories. Also, whether you move old data to an “archive” location or delete it outright, the risk of tampering with newer data is reduced since the data subsets are in separate directories.

As for nonpartitioned external tables, Hive does not own the data and it does not delete the data if the table is dropped.

As for managed partitioned tables, you can see an external table’s partitions with SHOW PARTITIONS:

hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2
...

Similarly, DESCRIBE EXTENDED log_messages shows the partition keys both as part of the schema and in the list of partitionKeys:

hive> DESCRIBE EXTENDED log_messages;
...
message         string,
year            int,
month           int,
day             int

Detailed Table Information...
partitionKeys:[FieldSchema(name:year, type:int, comment:null),
FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null)],
...

This output is missing a useful bit of information, the actual location of the partition data. There is a location field, but it only shows Hive’s default directory that would be used if the table were a managed table. However, we can get a partition’s location as follows:

hive> DESCRIBE EXTENDED log_messages PARTITION (year=2012, month=1, day=2);
...
location:hdfs://master_server/data/log_messages/2012/01/02,
...

We frequently use external partitioned tables because of the many benefits they provide, such as logical data management, performant queries, etc.

ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with managed tables, too, when you have (or will have) data for partitions in directories created outside of the LOAD and INSERT options we discussed above. You’ll need to remember that not all of the table’s data will be under the usual Hive “warehouse” directory, and this data won’t be deleted when you drop the managed table! Hence, from a “sanity” perspective, it’s questionable whether you should dare to use this feature with managed tables.
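
As a hedged sketch of this usage (the feed directory path is hypothetical), you could attach an externally created directory to the managed, partitioned employees table like this:

ALTER TABLE employees ADD IF NOT EXISTS
PARTITION (country = 'US', state = 'NY')
LOCATION '/data/external-feeds/employees/us/ny';

As noted above, that directory lives outside the usual warehouse directory and won’t be deleted when the managed table is dropped.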

Customizing Table Storage Formats

In Text File Encoding of Data Values, we discussed that Hive defaults to a text file format, which is indicated by the optional clause STORED AS TEXTFILE, and you can overload the default values for the various delimiters when creating the table. Here we repeat the definition of the employees table we used in that discussion:

CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

TEXTFILE implies that all fields are encoded using alphanumeric characters, including those from international character sets, although we observed that Hive uses non-printing characters as “terminators” (delimiters), by default. When TEXTFILE is used, each line is considered a separate record.

You can replace TEXTFILE with one of the other built-in file formats supported by Hive, including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O bandwidth performance using binary encoding and optional compression. These formats are discussed in more detail in Chapter 11 and Chapter 15.
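
For instance, here is a minimal sketch (the employees_seq table and its two columns are hypothetical, chosen only to keep the example short) of a table whose data is stored as a sequence file rather than as text:

CREATE TABLE employees_seq (
  name   STRING,
  salary FLOAT)
STORED AS SEQUENCEFILE;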

Hive draws a distinction between how records are encoded into files and how columns are encoded into records. You customize these behaviors separately.

The record encoding is handled by an input format object (e.g., the Java code behind TEXTFILE). Hive uses a Java class (compiled module) named org.apache.hadoop.mapred.TextInputFormat. If you are unfamiliar with Java, the dotted name syntax indicates a hierarchical namespace tree of packages that actually corresponds to the directory structure for the Java code. The last name, TextInputFormat, is a class in the lowest-level package, mapred.

The record parsing is handled by a serializer/deserializer or SerDe for short. For TEXTFILE and the encoding we described in Chapter 3 and repeated in the example above, the SerDe Hive uses is another Java class called org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.

For completeness, there is also an output format that Hive uses for writing the output of queries to files and to the console. For TEXTFILE, the Java class named org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is used for output.

Note

Hive uses an input format to split input streams into records, an output format to format records into output streams (i.e., the output of queries), and a SerDe to parse records into columns when reading and to encode columns into records when writing. We’ll explore these distinctions in greater depth in Chapter 15.

Third-party input and output formats and SerDes can be specified, a feature which permits users to customize Hive for a wide range of file formats not supported natively.

Here is a complete example that uses a custom SerDe, input format, and output format for files accessible through the Avro protocol, which we will discuss in detail in Avro Hive SerDe:

CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';

The ROW FORMAT SERDE … specifies the SerDe to use. Hive provides the WITH SERDEPROPERTIES feature that allows users to pass configuration information to the SerDe. Hive knows nothing about the meaning of these properties. It’s up to the SerDe to decide their meaning. Note that the name and value of each property must be a quoted string.

Finally, the STORED AS INPUTFORMAT … OUTPUTFORMAT … clause specifies the Java classes to use for the input and output formats, respectively. If you specify one of these formats, you are required to specify both of them.

Note that the DESCRIBE EXTENDED table command lists the input and output formats, the SerDe, and any SerDe properties in the DETAILED TABLE INFORMATION. For our example, we would see the following:

hive> DESCRIBE EXTENDED kst;
...
inputFormat:com.linkedin.haivvreo.AvroContainerInputFormat,
outputFormat:com.linkedin.haivvreo.AvroContainerOutputFormat,
...
serdeInfo:SerDeInfo(name:null,
serializationLib:com.linkedin.haivvreo.AvroSerDe,
  parameters:{schema.url=http://schema_provider/kst.avsc})
...

Finally, there are a few additional CREATE TABLE clauses that describe more details about how the data is supposed to be stored. Let’s extend our previous stocks table example from External Tables:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

The CLUSTERED BY … INTO … BUCKETS clause, with an optional SORTED BY … clause is used to optimize certain kinds of queries, which we discuss in detail in Bucketing Table Data Storage.

Dropping Tables

The familiar DROP TABLE command from SQL is supported:

DROP TABLE IF EXISTS employees;

The IF EXISTS keywords are optional. If not used and the table doesn’t exist, Hive returns an error.

For managed tables, the table metadata and data are deleted.

Note

Actually, if you enable the Hadoop Trash feature, which is not on by default, the data is moved to the .Trash directory in the distributed filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable this feature, set the property fs.trash.interval to a reasonable positive number. It’s the number of minutes between “trash checkpoints”; 1,440 would be 24 hours. While it’s not guaranteed to work for all versions of all distributed filesystems, if you accidentally drop a managed table with important data, you may be able to re-create the table, re-create any partitions, and then move the files from .Trash to the correct directories (using the filesystem commands) to restore the data.

For external tables, the metadata is deleted but the data is not.

Alter Table

Most table properties can be altered with ALTER TABLE statements, which change metadata about the table but not the data itself. These statements can be used to fix mistakes in schema, move partition locations (as we saw in External Partitioned Tables), and do other operations.

Warning

ALTER TABLE modifies table metadata only. The data for the table is untouched. It’s up to you to ensure that any modifications are consistent with the actual data.

Renaming a Table

Use this statement to rename the table log_messages to logmsgs:

ALTER TABLE log_messages RENAME TO logmsgs;

Adding, Modifying, and Dropping a Table Partition

As we saw previously, ALTER TABLE table ADD PARTITION … is used to add a new partition to a table (usually an external table). Here we repeat the same command shown previously with the additional options available:

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
...;

Multiple partitions can be added in the same query when using Hive v0.8.0 and later. As always, IF NOT EXISTS is optional and has the usual meaning.

Warning

Hive v0.7.X allows you to use the syntax with multiple partition specifications, but it actually uses just the first partition specification, silently ignoring the others! Instead, use a separate ALTER TABLE statement for each partition.

Similarly, you can change a partition location, effectively moving it:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

This command does not move the data from the old location, nor does it delete the old data.

Finally, you can drop a partition:

ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day = 2);

The IF EXISTS clause is optional, as usual. For managed tables, the data for the partition is deleted, along with the metadata, even if the partition was created using ALTER TABLE … ADD PARTITION. For external tables, the data is not deleted.

There are a few more ALTER statements that affect partitions discussed later in Alter Storage Properties and Miscellaneous Alter Table Statements.

Changing Columns

You can rename a column, change its position, type, or comment:

ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type is not changing. The keyword COLUMN is optional as is the COMMENT clause. If you aren’t moving the column, the AFTER other_column clause is not necessary. In the example shown, we move the column after the severity column. If you want to move the column to the first position, use FIRST instead of AFTER other_column.

As always, this command changes metadata only. If you are moving columns, the data must already match the new schema or you must change it to match by some other means.
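
For example, an illustrative variation on the statement above moves the severity column to the first position while keeping its name and type unchanged:

ALTER TABLE log_messages
CHANGE COLUMN severity severity STRING
COMMENT 'The message severity'
FIRST;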

Adding Columns

You can add new columns to the end of the existing columns, before any partition columns:

ALTER TABLE log_messages ADD COLUMNS (
 app_name   STRING COMMENT 'Application name',
 session_id BIGINT   COMMENT 'The current session id');

The COMMENT clauses are optional, as usual. If any of the new columns are in the wrong position, use an ALTER TABLE table CHANGE COLUMN statement for each one to move it to the correct position.

Deleting or Replacing Columns

The following example removes all the existing columns and replaces them with the new columns specified:

ALTER TABLE log_messages REPLACE COLUMNS (
 hours_mins_secs INT    COMMENT 'hour, minute, seconds from timestamp',
 severity        STRING COMMENT 'The message severity',
 message         STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and process_id columns from the original schema definition. As for all ALTER statements, only the table metadata is changed.

The REPLACE statement can only be used with tables that use one of the native SerDe modules: DynamicSerDe or MetadataTypedColumnsetSerDe. Recall that the SerDe determines how records are parsed into columns (deserialization) and how a record’s columns are written to storage (serialization). See Chapter 15 for more details on SerDes.

Alter Table Properties

You can add additional table properties or modify existing properties, but not remove them:

ALTER TABLE log_messages SET TBLPROPERTIES (
 'notes' = 'The process id is no longer captured; this column is always NULL');

Alter Storage Properties

There are several ALTER TABLE statements for modifying format and SerDe properties.

The following statement changes the storage format for a partition to be SEQUENCEFILE, as we discussed in Creating Tables (see Sequence Files and Chapter 15 for more information):

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

The PARTITION clause is required if the table is partitioned.

You can specify a new SerDe along with SerDe properties or change the properties for the existing SerDe. The following example specifies that a table will use a Java class named com.example.JSONSerDe to process a file of JSON-encoded records:

ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
 'prop1' = 'value1',
 'prop2' = 'value2');

The SERDEPROPERTIES are passed to the SerDe module (the Java class com.example.JSONSerDe, in this case). Note that both the property names (e.g., prop1) and the values (e.g., value1) must be quoted strings.

The SERDEPROPERTIES feature is a convenient mechanism that SerDe implementations can exploit to permit user customization. We’ll see a real-world example of a JSON SerDe and how it uses SERDEPROPERTIES in JSON SerDe.

The following example demonstrates how to add new SERDEPROPERTIES for the current SerDe:

ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
 'prop3' = 'value3',
 'prop4' = 'value4');

You can alter the storage properties that we discussed in Creating Tables:

ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;

The SORTED BY clause is optional, but the CLUSTERED BY and INTO … BUCKETS clauses are required. (See also Bucketing Table Data Storage for information on the use of data bucketing.)

Miscellaneous Alter Table Statements

In Execution Hooks, we’ll discuss a technique for adding execution “hooks” for various operations. The ALTER TABLE … TOUCH statement is used to trigger these hooks:

ALTER TABLE log_messages TOUCH
PARTITION(year = 2012, month = 1, day = 1);

The PARTITION clause is required for partitioned tables. A typical scenario for this statement is to trigger execution of the hooks when table storage files have been modified outside of Hive. For example, a script that has just written new files for the 2012/01/01 partition of log_messages can make the following call to the Hive CLI:

hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'

This statement won’t create the table or partition if it doesn’t already exist. Use the appropriate creation commands in that case.

The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any space savings (e.g., through compression):

ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

To reverse the operation, substitute UNARCHIVE for ARCHIVE. This feature is only available for individual partitions of partitioned tables.
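
For example, the following statement unpacks the partition archived above:

ALTER TABLE log_messages UNARCHIVE
PARTITION(year = 2012, month = 1, day = 1);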

Finally, various protections are available. The following statements prevent the partition from being dropped (NO_DROP) and from being queried (OFFLINE), respectively:

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;

To reverse either operation, replace ENABLE with DISABLE. These operations also can’t be used with nonpartitioned tables.
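
For example, to make the partition droppable and queryable again:

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE OFFLINE;

As with the other ALTER TABLE variations in this section, these statements change metadata only.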
