Chapter 18. Security

To understand Hive security, we have to backtrack and understand Hadoop security and the history of Hadoop. Hadoop started out as a subproject of Apache Nutch. At that time and through its early formative years, features were prioritized over security. Security is more complex in a distributed system because multiple components across different machines need to communicate with each other.

Unsecured Hadoop, such as the versions before the v0.20.205 release, derived the username by forking a call to the whoami program. Users were free to override this identity by setting the hadoop.job.ugi property for FSShell (filesystem) commands. Map and reduce tasks all ran under the same system user (usually hadoop or mapred) on TaskTracker nodes. In addition, Hadoop components typically listen on high-numbered ports and are launched by nonprivileged users (i.e., users other than root).
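The identity lookup described above can be modeled in a few lines. The following Python sketch is an illustration only, not Hadoop's actual code: the function name resolve_user and the injectable whoami argument are hypothetical, and it assumes the hadoop.job.ugi value is a comma-separated string with the username first, followed by group names.

```python
import subprocess


def resolve_user(conf, whoami=None):
    """Model how an unsecured Hadoop client picks its username (simplified)."""
    ugi = conf.get("hadoop.job.ugi")
    if ugi:
        # Assumed format: "user,group1,group2"; the first field is the username.
        return ugi.split(",")[0].strip()
    if whoami is None:
        # Fall back to forking the operating system's whoami program.
        whoami = lambda: subprocess.check_output(["whoami"], text=True)
    return whoami().strip()


# Nothing verifies the claim, so any client can name any user.
claimed = resolve_user({"hadoop.job.ugi": "alice,supergroup"})
```

Because the property is read on the client side and never checked, the model shows why this scheme offers no real protection: the claimed name is simply whatever the caller supplies.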

The recent efforts to secure Hadoop involved several changes, primarily the incorporation of Kerberos authentication support, but also other changes to close vulnerabilities. Kerberos allows mutual authentication between client and server. The client obtains a ticket, which is passed along with each request. Tasks on the TaskTracker are run as the user who launched the job. Users are no longer able to impersonate other users by setting the hadoop.job.ugi property. For this to work, all Hadoop components must use Kerberos security from end to end.
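A toy model can make the contrast with the unsecured case concrete. In this Python sketch, which is not Hadoop's implementation, the hypothetical function effective_user assumes the server has already verified a Kerberos principal of the form primary/instance@REALM, and it ignores any client-supplied hadoop.job.ugi value when hadoop.security.authentication is set to kerberos. Real clusters map principals to local usernames with configurable rules; here the realm and instance are just stripped.

```python
def effective_user(conf, verified_principal=None):
    """Toy model: which username the cluster acts on for a request."""
    if conf.get("hadoop.security.authentication") == "kerberos":
        if verified_principal is None:
            raise PermissionError("no valid Kerberos credentials")
        # Strip "@REALM", then any "/instance" part, to get a short name.
        return verified_principal.split("@")[0].split("/")[0]
    # Unsecured mode: the client-supplied claim is trusted as-is.
    return conf.get("hadoop.job.ugi", "unknown").split(",")[0]


secure = {
    "hadoop.security.authentication": "kerberos",
    "hadoop.job.ugi": "hdfs,supergroup",  # impersonation attempt, ignored
}
user = effective_user(secure, "alice/host1.example.com@EXAMPLE.COM")
```

In the secure configuration the result depends only on the verified principal, so setting hadoop.job.ugi no longer changes who the job runs as.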

Hive was created before any of this Kerberos ...
