CHAPTER 4 Security

In this chapter, we will cover the security architecture of Spark. Tools in the Hadoop ecosystem, including Spark, are often operated in multi-tenant environments, which means that a cluster is shared by more than one user. Your company might have several departments that use Spark for their own purposes, and building a separate cluster for each department is often wasteful. Sharing one cluster is therefore common in enterprise usage, because it saves time and money.

But there are a few issues to note here:

  • Data security—Spark clusters store various types of company data, such as user activity logs, purchase logs, and access logs. Some of this data can be accessed by everyone, and some must be restricted to certain users. To protect your user data, you must control access to the stored data.
  • Job security—Even if data access is controlled by some mechanism, that control is wasted if anyone can submit any type of job to the Spark cluster. Since each job can access data storage, it is also necessary to manage authentication and access control lists (ACLs) for submitting jobs. In addition, Spark jobs publish metrics through the Spark web UI and API. Access to these must also be restricted.
  • Network security—As is often the case with web applications, access can be controlled by host IP address and port number. Otherwise, anyone can attack a Spark cluster, even from outside the network. In order to manage a firewall to protect a Spark cluster, you must ...
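
The job and network concerns in the list above map onto a handful of Spark configuration properties. The following is a minimal sketch rather than a recommended setup: the helper functions, the user names, and the port numbers 7078 and 7079 are assumptions made for illustration, while the property names themselves are standard Spark settings. Data security is usually enforced by the storage layer and is not covered here.

```python
def build_security_conf(admins, viewers, modifiers):
    """Return Spark properties for job and network security (illustrative helper)."""
    return {
        # Job security: require Spark processes to authenticate to each other.
        "spark.authenticate": "true",
        # Job security: enable ACL checks for the web UI and for job control.
        "spark.acls.enable": "true",
        "spark.admin.acls": ",".join(admins),
        "spark.ui.view.acls": ",".join(viewers),
        "spark.modify.acls": ",".join(modifiers),
        # Network security: pin ports so that firewall rules can name them.
        # 4040 is the default UI port; 7078 and 7079 are arbitrary examples.
        "spark.ui.port": "4040",
        "spark.driver.port": "7078",
        "spark.blockManager.port": "7079",
    }


def to_submit_args(conf):
    """Render the properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


# Hypothetical users: alice administers the cluster, bob may only view the UI.
conf = build_security_conf(["alice"], ["alice", "bob"], ["alice"])
print(" ".join(to_submit_args(conf)))
```

The printed arguments could be appended to a spark-submit command line, or the same key and value pairs could be placed in spark-defaults.conf. Pinning the ports matters because Spark otherwise picks random ports for the driver and block manager, which makes firewall rules hard to write.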
