Chapter 12. Optimizations and Performance Tuning

This chapter covers various optimizations and performance-tuning best practices when working with Spark.

The chapter is divided into the following recipes:

  • Optimizing memory
  • Using compression to improve performance
  • Using serialization to improve performance
  • Optimizing garbage collection
  • Optimizing the level of parallelism
  • Understanding the future of optimization – Project Tungsten

Introduction

Before looking into various ways to optimize Spark, it is a good idea to look at the Spark internals. So far, we have looked at Spark at a higher level, where the focus was the functionality provided by the various libraries.

Let's start with redefining an RDD. Externally, an RDD is a distributed immutable collection of objects. ...
