Chapter 9. Performance and monitoring

This chapter covers

  • Monitoring Spark applications
  • Performance-related configuration options
  • Tuning your application for maximum performance
  • Using graph partitioning to boost large-scale processing

Most of the examples we’ve looked at so far have been small-scale: they would run on one machine and complete their processing without requiring a large amount of computing resources. But one of the key reasons to use Apache Spark is its distributed processing model. Spark’s ability to spread data and computation across a cluster of many machines is what lets it run the kinds of processing we’ve discussed on large datasets.

Once you have a cluster with plenty of resources ...
