Spark v2.0 and beyond

Spark v2.0 and later releases have been the catalyst for a renaissance in data science. Datasets, DataFrames, ML pipelines, and new and improved algorithms in MLlib have paved the way for data wrangling at scale. I think version 2.0 marks the point where Spark turned into a mature framework. It could handle huge workloads, both in the number of machines and in the volume of data. The community update at the Spark Summit 2015 in San Francisco included a slide that showed the power of Spark:

  • The largest cluster: 8,000 nodes (Tencent)
  • The largest single job: 1 petabyte and more (Alibaba and Tencent)
  • The longest-running job: 1 petabyte and more for a week (Alibaba)
  • The top streaming intake: 1 terabyte/hour (Janelia Farm)
  • The largest shuffle: 1 ...
