Mastering Spark for Structured Streaming

by Tianhui Michael Li

Released November 2016

Publisher(s): O'Reilly Media, Inc.

ISBN: 9781491974438

Video description

Spark is one of today’s most popular distributed computation engines for processing and analyzing big data. This course provides data engineers, data scientists, and data analysts interested in exploring the technology of data streaming with practical experience in using Spark. You’ll learn about the Spark Structured Streaming API, the powerful Catalyst query optimizer, the Tungsten execution engine, and more in this hands-on course where you’ll build several small applications that leverage all the aspects of Spark 2.0. Scala experience is not required, but the course works best for those who have some.

Understand the main features of Spark and its advantages over existing systems
Learn the basics of parallelism, streaming computation, and Spark Streaming
Explore the distinctions between Spark Structured Streaming and legacy DStream APIs
Understand how to write programs that use the Spark Structured Streaming API
Learn about the new Catalyst query optimizer and the Tungsten execution engine
Discover how Scala and Spark Structured Streaming simplify distributed streaming tasks
Gain hands-on experience building applications using Spark 2.0

Michael Li is the founder of The Data Incubator, which provides big data corporate training and a selective eight-week fellowship for PhDs transitioning into industry. Previously, he worked as a data scientist, software engineer, and researcher at Foursquare, Google, Andreessen Horowitz, J.P. Morgan, and NASA. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review. Michael earned his Ph.D. at Princeton and was a Marshall Scholar in Cambridge.