O'Reilly logo

Python: Advanced Predictive Analytics by Joseph Babcock, Ashish Kumar

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Introduction to PySpark

So far we've mainly focused on datasets that can fit on a single machine. For larger datasets, we may need to access them through distributed file systems such as Amazon S3 or HDFS. For this purpose, we can utilize the open-source distributed computing framework PySpark (http://spark.apache.org/docs/latest/api/python/). PySpark is a distributed computing framework that uses the abstraction of Resilient Distributed Datasets (RDDs) for parallel collections of objects, which allows us to programmatically access a dataset as if it fits on a single machine. In later chapters we will demonstrate how to build predictive models in PySpark, but for this introduction we focus on data manipulation functions in PySpark.

Creating the ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required