PySpark and Jupyter Notebook

Let's now integrate Jupyter Notebook with PySpark so that we can write our first Spark applications in Python! In our local development environment, the easiest way to integrate Jupyter Notebook with PySpark is to set a global SPARK_HOME environment variable that points to the directory containing the Spark binaries. We can then use the findspark Python package, installed earlier, which appends the location of SPARK_HOME, and hence the PySpark API, to sys.path at runtime. Note that findspark should not be used for production-grade code development; instead, Spark applications should be packaged as code artifacts and submitted via spark-submit.
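To illustrate the mechanism, the following is a minimal sketch of a Jupyter cell, assuming Spark is unpacked under /opt/spark (an illustrative path, not necessarily the one used in this book) and that the findspark and pyspark packages are available to the notebook kernel:

import os
import findspark

# Point SPARK_HOME at the Spark installation directory before calling
# findspark; /opt/spark is an assumed location used here for illustration.
os.environ.setdefault("SPARK_HOME", "/opt/spark")

# findspark appends $SPARK_HOME/python (and the bundled py4j library) to
# sys.path, making the PySpark API importable inside the notebook.
findspark.init()

from pyspark.sql import SparkSession

# Start a local SparkSession to confirm that the integration works.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("jupyter-pyspark-check")
         .getOrCreate())
print(spark.version)
spark.stop()

In practice, SPARK_HOME would normally be exported once in the shell profile rather than set from inside the notebook; the os.environ line above simply makes the sketch self-contained.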

Please execute the following shell commands ...
