In the previous section, we saw how to use the interactive pyspark shell to explore the Spark Python API. In this section, we will write a simple Python program and run it on the Spark cluster, which is how applications are typically run in real-world scenarios.
In order to do this, we will write a program called MyFirstApp.py with the following contents:
[hive@node-3 ~]$ cat MyFirstApp.py
from pyspark.sql import SparkSession

# Path to the file in HDFS
csvFile = "employees.csv"

# Create a session for this application
spark = SparkSession.builder.appName("MyFirstApp").getOrCreate()

# Read the CSV file
csvTable = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load(csvFile)
...
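The listing above is truncated. As a minimal sketch of how the remainder of MyFirstApp.py might continue (the original program's further processing is not shown here), the application could simply inspect the loaded DataFrame and then stop the session:

# Print the inferred schema and a few rows to confirm the file was read correctly
csvTable.printSchema()
csvTable.show(5)

# Stop the session so the application releases its cluster resources
spark.stop()

Once the file is complete, the program can be submitted to the cluster with spark-submit, for example spark-submit MyFirstApp.py (adding options such as --master yarn as appropriate for how the cluster is configured).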