Scaling out with PySpark – predicting year of song release

To close, let us look at another example using PySpark. With this dataset, which is a subset of the Million Song dataset (Bertin-Mahieux, Thierry, et al. "The million song dataset." ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida. University of Miami, 2011), the goal is to predict the year of a song's release based on the features of the track. The data is supplied as a comma-separated text file, which we can convert into an RDD using the Spark textFile() function. As before in our clustering example, we also define a parsing function with a try…catch block so that we do not fail on a single error ...

Get Mastering Predictive Analytics with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.