Case study: fitting classifier models in pyspark

Now that we have examined several algorithms for fitting classifier models in the scikit-learn library, let us look at how we might implement a similar model in PySpark. We can use the same census dataset from earlier in this chapter, and start by loading the data using a textRdd after starting the spark context:

>>> censusRdd = sc.textFile('census.data')

Next we need to split the data into individual fields, and strip whitespace

>>> censusRddSplit = censusRdd.map(lambda x: [e.strip() for e in x.split(',')])

Now, as before, we need to determine which of our features are categorical and need to be re-encoded using one-hot encoding. We do this by taking a single row and asking whether the string in ...

Get Mastering Predictive Analytics with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.