NLP in Apache Spark

As of Spark 2.3.2, feature transformers for tokenization and stop-word removal (among a wide variety of others) and the TF–IDF feature extractor are available natively in MLlib. Stemming, lemmatization, and standardization can be achieved indirectly in Spark 2.3.2 through transformations on Spark dataframes, via user-defined functions (UDFs) and map functions applied to RDDs. However, we will use a third-party Spark library called spark-nlp to perform these feature transformations. This library is designed to extend the features already available in MLlib by providing an easy-to-use API for distributed NLP annotations on Spark dataframes at scale. To learn more about spark-nlp, please visit ...
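To make the three natively supported steps concrete, here is a minimal plain-Python sketch of what tokenization, stop-word removal, and TF–IDF weighting compute on a toy corpus. It is not Spark code and does not use the MLlib or spark-nlp APIs. The function names, the tiny stop-word list, and the sample sentences are all assumptions made for illustration. The sketch uses raw term counts for term frequency (MLlib's HashingTF hashes terms into buckets instead) and the smoothed inverse document frequency log((N + 1) / (df + 1)) described in the MLlib documentation.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# MLlib's StopWordsRemover ships with a much longer default list.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the raw count of the term in the document.
    Inverse document frequency is log((N + 1) / (df + 1)), where N is the
    number of documents and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in docs
    ]


# Toy corpus (made up for this sketch).
corpus = [
    "Spark is fast",
    "Spark runs in the cluster",
    "The cluster is large",
]
cleaned = [remove_stop_words(tokenize(doc)) for doc in corpus]
weights = tf_idf(cleaned)
```

In this toy corpus, a term that occurs in only one document (such as "fast") receives a larger weight than a term shared by two documents (such as "spark"), which is the behaviour TF–IDF is meant to capture. In Spark itself, the same steps would run as distributed transformations over a dataframe column rather than over Python lists.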
