Natural Language Processing Using Apache Spark

In this chapter, we'll study and implement common NLP algorithms, which can help us build machines that automatically analyze and understand human text and speech in context. Specifically, we will study and implement the following classes of NLP algorithms:

  • Feature transformers, including the following:
    • Tokenization
    • Stemming
    • Lemmatization
    • Normalization
  • Feature extractors, including the following:
    • Bag of words
    • Term frequency–inverse document frequency
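
To make the two feature extractors listed above concrete, here is a minimal sketch in plain Python rather than Spark. It tokenizes text, builds a bag of words, and weights the counts by inverse document frequency. The function names (`tokenize`, `bag_of_words`, `tf_idf`) and the sample documents are hypothetical and chosen only for illustration. The smoothed IDF, log((N + 1) / (df + 1)), is assumed here because it matches the formula given in the Spark MLlib documentation; other variants exist.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumption: raw term counts are used as term frequency, and IDF is
    smoothed as log((N + 1) / (df + 1)), where N is the number of
    documents and df is the number of documents containing the term.
    """
    bags = [bag_of_words(tokenize(doc)) for doc in documents]
    n_docs = len(bags)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    idf = {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in bag.items()}
        for bag in bags
    ]


# Hypothetical sample documents.
docs = ["Spark makes big data simple", "Big data needs big tools"]
weights = tf_idf(docs)
```

With this smoothing, a term that occurs in every document (such as "big" in the sample) gets a weight of zero, while a term unique to one document (such as "spark") keeps a positive weight, which is the intuition behind preferring TF-IDF over raw counts.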
