Summary

  • Modeling the linguistic data found in corpora can help us to understand linguistic patterns, and can be used to make predictions about new language data.

  • Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.

  • Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and determining entailment relations, and many other tasks.

  • When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model, a dev-test set for helping select and tune the model’s features, and a test set for evaluating the final model’s performance.

  • When evaluating a supervised classifier, it is important that you use fresh data that was not included in the training or dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.

  • Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they’re easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.

  • In naive Bayes classifiers, each feature independently contributes to the decision of which label should be used. This allows feature values to interact, but can be problematic when two or more features are ...

Get Natural Language Processing with Python now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.