Example: Extracting Keywords from Articles

Let’s imagine that we have a list of many hundreds of articles, spread across many different web pages. We’d like to analyze the content of these articles and build a searchable database of them for ourselves. We’d like to store the content of each article, but not any extraneous text from the web page, such as header text or sidebar content. We’d also like to try to store some keywords for each article, so that we can search against those rather than the whole body of the text. When we’re finished, we’ll be able to list all the terms mentioned in an article and, by extension, all the articles that match a particular term.
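To make the end goal concrete, here is a minimal sketch of the two lookups described above: terms for an article, and articles for a term. It assumes the keywords have already been extracted somehow; the article titles, the keywords, and the names `ARTICLE_KEYWORDS`, `build_index`, and `articles_for` are all hypothetical, chosen only for illustration.

```ruby
require "set"

# Hypothetical sample data: article titles mapped to keywords that we
# assume were already extracted (the extraction itself is not shown here).
ARTICLE_KEYWORDS = {
  "Intro to Ruby"       => ["ruby", "strings"],
  "Parsing HTML"        => ["html", "ruby"],
  "Regular Expressions" => ["regex", "strings"]
}

# Build an inverted index: each term maps to the set of articles
# that mention it.
def build_index(article_keywords)
  index = Hash.new { |hash, term| hash[term] = Set.new }
  article_keywords.each do |title, keywords|
    keywords.each { |term| index[term.downcase] << title }
  end
  index
end

# Look up the articles for a term, sorted by title. Using fetch with an
# explicit default means unknown terms return an empty list without
# adding new keys to the index.
def articles_for(index, term)
  index.fetch(term.downcase, Set.new).to_a.sort
end

INDEX = build_index(ARTICLE_KEYWORDS)
```

With this sample data, `ARTICLE_KEYWORDS["Parsing HTML"]` lists the terms for one article, while `articles_for(INDEX, "ruby")` returns `["Intro to Ruby", "Parsing HTML"]`. Storing the index as a hash of sets is only one possible design; a real database would likely persist both mappings.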

This problem neatly covers two areas of language processing. ...
