The Right Tool for the Job: Nokogiri

Novices, and often experts too, approach the problem of extracting data from HTML with regular expressions. (If you’re not familiar with regular expressions, we’ll be covering them in detail in Part II.) This is a truly terrible idea. HTML is nested, tags may or may not be closed, and attributes can appear in any order and with any quoting, so a pattern that works on one page tends to break on the next. The resulting pain can make people avoid scraping HTML again, which is a great shame, since knowing how to extract information from web pages is of huge practical benefit.

The right tool to use is an HTML parser, which will parse the document into a tree and allow you to search and manipulate the nodes within that tree. A few libraries are available, but one called Nokogiri ...

