Chapter 6. Scraping HTML

As developers, we don’t always find the text we need to process in an easy-to-read plain-text format, nor as the searchable, easy-to-digest result of some well-designed third-party API. Sometimes it exists only on web pages, buried among a tangle of messy HTML. Extracting it can seem a daunting task: how do we cut through the clutter and get to just the information that we want?

Luckily, Ruby, with its good HTTP support and powerful text-processing capabilities, is a great choice for this sort of task, known as screen scraping or web scraping. Let’s look at how.
