O'Reilly logo

Web Scraping with Python by Ryan Mitchell

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 6. Reading Documents

It is tempting to think of the Internet primarily as a collection of text-based websites interspersed with newfangled web 2.0 multimedia content that can mostly be ignored for the purposes of web scraping. However, this ignores what the Internet most fundamentally is: a content-agnostic vehicle for transmitting files.

Although the Internet has been around in some form or another since the late 1960s, HTML didn’t debut until 1992. Until then, the Internet consisted mostly of email and file transmission; the concept of web pages as we know them today didn’t really exist. In other words, the Internet is not a collection of HTML files. It is a collection of information, with HTML files often being used as a frame to showcase it. Without being able to read a variety of document types, including text, PDF, images, video, email, and more, we are missing out on a huge part of the available data.

This chapter covers dealing with documents, whether you’re downloading them to a local folder or reading them and extracting data. We’ll also take a look at dealing with various types of text encoding, which can make it possible to even read foreign-language HTML pages.

Document Encoding

A document’s encoding tells applications—whether they are your computer’s operating system or your own Python code—how to read it. This encoding can usually be deduced from its file extension, although this file extension is not mandated by its encoding. I could, for example, save ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required