Unit 9: Reaching the Web

According to WorldWideWebSize,[7] the indexed web contains at least 4.85 billion pages. Some of them may be of interest to us. The module urllib.request contains functions for downloading data from the web. While it may be feasible (though not advisable) to download a single data set by hand, save it into the cache directory, and then analyze it using Python scripts, some data analysis projects call for automated iterative or recursive downloads.

The first step toward getting anything off the web is to open the URL with the function urlopen(url) and obtain the open URL handle. Once opened, the URL handle is similar to a read-only open file handle: you can use the functions read, readline, and readlines to access the ...
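The open-then-read pattern described above can be sketched as follows. This is a minimal illustration, not the book's own listing: the helper names fetch_text and fetch_lines, the 10-second timeout, and the UTF-8 fallback encoding are all assumptions made for the example. Only urlopen, read, and readlines come from the text.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8", timeout=10):
    """Open a URL, read the whole body, and return it as a string.

    fetch_text is a hypothetical helper; the timeout and the fallback
    encoding are assumptions, not values taken from the text.
    """
    # The handle returned by urlopen behaves like a read-only binary file.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not str
        # Use the charset declared by the server, if there is one.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset)


def fetch_lines(url, default_encoding="utf-8", timeout=10):
    """Open a URL and return its body as a list of decoded lines."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or default_encoding
        # readlines() yields bytes lines that still end with a line break.
        return [line.decode(charset).rstrip("\r\n")
                for line in response.readlines()]
```

A call such as fetch_lines("https://www.example.com/") would return the page source one line per list item. Because the handle yields bytes, decoding is left explicit here; a real project would probably also want to catch urllib.error.URLError for unreachable hosts.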
