
Python Web Scraping Cookbook

Book Description

Untangle your web scraping complexities and access web data with ease using Python scripts

About This Book

  • Hands-on recipes for advancing your web scraping skills to expert level
  • A one-stop solution guide to address complex and challenging web scraping tasks using Python
  • Understand web page structure and collect meaningful data from websites with ease

Who This Book Is For

This book is ideal for Python programmers, web administrators, security professionals, and anyone who wants to perform web analytics. Familiarity with Python and a basic understanding of web scraping will help you take full advantage of this book.

What You Will Learn

  • Use a wide variety of tools to scrape any website and data, including BeautifulSoup, Scrapy, Selenium, and many more
  • Master expression languages such as XPath, CSS, and regular expressions to extract web data
  • Deal with scraping traps such as hidden form fields, throttling, pagination, and different status codes
  • Build robust scraping pipelines with SQS and RabbitMQ
  • Scrape assets such as images and media, and know what to do when your scraper fails to run
  • Explore ETL techniques to build a customized crawler and parser, and convert structured and unstructured data from websites
  • Deploy and run your scraper-as-a-service in AWS Elastic Container Service
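As a small taste of the expression languages mentioned above, a regular expression can pull structured pairs straight out of raw markup. This is a minimal sketch; the markup string is a hypothetical stand-in for the text of a fetched page:

```python
import re

# Hypothetical raw markup, as an HTTP client might return it.
markup = '<a class="ref" href="/page/1">One</a><a class="ref" href="/page/2">Two</a>'

# Capture every (href, link text) pair with two groups.
pairs = re.findall(r'href="([^"]+)">([^<]+)</a>', markup)
print(pairs)  # [('/page/1', 'One'), ('/page/2', 'Two')]
```

Regular expressions work on the raw string; XPath and CSS selectors, covered in the same recipes, instead query the parsed DOM tree and tend to be more robust against markup changes.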

In Detail

Python Web Scraping Cookbook is a solution-focused book that will teach you techniques to develop high-performance scrapers and deal with crawlers, sitemaps, forms automation, Ajax-based sites, caches, and more. You'll explore a number of real-world scenarios where every part of the development/product life cycle is fully covered. You will not only develop the skills to design and build reliable, performant data flows, but also learn to deploy your codebase to AWS. If you are involved in software engineering, product development, or data mining (or are interested in building data-driven products), you will find this book useful, as each recipe has a clear purpose and objective.

Right from extracting data from websites to writing a sophisticated web crawler, the book's independent recipes will be a godsend on the job. This book covers Python libraries such as requests and BeautifulSoup. You will learn about crawling, web spidering, working with AJAX websites, handling paginated items, and more. You will also learn to tackle problems such as 403 errors, working with proxies, scraping images, and using lxml.
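The requests-plus-BeautifulSoup workflow mentioned here can be sketched as follows. To keep the example self-contained, the HTML is an inline hypothetical stand-in; in practice it would come from `requests.get(url).text`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched job-listings page.
html_doc = """
<html><body>
  <div class="job"><h2>Python Developer</h2><span class="loc">Remote</span></div>
  <div class="job"><h2>Data Engineer</h2><span class="loc">Berlin</span></div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (title, location) pairs by navigating each job <div>.
jobs = [(div.h2.get_text(), div.find("span", class_="loc").get_text())
        for div in soup.find_all("div", class_="job")]
print(jobs)  # [('Python Developer', 'Remote'), ('Data Engineer', 'Berlin')]
```

The book's early recipes build on exactly this pattern before moving to Scrapy and Selenium for larger crawls and JavaScript-heavy sites.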

By the end of this book, you will be able to scrape websites more efficiently and to deploy and operate your scraper in the cloud.

Style and approach

This book is a rich collection of recipes that will come in handy when you are scraping websites using Python.

Addressing your common and not-so-common pain points while scraping websites, this is a book you must have on your shelf.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Conventions used
    4. Get in touch
      1. Reviews
  2. Getting Started with Scraping
    1. Introduction
    2. Setting up a Python development environment 
      1. Getting ready
      2. How to do it...
    3. Scraping Python.org with Requests and Beautiful Soup
      1. Getting ready...
      2. How to do it...
      3. How it works...
    4. Scraping Python.org in urllib3 and Beautiful Soup
      1. Getting ready...
      2. How to do it...
      3. How it works
      4. There's more...
    5. Scraping Python.org with Scrapy
      1. Getting ready...
      2. How to do it...
      3. How it works
    6. Scraping Python.org with Selenium and PhantomJS
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
  3. Data Acquisition and Extraction
    1. Introduction
    2. How to parse websites and navigate the DOM using BeautifulSoup
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
    3. Searching the DOM with Beautiful Soup's find methods
      1. Getting ready
      2. How to do it...
    4. Querying the DOM with XPath and lxml
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
    5. Querying data with XPath and CSS selectors
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
    6. Using Scrapy selectors
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
    7. Loading data in unicode / UTF-8
      1. Getting ready
      2. How to do it...
      3. How it works
      4. There's more...
  4. Processing Data
    1. Introduction
    2. Working with CSV and JSON data
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    3. Storing data using AWS S3
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    4. Storing data using MySQL
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    5. Storing data using PostgreSQL
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    6. Storing data in Elasticsearch
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    7. How to build robust ETL pipelines with AWS SQS
      1. Getting ready
      2. How to do it - posting messages to an AWS queue
      3. How it works
      4. How to do it - reading and processing messages
      5. How it works
      6. There's more...
  5. Working with Images, Audio, and other Assets
    1. Introduction
    2. Downloading media content from the web
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    3.  Parsing a URL with urllib to get the filename
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    4. Determining the type of content for a URL 
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    5. Determining the file extension from a content type
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    6. Downloading and saving images to the local file system
      1. How to do it
      2. How it works
      3. There's more...
    7. Downloading and saving images to S3
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    8.  Generating thumbnails for images
      1. Getting ready
      2. How to do it
      3. How it works
    9. Taking a screenshot of a website
      1. Getting ready
      2. How to do it
      3. How it works
    10. Taking a screenshot of a website with an external service
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    11. Performing OCR on an image with pytesseract
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    12. Creating a Video Thumbnail
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    13. Ripping an MP4 video to an MP3
      1. Getting ready
      2. How to do it
      3. There's more...
  6. Scraping - Code of Conduct
    1. Introduction
    2. Scraping legality and scraping politely
      1. Getting ready
      2. How to do it
    3. Respecting robots.txt
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    4. Crawling using the sitemap
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    5. Crawling with delays
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    6. Using identifiable user agents 
      1. How to do it
      2. How it works
      3. There's more...
    7. Setting the number of concurrent requests per domain
      1. How it works
    8. Using auto throttling
      1. How to do it
      2. How it works
      3. There's more...
    9. Using an HTTP cache for development
      1. How to do it
      2. How it works
      3. There's more...
  7. Scraping Challenges and Solutions
    1. Introduction
    2. Retrying failed page downloads
      1. How to do it
      2. How it works
    3. Supporting page redirects
      1. How to do it
      2. How it works
    4. Waiting for content to be available in Selenium
      1. How to do it
      2. How it works
    5. Limiting crawling to a single domain
      1. How to do it
      2. How it works
    6. Processing infinitely scrolling pages
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    7. Controlling the depth of a crawl
      1. How to do it
      2. How it works
    8. Controlling the length of a crawl
      1. How to do it
      2. How it works
    9. Handling paginated websites
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    10. Handling forms and forms-based authorization
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    11. Handling basic authorization
      1. How to do it
      2. How it works
      3. There's more...
    12. Preventing bans by scraping via proxies
      1. Getting ready
      2. How to do it
      3. How it works
    13. Randomizing user agents
      1. How to do it
    14. Caching responses
      1. How to do it
      2. There's more...
  8. Text Wrangling and Analysis
    1. Introduction
    2. Installing NLTK
      1. How to do it
    3. Performing sentence splitting
      1. How to do it
      2. There's more...
    4. Performing tokenization
      1. How to do it
    5. Performing stemming
      1. How to do it
    6. Performing lemmatization
      1. How to do it
    7. Determining and removing stop words
      1. How to do it
      2. There's more...
    8. Calculating the frequency distributions of words
      1. How to do it
      2. There's more...
    9. Identifying and removing rare words
      1. How to do it
    11. Removing punctuation marks
      1. How to do it
      2. There's more...
    12. Piecing together n-grams
      1. How to do it
      2. There's more...
    13. Scraping a job listing from StackOverflow 
      1. Getting ready
      2. How to do it
      3. There's more...
    14. Reading and cleaning the description in the job listing
      1. Getting ready
      2. How to do it...
  9. Searching, Mining and Visualizing Data
    1. Introduction
    2. Geocoding an IP address
      1. Getting ready
      2. How to do it
    3. How to collect IP addresses of Wikipedia edits
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    4. Visualizing contributor location frequency on Wikipedia
      1. How to do it
    5. Creating a word cloud from a StackOverflow job listing
      1. Getting ready
      2. How to do it
    6. Crawling links on Wikipedia
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    7. Visualizing page relationships on Wikipedia
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    8. Calculating degrees of separation
      1. How to do it
      2. How it works
      3. There's more...
  10. Creating a Simple Data API
    1. Introduction
    2. Creating a REST API with Flask-RESTful
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    3. Integrating the REST API with scraping code
      1. Getting ready
      2. How to do it
    4. Adding an API to find the skills for a job listing
      1. Getting ready
      2. How to do it
    5. Storing data in Elasticsearch as the result of a scraping request
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    6. Checking Elasticsearch for a listing before scraping
      1. How to do it
      2. There's more...
  11. Creating Scraper Microservices with Docker
    1. Introduction
    2. Installing Docker
      1. Getting ready
      2. How to do it
    3. Installing a RabbitMQ container from Docker Hub
      1. Getting ready
      2. How to do it
    4. Running a Docker container (RabbitMQ)
      1. Getting ready
      2. How to do it
      3. There's more...
    5. Creating and running an Elasticsearch container
      1. How to do it
    6. Stopping/restarting a container and removing the image
      1. How to do it
      2. There's more...
    7. Creating a generic microservice with Nameko
      1. Getting ready
      2. How to do it
      3. How it works
      4. There's more...
    8. Creating a scraping microservice
      1. How to do it
      2. There's more...
    9. Creating a scraper container
      1. Getting ready
      2. How to do it
      3. How it works
    10. Creating an API container
      1. Getting ready
      2. How to do it
      3. There's more...
    11. Composing and running the scraper locally with docker-compose
      1. Getting ready
      2. How to do it
      3. There's more...
  12. Making the Scraper as a Service Real
    1. Introduction
    2. Creating and configuring an Elastic Cloud trial account
      1. How to do it
    3. Accessing the Elastic Cloud cluster with curl
      1. How to do it
    4. Connecting to the Elastic Cloud cluster with Python
      1. Getting ready
      2. How to do it
      3. There's more...
    5. Performing an Elasticsearch query with the Python API 
      1. Getting ready
      2. How to do it
      3. There's more...
    6. Using Elasticsearch to query for jobs with specific skills
      1. Getting ready
      2. How to do it
    7. Modifying the API to search for jobs by skill
      1. How to do it
      2. How it works
      3. There's more...
    8. Storing configuration in the environment 
      1. How to do it
    9. Creating an AWS IAM user and a key pair for ECS
      1. Getting ready
      2. How to do it
    10. Configuring Docker to authenticate with ECR
      1. Getting ready
      2. How to do it
    11. Pushing containers into ECR
      1. Getting ready
      2. How to do it
    12. Creating an ECS cluster
      1. How to do it
    13. Creating a task to run our containers
      1. Getting ready
      2. How to do it
      3. How it works
    14. Starting and accessing the containers in AWS
      1. Getting ready
      2. How to do it
      3. There's more...
  13. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think