Learning Scrapy

Book Description

Learn the art of efficient web scraping and crawling with Python

About This Book

  • Extract data from any source to perform real-time analytics.

  • Full of techniques and examples to help you crawl websites and extract data within hours.

  • A hands-on guide to web scraping and crawling with real-life problems and solutions.

    Who This Book Is For

    If you are a software developer, data scientist, NLP or machine-learning enthusiast, or just need to migrate your company's wiki from a legacy platform, then this book is for you. It is perfect for anyone who needs instant, effortless access to large amounts of semi-structured data.

    What You Will Learn

  • Understand HTML pages and write XPath to extract the data you need

  • Write Scrapy spiders with simple Python and do web crawls (see the sketch after this list)

  • Push your data into any database, search engine or analytics system

  • Configure your spider to download files and images, and to use proxies

  • Create efficient pipelines that shape data in precisely the form you want

  • Use the Twisted asynchronous API to process hundreds of items concurrently

  • Make your crawler super-fast by learning how to tune Scrapy's performance

  • Perform large-scale distributed crawls with scrapyd and Scrapinghub
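
    As a taste of the first two skills on this list, here is a minimal spider sketch in the spirit of the book's examples. The spider name, start URL, and page markup are hypothetical stand-ins, not taken from the book:

        import scrapy

        class DemoSpider(scrapy.Spider):
            # Hypothetical name and start URL, for illustration only.
            name = 'demo'
            start_urls = ['http://example.com/catalog']

            def parse(self, response):
                # Select each product block with XPath and
                # yield one scraped item per block.
                for product in response.xpath('//div[@class="product"]'):
                    yield {
                        'title': product.xpath('.//h3/text()').extract_first(),
                        'price': product.xpath(
                            './/span[@class="price"]/text()').extract_first(),
                    }

    A standalone spider like this can be run with scrapy runspider demo_spider.py -o items.json, which also exercises Scrapy's built-in feed exports.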

    In Detail

    This book covers the long-awaited Scrapy v1.0, which empowers you to extract useful data from virtually any source with very little effort. It starts off by explaining the fundamentals of the Scrapy framework, followed by a thorough description of how to extract data from any source, clean it up, and shape it to your requirements using Python and third-party APIs. Next, you will be familiarised with the process of storing the scraped data in databases as well as search engines, and performing real-time analytics on it with Spark Streaming. By the end of this book, you will have perfected the art of scraping data for your applications with ease.
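
    The cleaning and shaping steps mentioned above typically live in item pipelines. The following is a minimal sketch, assuming items carry a 'price' field scraped as text; the field name and the 'myproject' module path are hypothetical:

        class CleanupPipeline(object):
            """Sketch: normalize a 'price' field before the item is stored."""

            def process_item(self, item, spider):
                price = item.get('price')
                if price:
                    # Strip currency symbols and thousands separators,
                    # then convert the price to a float.
                    item['price'] = float(price.lstrip('$').replace(',', '').strip())
                return item

    A pipeline is enabled in settings.py, for example ITEM_PIPELINES = {'myproject.pipelines.CleanupPipeline': 300}; the number controls the order in which pipelines run.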

    Style and approach

    It is a hands-on guide, with the first few chapters written as a tutorial, aiming to motivate you and get you started quickly. As the book progresses, more advanced features are explained with real-world examples that you can refer back to while developing your own web applications.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account on Packt's website. If you purchased this book elsewhere, you can visit Packt's support page and register to have the code files sent to you.

    Table of Contents

    1. Learning Scrapy
      1. Table of Contents
      2. Learning Scrapy
      3. Credits
      4. About the Author
      5. About the Reviewer
      6. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introducing Scrapy
        1. Hello Scrapy
        2. More reasons to love Scrapy
        3. About this book: aim and usage
        4. The importance of mastering automated data scraping
          1. Developing robust, quality applications, and providing realistic schedules
          2. Developing quality minimum viable products quickly
          3. Scraping gives you scale; Google couldn't use forms
          4. Discovering and integrating into your ecosystem
        5. Being a good citizen in a world full of spiders
        6. What Scrapy is not
        7. Summary
      9. 2. Understanding HTML and XPath
        1. HTML, the DOM tree representation, and the XPath
          1. The URL
          2. The HTML document
          3. The tree representation
          4. What you see on the screen
        2. Selecting HTML elements with XPath
          1. Useful XPath expressions
          2. Using Chrome to get XPath expressions
          3. Examples of common tasks
          4. Anticipating changes
        3. Summary
      10. 3. Basic Crawling
        1. Installing Scrapy
          1. MacOS
          2. Windows
          3. Linux
            1. Ubuntu or Debian Linux
            2. Red Hat or CentOS Linux
          4. From the latest source
          5. Upgrading Scrapy
          6. Vagrant: this book's official way to run examples
        2. UR2IM – the fundamental scraping process
          1. The URL
          2. The request and the response
          3. The Items
        3. A Scrapy project
          1. Defining items
          2. Writing spiders
          3. Populating an item
          4. Saving to files
          5. Cleaning up – item loaders and housekeeping fields
          6. Creating contracts
        4. Extracting more URLs
          1. Two-direction crawling with a spider
          2. Two-direction crawling with a CrawlSpider
        5. Summary
      11. 4. From Scrapy to a Mobile App
        1. Choosing a mobile application framework
        2. Creating a database and a collection
        3. Populating the database with Scrapy
        4. Creating a mobile application
          1. Creating a database access service
          2. Setting up the user interface
          3. Mapping data to the User Interface
          4. Mappings between database fields and User Interface controls
          5. Testing, sharing, and exporting your mobile app
        5. Summary
      12. 5. Quick Spider Recipes
        1. A spider that logs in
        2. A spider that uses JSON APIs and AJAX pages
          1. Passing arguments between responses
        3. A 30-times faster property spider
        4. A spider that crawls based on an Excel file
        5. Summary
      13. 6. Deploying to Scrapinghub
        1. Signing up, signing in, and starting a project
        2. Deploying our spiders and scheduling runs
        3. Accessing our items
        4. Scheduling recurring crawls
        5. Summary
      14. 7. Configuration and Management
        1. Using Scrapy settings
        2. Essential settings
          1. Analysis
            1. Logging
            2. Stats
            3. Telnet
              1. Example 1 – using telnet
          2. Performance
          3. Stopping crawls early
          4. HTTP caching and working offline
            1. Example 2 – working offline by using the cache
          5. Crawling style
          6. Feeds
          7. Downloading media
            1. Other media
              1. Example 3 – downloading images
          8. Amazon Web Services
          9. Using proxies and crawlers
            1. Example 4 – using proxies and Crawlera's clever proxy
        3. Further settings
          1. Project-related settings
          2. Extending Scrapy settings
          3. Fine-tuning downloading
          4. Autothrottle extension settings
          5. Memory usage extension settings
          6. Logging and debugging
        4. Summary
      15. 8. Programming Scrapy
        1. Scrapy is a Twisted application
          1. Deferreds and deferred chains
          2. Understanding Twisted and nonblocking I/O – a Python tale
        2. Overview of Scrapy architecture
          1. Example 1 – a very simple pipeline
        3. Signals
          1. Example 2 – an extension that measures throughput and latencies
        4. Extending beyond middlewares
        5. Summary
      16. 9. Pipeline Recipes
        1. Using REST APIs
          1. Using treq
          2. A pipeline that writes to Elasticsearch
          3. A pipeline that geocodes using the Google Geocoding API
          4. Enabling geoindexing on Elasticsearch
        2. Interfacing databases with standard Python clients
          1. A pipeline that writes to MySQL
        3. Interfacing services using Twisted-specific clients
          1. A pipeline that reads/writes to Redis
        4. Interfacing CPU-intensive, blocking, or legacy functionality
          1. A pipeline that performs CPU-intensive or blocking operations
          2. A pipeline that uses binaries or scripts
        5. Summary
      17. 10. Understanding Scrapy's Performance
        1. Scrapy's engine – an intuitive approach
          1. Cascading queuing systems
          2. Identifying the bottleneck
          3. Scrapy's performance model
        2. Getting component utilization using telnet
        3. Our benchmark system
        4. The standard performance model
        5. Solving performance problems
          1. Case #1 – saturated CPU
          2. Case #2 – blocking code
          3. Case #3 – "garbage" on the downloader
          4. Case #4 – overflow due to many or large responses
          5. Case #5 – overflow due to limited/excessive item concurrency
          6. Case #6 – the downloader doesn't have enough to do
        6. Troubleshooting flow
        7. Summary
      18. 11. Distributed Crawling with Scrapyd and Real-Time Analytics
        1. How does the title of a property affect the price?
        2. Scrapyd
        3. Overview of our distributed system
        4. Changes to our spider and middleware
          1. Sharded-index crawling
          2. Batching crawl URLs
          3. Getting start URLs from settings
          4. Deploying your project to scrapyd servers
        5. Creating our custom monitoring command
        6. Calculating the shift with Apache Spark streaming
        7. Running a distributed crawl
        8. System performance
        9. The key take-away
        10. Summary
      19. A. Installing and troubleshooting prerequisite software
        1. Installing prerequisites
        2. The system
        3. Installation in a nutshell
        4. Installing on Linux
        5. Installing on Windows or Mac
          1. Install Vagrant
          2. How to access the terminal
          3. Install VirtualBox and Git
          4. Ensure that VirtualBox supports 64-bit images
          5. Enable ssh client for Windows
          6. Download this book's code and set up the system
        6. System setup and operations FAQ
          1. What do I download and how much time does it take?
          2. What should I do if Vagrant freezes?
          3. How do I shut down/resume the VM quickly?
          4. How do I fully reset the VM?
          5. How do I resize the virtual machine?
          6. How do I resolve any port conflicts?
            1. On Linux using Docker natively
            2. On Windows or Mac using a VM
          7. How do I make it work behind a corporate proxy?
          8. How do I connect with the Docker provider VM?
          9. How much CPU/memory does each server use?
          10. How can I see the size of Docker container images?
          11. How can I reset the system if Vagrant doesn't respond?
          12. There's a problem I can't work around, what can I do?
      20. Index