Webbots, Spiders, and Screen Scrapers, 2nd Edition

Book Description

There's a wealth of data online, but sorting and gathering it by hand can be tedious and time-consuming. Rather than click through page after endless page, why not let bots do the work for you? Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions.

Table of Contents

  1. Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL
  2. About the Author
  3. About the Technical Reviewer
  4. Acknowledgments
  5. Introduction
    1. Old-School Client-Server Technology
    2. The Problem with Browsers
    3. What to Expect from This Book
      1. Learn from My Mistakes
      2. Master Webbot Techniques
      3. Leverage Existing Scripts
    4. About the Website
    5. About the Code
    6. Requirements
      1. Hardware
      2. Software
      3. Internet Access
    7. A Disclaimer (This Is Important)
  6. I. Fundamental Concepts and Techniques
    1. 1. What’s in It for You?
      1. Uncovering the Internet’s True Potential
      2. What’s in It for Developers?
        1. Webbot Developers Are in Demand
        2. Webbots Are Fun to Write
        3. Webbots Facilitate “Constructive Hacking”
      3. What’s in It for Business Leaders?
        1. Customize the Internet for Your Business
        2. Capitalize on the Public’s Inexperience with Webbots
        3. Accomplish a Lot with a Small Investment
      4. Final Thoughts
    2. 2. Ideas for Webbot Projects
      1. Inspiration from Browser Limitations
        1. Webbots That Aggregate and Filter Information for Relevance
        2. Webbots That Interpret What They Find Online
        3. Webbots That Act on Your Behalf
      2. A Few Crazy Ideas to Get You Started
        1. Help Out a Busy Executive
        2. Save Money by Automating Tasks
        3. Protect Intellectual Property
        4. Monitor Opportunities
        5. Verify Access Rights on a Website
        6. Create an Online Clipping Service
        7. Plot Unauthorized Wi-Fi Networks
        8. Track Web Technologies
        9. Allow Incompatible Systems to Communicate
      3. Final Thoughts
    3. 3. Downloading Web Pages
      1. Think About Files, Not Web Pages
      2. Downloading Files with PHP’s Built-in Functions
        1. Downloading Files with fopen() and fgets()
          1. Creating Your First Webbot Script
          2. Executing Webbots in Command Shells
          3. Executing Webbots in Browsers
        2. Downloading Files with file()
      3. Introducing PHP/CURL
        1. Multiple Transfer Protocols
        2. Form Submission
        3. Basic Authentication
        4. Cookies
        5. Redirection
        6. Agent Name Spoofing
        7. Referer Management
        8. Socket Management
      4. Installing PHP/CURL
      5. LIB_http
        1. Familiarizing Yourself with the Default Values
        2. Using LIB_http
          1. http_get()
          2. http_get_withheader()
        3. Learning More About HTTP Headers
        4. Examining LIB_http’s Source Code
          1. LIB_http Defaults
          2. LIB_http Functions
      6. Final Thoughts
    4. 4. Basic Parsing Techniques
      1. Content Is Mixed with Markup
      2. Parsing Poorly Written HTML
      3. Standard Parse Routines
      4. Using LIB_parse
        1. Splitting a String at a Delimiter: split_string()
        2. Parsing Text Between Delimiters: return_between()
        3. Parsing a Data Set into an Array: parse_array()
        4. Parsing Attribute Values: get_attribute()
        5. Removing Unwanted Text: remove()
      5. Useful PHP Functions
        1. Detecting Whether a String Is Within Another String
        2. Replacing a Portion of a String with Another String
        3. Parsing Unformatted Text
        4. Measuring the Similarity of Strings
      6. Final Thoughts
        1. Don’t Trust a Poorly Coded Web Page
        2. Parse in Small Steps
        3. Don’t Render Parsed Text While Debugging
        4. Use Regular Expressions Sparingly
    5. 5. Advanced Parsing with Regular Expressions
      1. Pattern Matching, the Key to Regular Expressions
      2. PHP Regular Expression Types
        1. PHP Regular Expressions Functions
          1. preg_replace(pattern, replacement, subject)
          2. preg_match(pattern, subject)
          3. preg_match_all(pattern, subject, result_array)
          4. preg_split(pattern, subject)
        2. Resemblance to PHP Built-In Functions
      3. Learning Patterns Through Examples
        1. Parsing Numbers
        2. Detecting a Series of Characters
        3. Matching Alpha Characters
        4. Matching on Wildcards
        5. Specifying Alternate Matches
        6. Regular Expressions Groupings and Ranges
      4. Regular Expressions of Particular Interest to Webbot Developers
        1. Parsing Phone Numbers
        2. Where to Go from Here
      5. When Regular Expressions Are (or Aren’t) the Right Parsing Tool
        1. Strengths of Regular Expressions
        2. Disadvantages of Pattern Matching While Parsing Web Pages
          1. Regular Expressions Provide Little (If Any) Context
          2. Regular Expressions Provide Too Many Choices
          3. Regular Expressions Are Harder to Debug
          4. Regular Expressions Complicate Your Code
        3. Which Are Faster: Regular Expressions or PHP’s Built-In Functions?
      6. Final Thoughts
    6. 6. Automating Form Submission
      1. Reverse Engineering Form Interfaces
      2. Form Handlers, Data Fields, Methods, and Event Triggers
        1. Form Handlers
        2. Data Fields
        3. Methods
          1. The GET Method
          2. The POST Method
        4. Multipart Encoding
        5. Event Triggers
      3. Unpredictable Forms
        1. JavaScript Can Change a Form Just Before Submission
        2. Form HTML Is Often Unreadable by Humans
        3. Cookies Aren’t Included in the Form, but Can Affect Operation
      4. Analyzing a Form
      5. Final Thoughts
        1. Don’t Blow Your Cover
        2. Correctly Emulate Browsers
        3. Avoid Form Errors
    7. 7. Managing Large Amounts of Data
      1. Organizing Data
        1. Naming Conventions
        2. Storing Data in Structured Files
        3. Storing Text in a Database
          1. LIB_mysql
          2. The insert() Function
          3. The update() Function
          4. The exe_sql() Function
        4. Storing Images in a Database
        5. Database or File?
      2. Making Data Smaller
        1. Storing References to Image Files
        2. Compressing Data
          1. Compressing Inbound Files
          2. Compressing Files on Your Hard Drive
        3. Removing Formatting
      3. Thumbnailing Images
      4. Final Thoughts
  7. II. Projects
    1. 8. Price-Monitoring Webbots
      1. The Target
      2. Designing the Parsing Script
      3. Initialization and Downloading the Target
      4. Further Exploration
    2. 9. Image-Capturing Webbots
      1. Example Image-Capturing Webbot
      2. Creating the Image-Capturing Webbot
        1. Binary-Safe Download Routine
        2. Directory Structure
        3. The Main Script
          1. Initialization and Target Validation
          2. Defining the Page Base
          3. Creating a Root Directory for Imported File Structure
          4. Parsing Image Tags from the Downloaded Web Page
          5. The Image-Processing Loop
          6. Creating the Local Directory Structure
          7. Downloading and Saving the File
      3. Further Exploration
      4. Final Thoughts
    3. 10. Link-Verification Webbots
      1. Creating the Link-Verification Webbot
        1. Initializing the Webbot and Downloading the Target
        2. Setting the Page Base
        3. Parsing the Links
        4. Running a Verification Loop
        5. Generating Fully Resolved URLs
        6. Downloading the Linked Page
        7. Displaying the Page Status
      2. Running the Webbot
        1. LIB_http_codes
        2. LIB_resolve_addresses
      3. Further Exploration
    4. 11. Search-Ranking Webbots
      1. Description of a Search Result Page
      2. What the Search-Ranking Webbot Does
      3. Running the Search-Ranking Webbot
      4. How the Search-Ranking Webbot Works
      5. The Search-Ranking Webbot Script
        1. Initializing Variables
        2. Starting the Loop
        3. Fetching the Search Results
        4. Parsing the Search Results
      6. Final Thoughts
        1. Be Kind to Your Sources
        2. Search Sites May Treat Webbots Differently Than Browsers
        3. Spidering Search Engines Is a Bad Idea
        4. Familiarize Yourself with the Google API
      7. Further Exploration
    5. 12. Aggregation Webbots
      1. Choosing Data Sources for Webbots
      2. Example Aggregation Webbot
        1. Familiarizing Yourself with RSS Feeds
        2. Writing the Aggregation Webbot
          1. Downloading and Parsing the Target
          2. Dealing with CDATA
      3. Adding Filtering to Your Aggregation Webbot
      4. Further Exploration
    6. 13. FTP Webbots
      1. Example FTP Webbot
      2. PHP and FTP
      3. Further Exploration
    7. 14. Webbots That Read Email
      1. The POP3 Protocol
        1. Logging into a POP3 Mail Server
        2. Reading Mail from a POP3 Mail Server
          1. The POP3 LIST Command
          2. The POP3 RETR Command
          3. Other Useful POP3 Commands
      2. Executing POP3 Commands with a Webbot
      3. Further Exploration
        1. Email-Controlled Webbots
        2. Email Interfaces
    8. 15. Webbots That Send Email
      1. Email, Webbots, and Spam
      2. Sending Mail with SMTP and PHP
        1. Configuring PHP to Send Mail
        2. Sending an Email with mail()
      3. Writing a Webbot That Sends Email Notifications
        1. Keeping Legitimate Mail out of Spam Filters
        2. Sending HTML-Formatted Email
      4. Further Exploration
        1. Using Returned Emails to Prune Access Lists
        2. Using Email as Notification That Your Webbot Ran
        3. Leveraging Wireless Technologies
        4. Writing Webbots That Send Text Messages
    9. 16. Converting a Website into a Function
      1. Writing a Function Interface
        1. Defining the Interface
        2. Analyzing the Target Web Page
        3. Using describe_zipcode()
          1. Getting the Session Value
          2. Submitting the Form
          3. Parsing and Returning the Result
      2. Final Thoughts
        1. Distributing Resources
        2. Using Standard Interfaces
        3. Designing a Custom Lightweight “Web Service”
  8. III. Advanced Technical Considerations
    1. 17. Spiders
      1. How Spiders Work
      2. Example Spider
      3. LIB_simple_spider
        1. harvest_links()
        2. archive_links()
        3. get_domain()
        4. exclude_link()
      4. Experimenting with the Spider
      5. Adding the Payload
      6. Further Exploration
        1. Save Links in a Database
        2. Separate the Harvest and Payload
        3. Distribute Tasks Across Multiple Computers
        4. Regulate Page Requests
    2. 18. Procurement Webbots and Snipers
      1. Procurement Webbot Theory
        1. Get Purchase Criteria
        2. Authenticate Buyer
        3. Verify Item
        4. Evaluate Purchase Triggers
        5. Make Purchase
        6. Evaluate Results
      2. Sniper Theory
        1. Get Purchase Criteria
        2. Authenticate Buyer
        3. Verify Item
        4. Synchronize Clocks
        5. Time to Bid?
        6. Submit Bid
        7. Evaluate Results
      3. Testing Your Own Webbots and Snipers
      4. Further Exploration
      5. Final Thoughts
    3. 19. Webbots and Cryptography
      1. Designing Webbots That Use Encryption
        1. SSL and PHP Built-in Functions
        2. Encryption and PHP/CURL
      2. A Quick Overview of Web Encryption
      3. Final Thoughts
    4. 20. Authentication
      1. What Is Authentication?
        1. Types of Online Authentication
        2. Strengthening Authentication by Combining Techniques
        3. Authentication and Webbots
      2. Example Scripts and Practice Pages
      3. Basic Authentication
      4. Session Authentication
        1. Authentication with Cookie Sessions
          1. How Cookies Work
          2. Cookie Session Example
        2. Authentication with Query Sessions
      5. Final Thoughts
    5. 21. Advanced Cookie Management
      1. How Cookies Work
      2. PHP/CURL and Cookies
      3. How Cookies Challenge Webbot Design
        1. Purging Temporary Cookies
        2. Managing Multiple Users’ Cookies
      4. Further Exploration
    6. 22. Scheduling Webbots and Spiders
      1. Preparing Your Webbots to Run as Scheduled Tasks
      2. The Windows XP Task Scheduler
        1. Scheduling a Webbot to Run Daily
        2. Complex Schedules
      3. The Windows 7 Task Scheduler
      4. Non-calendar-based Triggers
      5. Final Thoughts
        1. Determine the Webbot’s Best Periodicity
        2. Avoid Single Points of Failure
        3. Add Variety to Your Schedule
    7. 23. Scraping Difficult Websites with Browser Macros
      1. Barriers to Effective Web Scraping
        1. AJAX
        2. Bizarre JavaScript and Cookie Behavior
        3. Flash
      2. Overcoming Webscraping Barriers with Browser Macros
        1. What Is a Browser Macro?
        2. The Ultimate Browser-Like Webbot
        3. Installing and Using iMacros
        4. Creating Your First Macro
          1. Macro Initialization
          2. Recording the Google Session
          3. iMacros Commands
          4. Instructions You’ll Want in Every Macro
          5. Running a Macro
      3. Final Thoughts
        1. Are Macros Really Necessary?
        2. Other Uses
    8. 24. Hacking iMacros
      1. Hacking iMacros for Added Functionality
        1. Reasons for Not Using the iMacros Scripting Engine
        2. Creating a Dynamic Macro
          1. Writing a Script That Creates a Dynamic Macro
          2. Integrating External Data into Dynamically Created Macros
        3. Launching iMacros Automatically
          1. Launching iMacros from Windows
          2. Launching iMacros from Linux
      2. Further Exploration
    9. 25. Deployment and Scaling
      1. One-to-Many Environment
      2. One-to-One Environment
      3. Many-to-Many Environment
      4. Many-to-One Environment
      5. Scaling and Denial-of-Service Attacks
        1. Even Simple Webbots Can Generate a Lot of Traffic
        2. Inefficiencies at the Target
        3. The Problems with Scaling Too Well
      6. Creating Multiple Instances of a Webbot
        1. Forking Processes
        2. Leveraging the Operating System
        3. Distributing the Task over Multiple Computers
      7. Managing a Botnet
        1. Botnet Communication Methods
          1. Polling the Botnet Server
          2. Determining If There Is a Task for the Harvester to Perform
          3. The Checkout Process
          4. Assigning Tasks
          5. Performing Tasks
          6. Uploading Harvested Data
          7. Processing the Harvested Data
      8. Further Exploration
  9. IV. Larger Considerations
    1. 26. Designing Stealthy Webbots and Spiders
      1. Why Design a Stealthy Webbot?
        1. Log Files
          1. Access Logs
          2. Error Logs
          3. Custom Logs
        2. Log-Monitoring Software
      2. Stealth Means Simulating Human Patterns
        1. Be Kind to Your Resources
        2. Run Your Webbot During Busy Hours
        3. Don’t Run Your Webbot at the Same Time Each Day
        4. Don’t Run Your Webbot on Holidays and Weekends
        5. Use Random, Intra-fetch Delays
      3. Final Thoughts
    2. 27. Proxies
      1. What Is a Proxy?
      2. Proxies in the Virtual World
      3. Why Webbot Developers Use Proxies
        1. Using Proxies to Become Anonymous
        2. Using a Proxy to Be Somewhere Else
      4. Using a Proxy Server
        1. Using a Proxy in a Browser
        2. Using a Proxy with PHP/CURL
      5. Types of Proxy Servers
        1. Open Proxies
          1. Types of Open Proxies
          2. The Dark Side of Open Proxies
          3. More About Open Proxy Listing Services
        2. Tor
          1. Using Tor
          2. Configuring PHP/CURL to Use Tor
          3. Disadvantages of Tor
        3. Commercial Proxies
      6. Final Thoughts
        1. Anonymity Is a Process, Not a Feature
        2. Creating Your Own Proxy Service
    3. 28. Writing Fault-Tolerant Webbots
      1. Types of Webbot Fault Tolerance
        1. Adapting to Changes in URLs
          1. Avoid Making Requests for Pages That Don’t Exist
          2. Follow Page Redirections
          3. Maintain the Accuracy of Referer Values
        2. Adapting to Changes in Page Content
          1. Avoid Position Parsing
          2. Use Relative Parsing
          3. Look for Landmarks That Are Least Likely to Change
        3. Adapting to Changes in Forms
        4. Adapting to Changes in Cookie Management
        5. Adapting to Network Outages and Network Congestion
      2. Error Handlers
      3. Further Exploration
    4. 29. Designing Webbot-Friendly Websites
      1. Optimizing Web Pages for Search Engine Spiders
        1. Well-Defined Links
        2. Google Bombs and Spam Indexing
        3. Title Tags
        4. Meta Tags
        5. Header Tags
        6. Image alt Attributes
      2. Web Design Techniques That Hinder Search Engine Spiders
        1. JavaScript
        2. Non-ASCII Content
      3. Designing Data-Only Interfaces
        1. XML
        2. Lightweight Data Exchange
          1. How Not to Design a Lightweight Interface
          2. A Safer Method of Passing Variables to Webbots
        3. SOAP
          1. Advantages of SOAP
          2. Disadvantages of SOAP
        4. REST
      4. Final Thoughts
    5. 30. Killing Spiders
      1. Asking Nicely
        1. Create a Terms of Service Agreement
        2. Use the robots.txt File
        3. Use the Robots Meta Tag
      2. Building Speed Bumps
        1. Selectively Allow Access to Specific Web Agents
        2. Use Obfuscation
        3. Use Cookies, Encryption, JavaScript, and Redirection
        4. Authenticate Users
        5. Update Your Site Often
        6. Embed Text in Other Media
      3. Setting Traps
        1. Create a Spider Trap
        2. Fun Things to Do with Unwanted Spiders
      4. Final Thoughts
    6. 31. Keeping Webbots out of Trouble
      1. It’s All About Respect
      2. Copyright
        1. Do Consult Resources
        2. Don’t Be an Armchair Lawyer
          1. Copyrights Do Not Have to Be Registered
          2. Assume “All Rights Reserved”
          3. You Cannot Copyright a Fact
          4. You Can Copyright a Collection of Facts if Presented Creatively
          5. You Can Use Some Material Under Fair Use Laws
      3. Trespass to Chattels
      4. Internet Law
      5. Final Thoughts
  10. A. PHP/CURL Reference
    1. Creating a Minimal PHP/CURL Session
    2. Initiating PHP/CURL Sessions
    3. Setting PHP/CURL Options
      1. CURLOPT_URL
      2. CURLOPT_RETURNTRANSFER
      3. CURLOPT_REFERER
      4. CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
      5. CURLOPT_USERAGENT
      6. CURLOPT_NOBODY and CURLOPT_HEADER
      7. CURLOPT_TIMEOUT
      8. CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
      9. CURLOPT_HTTPHEADER
      10. CURLOPT_SSL_VERIFYPEER
      11. CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
      12. CURLOPT_POST and CURLOPT_POSTFIELDS
      13. CURLOPT_VERBOSE
      14. CURLOPT_PORT
    4. Executing the PHP/CURL Command
      1. Retrieving PHP/CURL Session Information
      2. Viewing PHP/CURL Errors
    5. Closing PHP/CURL Sessions
  11. B. Status Codes
    1. HTTP Codes
    2. NNTP Codes
  12. C. SMS Gateways
    1. Sending Text Messages
    2. Reading Text Messages
    3. A Sampling of Text Message Email Addresses
  13. Index
  14. About the Author
  15. Colophon