Instant Web Scraping with Java

Book Description

Build simple scrapers or vast armies of Java-based bots to untangle and capture the Web

  • Learn something new in an Instant! A short, fast, focused guide delivering immediate results

  • Get your Java environment set up and running

  • Gather clean, formatted web data into your own database

  • Learn how to work around crawler-resistant websites and legally subvert security measures

  • Use built-in Java features to perform parallel processing and distributed scraping

  • Build test cases for your own websites using JUnit

In Detail

Java is often thought of as a stuffy enterprise language, while web scraping is seen as the murky domain of scripting languages. By combining the robustness and extensibility of Java with the flexibility of web scraping, we can build useful tools that solve difficult problems.

Instant Web Scraping with Java will guide you, step by step, through setting up your Java environment. You will also learn how to write simple web scrapers and distributed networks of crawlers. Throughout the book, we will provide useful tips, out-of-the-box working code, and additional resources to build expert knowledge.

Instant Web Scraping with Java will teach you how to build your own web scrapers using real-world examples that collect and store data from Wikipedia, public records sites, IP address geolocation services, and more. You will learn how to run scrapers across multiple servers, run them in parallel, and work around common anti-scraper security measures used on modern websites.

Instant Web Scraping with Java will show you how to view and collect any Internet data at the speed of your processor!

Table of Contents

  1. Instant Web Scraping with Java
    1. Instant Web Scraping with Java
    2. Credits
    3. About the Author
    4. About the Reviewers
    5. www.PacktPub.com
      1. Support files, eBooks, discount offers and more
      2. Why Subscribe?
      3. Free Access for Packt account holders
    6. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Errata
        3. Piracy
        4. Questions
    7. 1. Instant Web Scraping with Java
      1. How is this legal?
      2. Setting up your Java Environment (Simple)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      3. Writing and executing HelloWorld.java (Simple)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      4. Writing a simple scraper (Simple)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      5. Writing a more complicated scraper (Intermediate)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      6. Handling errors (Simple)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      7. Writing robust, scalable code (Advanced)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      8. Persisting data (Advanced)
        1. Getting ready
        2. How to do it...
        3. How it works...
      9. Writing tests (Intermediate)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      10. Going undercover (Intermediate)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      11. Submitting a basic form (Advanced)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      12. Scraping Ajax Pages (Advanced)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      13. Faster scraping through threading (Intermediate)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...
      14. Faster scraping with RMI (Advanced)
        1. Getting ready
        2. How to do it...
        3. How it works...
        4. There's more...