Scaling Apache Solr

Book Description

Optimize your searches using high-performance enterprise search repositories with Apache Solr

In Detail

This book is for individuals who want to build high-performance, scalable, enterprise-ready search engines for their customers or organizations. The book starts with the basics of Apache Solr, covering different ways to analyze enterprise information and design enterprise-ready search engines using Solr. It also discusses how to take Solr-based enterprise search to the next level through scaling.

Each chapter takes you through progressively more advanced levels of Apache Solr with real-world practical details such as installing, setting up, and configuring instances. This book contains detailed explanations of the basic and advanced features of Apache Solr.

By sequentially working through the steps in each chapter and with the help of real-life industry examples, you will quickly master the features of Apache Solr to build search solutions for enterprises.

What You Will Learn

  • Gain a complete understanding of Apache Solr and its ecosystem
  • Develop scalable, high-performance search applications using Apache Solr
  • Customize Apache-Solr-based search for different requirements
  • Discover different techniques to build high-speed enterprise searches
  • Design enterprise-ready search engines and implement a scalable enterprise search functionality
  • Integrate an Apache-Solr-based search with different subsystems and legacy systems
  • Scale Apache Solr through sharding, replication, and fault tolerance
  • Learn about performance tuning for your Solr-based application while scaling your data
  • Make your enterprise search cloud-ready to be able to work with multiple clients

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Scaling Apache Solr
      1. Table of Contents
      2. Scaling Apache Solr
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Understanding Apache Solr
        1. Challenges in enterprise search
        2. Apache Solr – an overview
        3. Features of Apache Solr
          1. Solr for end users
            1. Powerful full text search
            2. Search through rich information
            3. Results ranking, pagination, and sorting
            4. Facets for better browsing experience
            5. Advanced search capabilities
          2. Administration
        4. Apache Solr architecture
          1. Storage
          2. Solr application
          3. Integration
            1. Client APIs and SolrJ client
            2. Other interfaces
        5. Practical use cases for Apache Solr
          1. Enterprise search for a job search agency
            1. Problem statement
            2. Approach
          2. Enterprise search for energy industry
            1. Problem statement
            2. Approach
        6. Summary
      9. 2. Getting Started with Apache Solr
        1. Setting up Apache Solr
          1. Prerequisites
          2. Running Solr on Jetty
          3. Running Solr on Tomcat
          4. Solr administration
          5. What's next?
          6. Common problems and solution
        2. Understanding the Solr structure
          1. The Solr home directory structure
          2. Solr navigation
        3. Configuring the Apache Solr for enterprise
          1. Defining a Solr schema
            1. Solr fields
            2. Dynamic Fields in Solr
            3. Copying the fields
            4. Field types
            5. Other important elements in the Solr schema
          2. Configuring Solr parameters
            1. solr.xml and Solr core
            2. solrconfig.xml
            3. The Solr plugin
          3. Other configurations
        4. Understanding SolrJ
        5. Summary
      10. 3. Analyzing Data with Apache Solr
        1. Understanding enterprise data
          1. Categorizing by characteristics
          2. Categorizing by access pattern
          3. Categorizing by data formats
        2. Loading data using native handlers
          1. Quick and simple data loading – post tool
          2. Working with JSON, XML, and CSV
            1. Handling JSON data
            2. Working with CSV data
            3. Working with XML data
        3. Working with rich documents
          1. Understanding Apache Tika
          2. Using Solr Cell (ExtractingRequestHandler)
          3. Adding metadata to your rich documents
        4. Importing structured data from the database
          1. Configuring the data source
          2. Importing data in Solr
            1. Full import
            2. Delta import
          3. Loading RDBMS tables in Solr
        5. Advanced topics with Solr
          1. Deduplication
          2. Extracting information from scanned documents
          3. Searching through images using LIRE
        6. Summary
      11. 4. Designing Enterprise Search
        1. Designing aspects for enterprise search
          1. Identifying requirements
          2. Matching user expectations through relevance
          3. Access to searched entities and user interface
          4. Improving search performance and ensuring instance scalability
          5. Working with applications through federated search
          6. Other differentiators – mobiles, linguistic search, and security
        2. Enterprise search data-processing patterns
          1. Standalone search engine server
          2. Distributed enterprise search pattern
          3. The replicated enterprise search pattern
          4. Distributed and replicated
        3. Data integrating pattern for search
          1. Data import by enterprise search
          2. Applications pushing data
          3. Middleware-based integration
        4. Case study – designing an enterprise knowledge repository search for software IT services
          1. Gathering requirements
          2. Designing the solution
            1. Designing the schema
            2. Integrating subsystems with Apache Solr
            3. Working on end user interface
        5. Summary
      12. 5. Integrating Apache Solr
        1. Empowering the Java Enterprise application with Solr search
          1. Embedding Apache Solr as a module (web application) in an enterprise application
            1. How to do it?
          2. Apache Solr in your web application
            1. How to do it?
        2. Integration with client technologies
          1. Integrating Apache Solr with PHP for web portals
            1. Interacting directly with Solr
            2. Using the Solr PHP client
              1. How to do it?
            3. Advanced integration with Solarium
              1. How to do it?
          2. Integrating Apache Solr with JavaScript
            1. Using simple XMLHTTPRequest
            2. Integrating Apache Solr using AJAX Solr
          3. Parsing Solr XML with the help of XSLT
        3. Case study – Apache Solr and Drupal
          1. How to do it?
        4. Summary
      13. 6. Distributed Search Using Apache Solr
        1. Need for distributed search
          1. Distributed search architecture
          2. Apache Solr and distributed search
        2. Understanding SolrCloud
          1. Why Zookeeper?
          2. SolrCloud architecture
        3. Building enterprise distributed search using SolrCloud
          1. Setting up a SolrCloud for development
          2. Setting up a SolrCloud for production
          3. Adding a document to SolrCloud
          4. Creating shards, collections, and replicas in SolrCloud
        4. Common problems and resolutions
        5. Case study – distributed enterprise search server for the software industry
        6. Summary
      14. 7. Scaling Solr through Sharding, Fault Tolerance, and Integration
        1. Enabling search result clustering with Carrot2
          1. Why Carrot2?
          2. Enabling Carrot2-based document clustering
          3. Understanding Carrot2 result clustering
          4. Viewing Solr results in the Carrot2 workbench
          5. FAQs and problems
        2. Sharding and fault tolerance
          1. Document routing and sharding
          2. Shard splitting
          3. Load balancing and fault tolerance in SolrCloud
        3. Searching Solr documents in near real time
          1. Strategies for near real-time search in Apache Solr
            1. Explicit call to commit from a client
            2. solrconfig.xml – autocommit
            3. CommitWithin – delegating the responsibility to Solr
            4. Real-time search in Apache Solr
        4. Solr with MongoDB
          1. Understanding MongoDB
          2. Installing MongoDB
          3. Creating Solr indexes from MongoDB
        5. Scaling Solr through Storm
          1. Getting along with Apache Storm
          2. Solr and Apache Storm
        6. Summary
      15. 8. Scaling Solr through High Performance
        1. Monitoring performance of Apache Solr
          1. What should be monitored?
            1. Hardware and operating system
            2. Java virtual machine
            3. Apache Solr search runtime
            4. Apache Solr indexing time
            5. SolrCloud
          2. Tools for monitoring Solr performance
            1. Solr administration user interface
            2. JConsole
            3. SolrMeter
        2. Tuning Solr JVM and container
          1. Deciding heap size
          2. How can we optimize JVM?
          3. Optimizing JVM container
        3. Optimizing Solr schema and indexing
          1. Stored fields
          2. Indexed fields and field lengths
          3. Copy fields and dynamic fields
          4. Fields for range queries
          5. Index field updates
          6. Synonyms, stemming, and stopwords
          7. Tuning DataImportHandler
          8. Speeding up index generation
          9. Committing the change
            1. Limiting indexing buffer size
          10. SolrJ implementation classes
        4. Speeding Solr through Solr caching
          1. The filter cache
          2. The query result cache
          3. The document cache
          4. The field value cache
          5. The warming up cache
        5. Improving runtime search for Solr
          1. Pagination
          2. Reducing Solr response footprint
          3. Using filter queries
          4. Search query and the parsers
          5. Lazy field loading
        6. Optimizing SolrCloud
        7. Summary
      16. 9. Solr and Cloud Computing
        1. Enterprise search on Cloud
          1. Models of engagement
          2. Enterprise search Cloud deployment models
        2. Solr on Cloud strategies
          1. Scaling Solr with a dedicated application
            1. Advantages
            2. Disadvantages
          2. Scaling Solr horizontal as multiple applications
            1. Advantages
            2. Disadvantages
          3. Scaling horizontally through the Solr multicore
            1. Scaling horizontally with replication
            2. Scaling horizontally with Zookeeper
              1. Advantages
              2. Disadvantages
        3. Running Solr on Cloud (IaaS and PaaS)
          1. Running Solr with Amazon Cloud
          2. Running Solr on Windows Azure
        4. Running Solr on Cloud (SaaS) and enterprise search as a service
          1. Running Solr with OpenSolr Cloud
          2. Running Solr with SolrHQ Cloud
          3. Running Solr with Bitnami
          4. Working with Amazon CloudSearch
          5. Drupal-Solr SaaS with Acquia
        5. Summary
      17. 10. Scaling Solr Capabilities with Big Data
        1. Apache Solr and HDFS
        2. Big Data search on Katta
          1. How Katta works?
          2. Setting up Katta cluster
          3. Creating Katta indexes
        3. Using the Solr 1045 patch – map-side indexing
        4. Using the Solr 1301 patch – reduce-side indexing
        5. Apache Solr and Cassandra
          1. Working with Cassandra and Solr
            1. Single node configuration
            2. Integrating with multinode Cassandra
        6. Advanced analytics with Solr
          1. Integrating Solr and R
        7. Summary
      18. A. Sample Configuration for Apache Solr
        1. schema.xml
        2. solrconfig.xml
        3. spellings.txt
        4. synonyms.txt
        5. protwords.txt
        6. stopwords.txt
      19. Index