Apache Solr Enterprise Search Server - Third Edition

Book Description

Enhance your searches with faceted navigation, result highlighting, relevancy-ranked sorting, and much more with this comprehensive guide to Apache Solr 4

In Detail

Solr is a widely used open source enterprise search server that delivers powerful search and faceted navigation, features that are hard to achieve with databases. Solr supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, relevancy tuning, geospatial searches, and much more.

This book is a comprehensive resource for just about everything Solr has to offer, taking you from first exposure through development and deployment. Even if you plan to use Solr 5, the information should be just as applicable because of Solr's strong commitment to backward compatibility. The book also includes some information specific to Solr 5.

What You Will Learn

  • Design a schema to include text indexing details such as tokenization, stemming, and synonyms

  • Import data from databases and from formats such as CSV and XML, and extract text from different document formats

  • Search using Solr's rich query syntax, perform geospatial searches, "join" relationally, and influence relevancy order

  • Build a query auto-complete/suggester capability with knowledge of the fundamental types of suggestion and ways to implement them

  • Enhance standard searches with faceting for navigation or analytics

  • Deploy Solr to production taking into account logging, security, and monitoring

  • Integrate a host of technologies with Solr including web crawlers, Hadoop, Java, JavaScript, Ruby, PHP, Drupal, and others

  • Tune Solr and use SolrCloud for horizontal scalability

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

    1. Apache Solr Enterprise Search Server - Third Edition
      1. Table of Contents
      2. Apache Solr Enterprise Search Server - Third Edition
      3. Credits
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Quick Starting Solr
        1. An introduction to Solr
          1. Lucene – the underlying engine
          2. Solr – a Lucene-based search server
          3. Comparison to database technology
        2. A few differences between Solr 4 and Solr 5
        3. Getting started
          1. Solr's installation directory structure
          2. Running Solr
        4. A quick tour of Solr
          1. Loading sample data
          2. A simple query
          3. Some statistics
          4. The sample browse interface
        5. Configuration files
        6. What's next?
          1. Schema design and indexing
          2. Text analysis
          3. Searching
          4. Integration
        7. Resources outside this book
        8. Summary
      9. 2. Schema Design
        1. Is Solr schemaless?
        2. MusicBrainz.org
        3. One combined index or separate indices
          1. One combined index
            1. Problems with using a single combined index
          2. Separate indices
        4. Schema design
          1. Step 1 – determine which searches are going to be powered by Solr
          2. Step 2 – determine the entities returned from each search
          3. Step 3 – denormalize related data
            1. Denormalizing – one-to-one associated data
            2. Denormalizing – one-to-many associated data
          4. Step 4 – omit the inclusion of fields only used in search results (optional)
        5. The schema.xml file
          1. Field definitions
            1. Dynamic field definitions
          2. Advanced field options for indexed fields
          3. The unique key
          4. The default search field and query operator
          5. Copying fields
          6. Our MusicBrainz field definitions
          7. Defining field types
          8. Built-in field type classes
            1. Numbers and dates
            2. Some other field types
        6. Summary
      10. 3. Text Analysis
        1. Configuring field types
          1. Experimenting with text analysis
        2. Character filters
        3. Tokenization
        4. Filtering
          1. Stemming
            1. Correcting and augmenting stemming
          2. Processing synonyms
            1. Synonym expansion at index time versus query time
          3. Working with stop words
          4. Phonetic analysis
          5. Substring indexing and wildcards
            1. ReversedWildcardFilter
            2. N-gram analysis
            3. N-gram costs
          6. Sorting text
          7. Miscellaneous token filters
        5. The multilingual search
          1. The multifield approach
          2. The multicore approach
          3. The single field approach
        6. Summary
      11. 4. Indexing Data
        1. Communicating with Solr
          1. Using direct HTTP or a convenient client API
          2. Pushing data to Solr or have Solr pull it
          3. Data formats
          4. Solr's HTTP POST options
          5. Remote streaming
        2. Solr's Update-XML format
          1. Deleting documents
        3. Commit, optimize, rollback, and the transaction log
          1. Don't overlap commits
          2. Index optimization
          3. Rolling back an uncommitted change
          4. The transaction log
        4. Atomic updates and optimistic concurrency
        5. Sending CSV-formatted data to Solr
          1. Configuration options
        6. The DataImportHandler framework
          1. Configuring the DataImportHandler framework
          2. The development console
          3. Writing a DIH configuration file
            1. Data sources
            2. Entity processors
            3. Fields and transformers
          4. Example DIH configurations
            1. Importing from databases
            2. Importing XML from a file with XSLT
            3. Importing multiple rich document files – crawling
          5. Importing commands
            1. Delta imports
        7. Indexing documents with Solr Cell
          1. Extracting text and metadata from files
          2. Configuring Solr
          3. Solr Cell parameters
        8. Update request processors
        9. Summary
      12. 5. Searching
        1. Your first search – a walk-through
          1. A note on response format types
        2. Solr's generic XML structured data representation
        3. Solr's XML response format
          1. Parsing the URL
        4. Understanding request handlers
        5. Query parameters
          1. Search criteria related parameters
          2. Result pagination related parameters
          3. Output-related parameters
            1. More about the fl parameter
          4. Diagnostic parameters
        6. Query parsers and local-params
        7. Query syntax (the lucene query parser)
          1. Matching all the documents
          2. Mandatory, prohibited, and optional clauses
            1. Boolean operators
          3. Subqueries
            1. Limitations of prohibited clauses in subqueries
          4. Querying specific fields
          5. Phrase queries and term proximity
          6. Wildcard queries
            1. Fuzzy queries
            2. Regular expression queries
          7. Range queries
            1. Date math
          8. Score boosting
          9. Existence and nonexistence queries
          10. Escaping special characters
        8. The DisMax query parser – part 1
          1. Searching multiple fields
          2. Limited query syntax
          3. Min-should-match
            1. Basic rules
            2. Multiple rules
            3. What to choose
          4. A default query
          5. The uf parameter
        9. Filtering
        10. Sorting
        11. Joining
          1. The join query parser
          2. Block-join query parsers
            1. The block-join-children parser
            2. The block-join-parent parser
        12. Spatial search
          1. Spatial in Solr 3 – LatLonType and friends
            1. Configuration
          2. Spatial in Solr 4 – SpatialRecursivePrefixTreeFieldType
            1. Configuration – basic
          3. Indexing points
          4. Filtering by distance or rectangle
          5. Sorting by distance
            1. Returning the distance
            2. Boosting by distance
            3. Memory and performance of distance sorting and boosting
          6. Advanced spatial
        13. Summary
      13. 6. Search Relevancy
        1. Scoring
          1. Alternative scoring models
          2. Query-time and index-time boosting
          3. Troubleshooting queries and scoring
            1. Tools – Splainer and Quepid
        2. The DisMax query parser – part 2
          1. Lucene's DisjunctionMaxQuery
          2. Boosting – automatic phrase boosting
            1. Configuring automatic phrase boosting
            2. Phrase slop configuration
            3. Partial phrase boosting
          3. Boosting – boost queries
          4. Boosting – boost functions
            1. Add or multiply boosts
        3. Functions and function queries
          1. Field references
          2. Function references
            1. Mathematical primitives
            2. Other math
            3. Boolean functions
            4. Relevancy statistics functions
            5. Ord and rord
            6. Miscellaneous functions
          3. External field values
          4. Function query boosting
            1. Formula – logarithm
            2. Formula – inverse reciprocal
            3. Formula – reciprocal
            4. Formula – linear
          5. How to boost based on an increasing numeric field
            1. Step by step…
          6. How to boost based on recent dates
            1. Step by step…
        4. Summary
      14. 7. Faceting
        1. A quick example – faceting release types
        2. Field requirements
        3. Types of faceting
        4. Faceting field values
          1. Alphabetic range bucketing
        5. Faceting numeric and date ranges
          1. Range facet parameters
        6. Facet queries
        7. Building a filter query from a facet
          1. Field value filter queries
          2. Facet range filter queries
        8. Pivot faceting
          1. Hierarchical faceting
        9. Excluding filters – multiselect faceting
        10. Summary
      15. 8. Search Components
        1. About components
        2. The highlight component
          1. A highlighting example
          2. Choose the Standard, FastVector, or Postings highlighter
            1. The Standard (default) highlighter
            2. The FastVector highlighter
            3. The Postings highlighter
          3. Highlighting configuration
        3. The SpellCheck component
          1. The schema configuration
          2. Configuration in solrconfig.xml
            1. Configuring spellcheckers – dictionaries
              1. DirectSolrSpellChecker options
              2. IndexBasedSpellChecker options
              3. FileBasedSpellChecker options
              4. WordBreakSolrSpellChecker options
            2. Processing the q parameter
            3. Processing the spellcheck.q parameter
          3. Building index- and file-based spellcheckers
          4. Issuing spellcheck requests
          5. Example usage for a misspelled query
        4. Query complete/suggest
          1. Instant-search via edge n-grams
          2. Query term completion via facet.prefix
          3. Query term completion via the Suggester
          4. Query term completion via the Terms component
          5. Field-value completion via the Suggester
        5. The QueryElevation component
          1. Configuration
        6. The MoreLikeThis component
          1. Configuration parameters
            1. Parameters specific to the MLT search component
            2. Parameters specific to the MLT request handler
            3. Common MLT parameters
          2. The MLT results example
        7. The Stats component
          1. Configuring the stats component
          2. Statistics on track durations
        8. The Clustering component
        9. Collapsing and expanding
          1. The Collapse query parser
          2. The Expand component
          3. An example
          4. Compared to Result grouping
        10. The TermVector component
        11. Summary
      16. 9. Integrating Solr
        1. Working with the included examples
          1. Inventory of examples
        2. Solritas – the integrated search UI
          1. The pros and cons of Solritas
        3. SolrJ – Solr's Java client API
          1. The sample code – BrainzSolrClient
          2. Dependencies and Maven
            1. Declaring logging dependencies
          3. The SolrServer class
            1. Using javabin instead of XML for efficiency
          4. Searching with SolrJ
          5. Indexing with SolrJ
            1. Deleting documents
          6. Annotating your JavaBean – an alternative
          7. Embedding Solr
            1. When should you use embedded Solr? Tests!
        4. Using JavaScript/AJAX with Solr
          1. Wait, what about security?
          2. Building a Solr-powered artists autocomplete widget with jQuery and JSONP
          3. AJAX Solr
        5. Using XSLT to transform XML search results
        6. Accessing Solr from PHP applications
          1. solr-php-client
          2. Drupal options
            1. The Apache Solr Search integration module
            2. Hosted Solr by Acquia
        7. Ruby on Rails integrations
          1. Solr's Ruby response writer
          2. The sunspot_rails gem
            1. Setting up the myFaves project
            2. Populating the myFaves relational database from Solr
            3. Building Solr indexes from a relational database
            4. Completing the myFaves website
          3. Which Rails/Ruby library should I use?
        8. Nutch for crawling web pages
        9. Solr and Hadoop
          1. HDFS
          2. Indexing via MapReduce
            1. Morphlines
          3. Running a Solr build using Hadoop
            1. Looking at the storage
            2. The data ingestion process
        10. ManifoldCF – a connector framework
          1. Connectors
          2. Putting ManifoldCF to use
        11. Document-level security
        12. Summary
      17. 10. Scaling Solr
        1. Tuning complex systems is hard
        2. Use SolrMeter to test Solr performance
        3. Optimizing a single Solr server – scale up
          1. Configuring JVM settings to improve memory usage
            1. Using MMapDirectoryFactory to leverage additional virtual memory
          2. Enabling downstream HTTP caching to reduce load
          3. Solr caching
            1. Tuning caches
          4. Indexing performance
            1. Designing the schema
            2. Sending data to Solr in bulk
            3. Disabling unique key checking
            4. Index optimization and mergeFactor settings
          5. Enhancing faceting performance
          6. Using term vectors
          7. Improving phrase search performance
        4. Configuring Solr for near real-time search
        5. Use SolrCloud to go big – scale wide
          1. SolrCloud glossary
          2. Launching Solr in SolrCloud mode
          3. Managing collections and configurations
            1. Stand up SolrCloud for our MusicBrainz artists index
            2. Choosing the replication factor and number of shards
            3. Creating and deleting collections
            4. Replicas and leaders
            5. Document routing
            6. Shard splitting
            7. Dealing with long running collection tasks
            8. Adding nodes
        6. Summary
      18. 11. Deployment
        1. Deployment methodology for Solr
          1. Questions to ask
        2. Installing Solr into a Servlet container
          1. Differences between Servlet containers
            1. Defining the solr.home property
        3. Configuring logging
          1. HTTP server request access logs
          2. Solr application logging
            1. Configuring logging output
            2. Jetty startup integration
            3. Managing log levels at runtime
        4. A RequestHandler per search interface
        5. Leveraging Solr cores
          1. Configuring solr.xml
            1. Property substitution
            2. Include fragments of XML with XInclude
          2. Managing cores
          3. Some uses of multiple cores
        6. Setting up ZooKeeper for SolrCloud
          1. Installing ZooKeeper
          2. Administering Data in ZooKeeper
        7. Monitoring Solr performance
          1. Stats Admin interface
          2. Monitoring Solr via JMX
            1. Starting Solr with JMX
        8. Securing Solr from prying eyes
          1. Limiting server access
            1. Put Solr behind a Proxy
            2. Securing public searches
            3. Controlling JMX access
          2. Securing index data
            1. Controlling document access
            2. Other things to look at
        9. Summary
      19. A. Quick Reference
        1. Core search
        2. Diagnostic
        3. The Lucene query parser
        4. The DisMax query parser
        5. The Lucene query syntax
        6. Faceting
        7. Highlighting
        8. Spell checking
        9. Miscellaneous nonsearch
      20. Index