Elasticsearch: The Definitive Guide

Book Description

Whether you need full-text search or real-time analytics of structured data—or both—the Elasticsearch distributed search engine is an ideal way to put your data to work. This practical guide not only shows you how to search, analyze, and explore data with Elasticsearch, but also helps you deal with the complexities of human language, geolocation, and relationships.

Table of Contents

  1. Foreword
  2. Preface
    1. Who Should Read This Book
    2. Why We Wrote This Book
    3. Elasticsearch Version
    4. How to Read This Book
    5. Navigating This Book
    6. Online Resources
    7. Conventions Used in This Book
    8. Using Code Examples
    9. Safari® Books Online
    10. How to Contact Us
    11. Acknowledgments
  3. I. Getting Started
    1. 1. You Know, for Search…
      1. Installing Elasticsearch
        1. Installing Marvel
      2. Running Elasticsearch
        1. Viewing Marvel and Sense
      3. Talking to Elasticsearch
        1. Java API
        2. RESTful API with JSON over HTTP
      4. Document Oriented
        1. JSON
      5. Finding Your Feet
        1. Let’s Build an Employee Directory
      6. Indexing Employee Documents
      7. Retrieving a Document
      8. Search Lite
      9. Search with Query DSL
      10. More-Complicated Searches
      11. Full-Text Search
      12. Phrase Search
      13. Highlighting Our Searches
      14. Analytics
      15. Tutorial Conclusion
      16. Distributed Nature
      17. Next Steps
    2. 2. Life Inside a Cluster
      1. An Empty Cluster
      2. Cluster Health
      3. Add an Index
      4. Add Failover
      5. Scale Horizontally
        1. Then Scale Some More
      6. Coping with Failure
    3. 3. Data In, Data Out
      1. What Is a Document?
      2. Document Metadata
        1. _index
        2. _type
        3. _id
        4. Other Metadata
      3. Indexing a Document
        1. Using Our Own ID
        2. Autogenerating IDs
      4. Retrieving a Document
        1. Retrieving Part of a Document
      5. Checking Whether a Document Exists
      6. Updating a Whole Document
      7. Creating a New Document
      8. Deleting a Document
      9. Dealing with Conflicts
      10. Optimistic Concurrency Control
        1. Using Versions from an External System
      11. Partial Updates to Documents
        1. Using Scripts to Make Partial Updates
        2. Updating a Document That May Not Yet Exist
        3. Updates and Conflicts
      12. Retrieving Multiple Documents
      13. Cheaper in Bulk
        1. Don’t Repeat Yourself
        2. How Big Is Too Big?
    4. 4. Distributed Document Store
      1. Routing a Document to a Shard
      2. How Primary and Replica Shards Interact
      3. Creating, Indexing, and Deleting a Document
      4. Retrieving a Document
      5. Partial Updates to a Document
      6. Multidocument Patterns
        1. Why the Funny Format?
    5. 5. Searching—The Basic Tools
      1. The Empty Search
        1. hits
        2. took
        3. shards
        4. timeout
      2. Multi-index, Multitype
      3. Pagination
      4. Search Lite
        1. The _all Field
        2. More Complicated Queries
    6. 6. Mapping and Analysis
      1. Exact Values Versus Full Text
      2. Inverted Index
      3. Analysis and Analyzers
        1. Built-in Analyzers
        2. When Analyzers Are Used
        3. Testing Analyzers
        4. Specifying Analyzers
      4. Mapping
        1. Core Simple Field Types
        2. Viewing the Mapping
        3. Customizing Field Mappings
        4. Updating a Mapping
        5. Testing the Mapping
      5. Complex Core Field Types
        1. Multivalue Fields
        2. Empty Fields
        3. Multilevel Objects
        4. Mapping for Inner Objects
        5. How Inner Objects are Indexed
        6. Arrays of Inner Objects
    7. 7. Full-Body Search
      1. Empty Search
      2. Query DSL
        1. Structure of a Query Clause
        2. Combining Multiple Clauses
      3. Queries and Filters
        1. Performance Differences
        2. When to Use Which
      4. Most Important Queries and Filters
        1. term Filter
        2. terms Filter
        3. range Filter
        4. exists and missing Filters
        5. bool Filter
        6. match_all Query
        7. match Query
        8. multi_match Query
        9. bool Query
      5. Combining Queries with Filters
        1. Filtering a Query
        2. Just a Filter
        3. A Query as a Filter
      6. Validating Queries
        1. Understanding Errors
        2. Understanding Queries
    8. 8. Sorting and Relevance
      1. Sorting
        1. Sorting by Field Values
        2. Multilevel Sorting
        3. Sorting on Multivalue Fields
      2. String Sorting and Multifields
      3. What Is Relevance?
        1. Understanding the Score
        2. Understanding Why a Document Matched
      4. Fielddata
    9. 9. Distributed Search Execution
      1. Query Phase
      2. Fetch Phase
      3. Search Options
        1. preference
        2. timeout
        3. routing
        4. search_type
      4. scan and scroll
    10. 10. Index Management
      1. Creating an Index
      2. Deleting an Index
      3. Index Settings
      4. Configuring Analyzers
      5. Custom Analyzers
        1. Creating a Custom Analyzer
      6. Types and Mappings
        1. How Lucene Sees Documents
        2. How Types Are Implemented
        3. Avoiding Type Gotchas
      7. The Root Object
        1. Properties
        2. Metadata: _source Field
        3. Metadata: _all Field
        4. Metadata: Document Identity
      8. Dynamic Mapping
      9. Customizing Dynamic Mapping
        1. date_detection
        2. dynamic_templates
      10. Default Mapping
      11. Reindexing Your Data
      12. Index Aliases and Zero Downtime
    11. 11. Inside a Shard
      1. Making Text Searchable
        1. Immutability
      2. Dynamically Updatable Indices
        1. Deletes and Updates
      3. Near Real-Time Search
        1. refresh API
      4. Making Changes Persistent
        1. flush API
      5. Segment Merging
        1. optimize API
  4. II. Search in Depth
    1. 12. Structured Search
      1. Finding Exact Values
        1. term Filter with Numbers
        2. term Filter with Text
        3. Internal Filter Operation
      2. Combining Filters
        1. Bool Filter
        2. Nesting Boolean Filters
      3. Finding Multiple Exact Values
        1. Contains, but Does Not Equal
        2. Equals Exactly
      4. Ranges
        1. Ranges on Dates
        2. Ranges on Strings
      5. Dealing with Null Values
        1. exists Filter
        2. missing Filter
        3. exists/missing on Objects
      6. All About Caching
        1. Independent Filter Caching
        2. Controlling Caching
      7. Filter Order
    2. 13. Full-Text Search
      1. Term-Based Versus Full-Text
      2. The match Query
        1. Index Some Data
        2. A Single-Word Query
      3. Multiword Queries
        1. Improving Precision
        2. Controlling Precision
      4. Combining Queries
        1. Score Calculation
        2. Controlling Precision
      5. How match Uses bool
      6. Boosting Query Clauses
      7. Controlling Analysis
        1. Default Analyzers
        2. Configuring Analyzers in Practice
      8. Relevance Is Broken!
    3. 14. Multifield Search
      1. Multiple Query Strings
        1. Prioritizing Clauses
      2. Single Query String
        1. Know Your Data
      3. Best Fields
        1. dis_max Query
      4. Tuning Best Fields Queries
        1. tie_breaker
      5. multi_match Query
        1. Using Wildcards in Field Names
        2. Boosting Individual Fields
      6. Most Fields
        1. Multifield Mapping
      7. Cross-fields Entity Search
        1. A Naive Approach
        2. Problems with the most_fields Approach
      8. Field-Centric Queries
        1. Problem 1: Matching the Same Word in Multiple Fields
        2. Problem 2: Trimming the Long Tail
        3. Problem 3: Term Frequencies
        4. Solution
      9. Custom _all Fields
      10. cross-fields Queries
        1. Per-Field Boosting
      11. Exact-Value Fields
    4. 15. Proximity Matching
      1. Phrase Matching
        1. Term Positions
        2. What Is a Phrase
      2. Mixing It Up
      3. Multivalue Fields
      4. Closer Is Better
      5. Proximity for Relevance
      6. Improving Performance
        1. Rescoring Results
      7. Finding Associated Words
        1. Producing Shingles
        2. Multifields
        3. Searching for Shingles
        4. Performance
    5. 16. Partial Matching
      1. Postcodes and Structured Data
      2. prefix Query
      3. wildcard and regexp Queries
      4. Query-Time Search-as-You-Type
      5. Index-Time Optimizations
      6. Ngrams for Partial Matching
      7. Index-Time Search-as-You-Type
        1. Preparing the Index
        2. Querying the Field
        3. Edge n-grams and Postcodes
      8. Ngrams for Compound Words
    6. 17. Controlling Relevance
      1. Theory Behind Relevance Scoring
        1. Boolean Model
        2. Term Frequency/Inverse Document Frequency (TF/IDF)
        3. Vector Space Model
      2. Lucene’s Practical Scoring Function
        1. Query Normalization Factor
        2. Query Coordination
        3. Index-Time Field-Level Boosting
      3. Query-Time Boosting
        1. Boosting an Index
        2. t.getBoost()
      4. Manipulating Relevance with Query Structure
      5. Not Quite Not
        1. boosting Query
      6. Ignoring TF/IDF
        1. constant_score Query
      7. function_score Query
      8. Boosting by Popularity
        1. modifier
        2. factor
        3. boost_mode
        4. max_boost
      9. Boosting Filtered Subsets
        1. filter Versus query
        2. functions
        3. score_mode
      10. Random Scoring
      11. The Closer, The Better
      12. Understanding the price Clause
      13. Scoring with Scripts
      14. Pluggable Similarity Algorithms
        1. Okapi BM25
      15. Changing Similarities
        1. Configuring BM25
      16. Relevance Tuning Is the Last 10%
  5. III. Dealing with Human Language
    1. 18. Getting Started with Languages
      1. Using Language Analyzers
      2. Configuring Language Analyzers
      3. Pitfalls of Mixing Languages
        1. At Index Time
        2. At Query Time
        3. Identifying Language
      4. One Language per Document
        1. Foreign Words
      5. One Language per Field
      6. Mixed-Language Fields
        1. Split into Separate Fields
        2. Analyze Multiple Times
        3. Use n-grams
    2. 19. Identifying Words
      1. standard Analyzer
      2. standard Tokenizer
      3. Installing the ICU Plug-in
      4. icu_tokenizer
      5. Tidying Up Input Text
        1. Tokenizing HTML
        2. Tidying Up Punctuation
    3. 20. Normalizing Tokens
      1. In That Case
      2. You Have an Accent
        1. Retaining Meaning
      3. Living in a Unicode World
      4. Unicode Case Folding
      5. Unicode Character Folding
      6. Sorting and Collations
        1. Case-Insensitive Sorting
        2. Differences Between Languages
        3. Unicode Collation Algorithm
        4. Unicode Sorting
        5. Specifying a Language
        6. Customizing Collations
    4. 21. Reducing Words to Their Root Form
      1. Algorithmic Stemmers
        1. Using an Algorithmic Stemmer
      2. Dictionary Stemmers
      3. Hunspell Stemmer
        1. Installing a Dictionary
        2. Per-Language Settings
        3. Creating a Hunspell Token Filter
        4. Hunspell Dictionary Format
      4. Choosing a Stemmer
        1. Stemmer Performance
        2. Stemmer Quality
        3. Stemmer Degree
        4. Making a Choice
      5. Controlling Stemming
        1. Preventing Stemming
        2. Customizing Stemming
      6. Stemming in situ
        1. Is Stemming in situ a Good Idea
    5. 22. Stopwords: Performance Versus Precision
      1. Pros and Cons of Stopwords
      2. Using Stopwords
        1. Stopwords and the Standard Analyzer
        2. Maintaining Positions
        3. Specifying Stopwords
        4. Using the stop Token Filter
        5. Updating Stopwords
      3. Stopwords and Performance
        1. and Operator
        2. minimum_should_match
      4. Divide and Conquer
        1. Controlling Precision
        2. Only High-Frequency Terms
        3. More Control with Common Terms
      5. Stopwords and Phrase Queries
        1. Positions Data
        2. Index Options
        3. Stopwords
      6. common_grams Token Filter
        1. At Index Time
        2. Unigram Queries
        3. Bigram Phrase Queries
        4. Two-Word Phrases
      7. Stopwords and Relevance
    6. 23. Synonyms
      1. Using Synonyms
      2. Formatting Synonyms
      3. Expand or contract
        1. Simple Expansion
        2. Simple Contraction
        3. Genre Expansion
      4. Synonyms and The Analysis Chain
        1. Case-Sensitive Synonyms
      5. Multiword Synonyms and Phrase Queries
        1. Use Simple Contraction for Phrase Queries
        2. Synonyms and the query_string Query
      6. Symbol Synonyms
    7. 24. Typoes and Mispelings
      1. Fuzziness
      2. Fuzzy Query
        1. Improving Performance
      3. Fuzzy match Query
      4. Scoring Fuzziness
      5. Phonetic Matching
  6. IV. Aggregations
    1. 25. High-Level Concepts
      1. Buckets
      2. Metrics
      3. Combining the Two
    2. 26. Aggregation Test-Drive
      1. Adding a Metric to the Mix
      2. Buckets Inside Buckets
      3. One Final Modification
    3. 27. Building Bar Charts
    4. 28. Looking at Time
      1. Returning Empty Buckets
      2. Extended Example
      3. The Sky’s the Limit
    5. 29. Scoping Aggregations
    6. 30. Filtering Queries and Aggregations
      1. Filtered Query
      2. Filter Bucket
      3. Post Filter
      4. Recap
    7. 31. Sorting Multivalue Buckets
      1. Intrinsic Sorts
      2. Sorting by a Metric
      3. Sorting Based on “Deep” Metrics
    8. 32. Approximate Aggregations
      1. Finding Distinct Counts
        1. Understanding the Trade-offs
        2. Optimizing for Speed
      2. Calculating Percentiles
        1. Percentile Metric
        2. Percentile Ranks
        3. Understanding the Trade-offs
    9. 33. Significant Terms
      1. significant_terms Demo
        1. Recommending Based on Popularity
        2. Recommending Based on Statistics
    10. 34. Controlling Memory Use and Latency
      1. Fielddata
      2. Aggregations and Analysis
        1. High-Cardinality Memory Implications
      3. Limiting Memory Usage
        1. Fielddata Size
        2. Monitoring fielddata
        3. Circuit Breaker
      4. Fielddata Filtering
      5. Doc Values
        1. Enabling Doc Values
      6. Preloading Fielddata
        1. Eagerly Loading Fielddata
        2. Global Ordinals
        3. Index Warmers
      7. Preventing Combinatorial Explosions
        1. Depth-First Versus Breadth-First
    11. 35. Closing Thoughts
  7. V. Geolocation
    1. 36. Geo-Points
      1. Lat/Lon Formats
      2. Filtering by Geo-Point
      3. geo_bounding_box Filter
        1. Optimizing Bounding Boxes
      4. geo_distance Filter
        1. Faster Geo-Distance Calculations
        2. geo_distance_range Filter
      5. Caching geo-filters
      6. Reducing Memory Usage
      7. Sorting by Distance
        1. Scoring by Distance
    2. 37. Geohashes
      1. Mapping Geohashes
      2. geohash_cell Filter
    3. 38. Geo-aggregations
      1. geo_distance Aggregation
      2. geohash_grid Aggregation
      3. geo_bounds Aggregation
    4. 39. Geo-shapes
      1. Mapping geo-shapes
        1. precision
        2. distance_error_pct
      2. Indexing geo-shapes
      3. Querying geo-shapes
      4. Querying with Indexed Shapes
      5. Geo-shape Filters and Caching
  8. VI. Modeling Your Data
    1. 40. Handling Relationships
      1. Application-side Joins
      2. Denormalizing Your Data
      3. Field Collapsing
      4. Denormalization and Concurrency
        1. Renaming Files and Directories
      5. Solving Concurrency Issues
        1. Global Locking
        2. Document Locking
        3. Tree Locking
    2. 41. Nested Objects
      1. Nested Object Mapping
      2. Querying a Nested Object
      3. Sorting by Nested Fields
      4. Nested Aggregations
        1. reverse_nested Aggregation
        2. When to Use Nested Objects
    3. 42. Parent-Child Relationship
      1. Parent-Child Mapping
      2. Indexing Parents and Children
      3. Finding Parents by Their Children
        1. min_children and max_children
      4. Finding Children by Their Parents
      5. Children Aggregation
      6. Grandparents and Grandchildren
      7. Practical Considerations
        1. Memory Use
        2. Global Ordinals and Latency
        3. Multigenerations and Concluding Thoughts
    4. 43. Designing for Scale
      1. The Unit of Scale
      2. Shard Overallocation
      3. Kagillion Shards
      4. Capacity Planning
      5. Replica Shards
        1. Balancing Load with Replicas
      6. Multiple Indices
      7. Time-Based Data
        1. Index per Time Frame
      8. Index Templates
      9. Retiring Data
        1. Migrate Old Indices
        2. Optimize Indices
        3. Closing Old Indices
        4. Archiving Old Indices
      10. User-Based Data
      11. Shared Index
      12. Faking Index per User with Aliases
      13. One Big User
      14. Scale Is Not Infinite
  9. VII. Administration, Monitoring, and Deployment
    1. 44. Monitoring
      1. Marvel for Monitoring
      2. Cluster Health
        1. Drilling Deeper: Finding Problematic Indices
        2. Blocking for Status Changes
      3. Monitoring Individual Nodes
        1. indices Section
        2. OS and Process Sections
        3. JVM Section
        4. Threadpool Section
        5. FS and Network Sections
        6. Circuit Breaker
      4. Cluster Stats
      5. Index Stats
      6. Pending Tasks
      7. cat API
    2. 45. Production Deployment
      1. Hardware
        1. Memory
        2. CPUs
        3. Disks
        4. Network
        5. General Considerations
      2. Java Virtual Machine
      3. Transport Client Versus Node Client
      4. Configuration Management
      5. Important Configuration Changes
        1. Assign Names
        2. Paths
        3. Minimum Master Nodes
        4. Recovery Settings
        5. Prefer Unicast over Multicast
      6. Don’t Touch These Settings!
        1. Garbage Collector
        2. Threadpools
      7. Heap: Sizing and Swapping
        1. Give Half Your Memory to Lucene
        2. Don’t Cross 32 GB!
        3. Swapping Is the Death of Performance
      8. File Descriptors and MMap
      9. Revisit This List Before Production
    3. 46. Post-Deployment
      1. Changing Settings Dynamically
      2. Logging
        1. Slowlog
      3. Indexing Performance Tips
        1. Test Performance Scientifically
        2. Using and Sizing Bulk Requests
        3. Storage
        4. Segments and Merging
        5. Other
      4. Rolling Restarts
      5. Backing Up Your Cluster
        1. Creating the Repository
        2. Snapshotting All Open Indices
        3. Snapshotting Particular Indices
        4. Listing Information About Snapshots
        5. Deleting Snapshots
        6. Monitoring Snapshot Progress
        7. Canceling a Snapshot
      6. Restoring from a Snapshot
        1. Monitoring Restore Operations
        2. Canceling a Restore
      7. Clusters Are Living, Breathing Creatures
  10. Index