Mastering Elasticsearch - Second Edition

Book Description

Further your knowledge of the Elasticsearch server by learning more about its internals, querying, and data handling.

In Detail

Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, open source search and analytics engine. It builds on the capabilities of Apache Lucene, providing a new level of control over how you index and search even huge sets of data.

This book covers intermediate and advanced Elasticsearch functionality and walks you through its internals, including caches, the Apache Lucene library, and its monitoring capabilities. You'll learn how to use Elasticsearch configuration parameters in practice and how to work with the monitoring API.

With this book, you'll delve into Elasticsearch's query rewrite, query templates, bulk operations, document grouping, and function score queries. You will also learn how to improve the user search experience and how index distribution, segment statistics, and merging work. By the end of the book, you will be able to enhance Elasticsearch's performance and create your own Elasticsearch plugins.

What You Will Learn

  • Understand Apache Lucene and Elasticsearch's design and architecture

  • Use and configure different scoring models to alter the default scoring mechanism

  • Choose the appropriate number of shards and replicas for your deployment

  • Improve user search experience by utilizing Elasticsearch functionality

  • Control segment merging and learn why Elasticsearch uses merging

  • Develop custom Elasticsearch plugins, working through detailed examples of how to extend Elasticsearch

  • Apply your knowledge to create scalable, efficient, and fault-tolerant clusters, and monitor them using the Elasticsearch API

    Table of Contents

    1. Mastering Elasticsearch Second Edition
      1. Table of Contents
      2. Mastering Elasticsearch Second Edition
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Author
      7. Acknowledgments
      8. About the Reviewers
      9. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      10. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      11. 1. Introduction to Elasticsearch
        1. Introducing Apache Lucene
          1. Getting familiar with Lucene
          2. Overall architecture
            1. Getting deeper into Lucene index
              1. Norms
              2. Term vectors
              3. Posting formats
              4. Doc values
          3. Analyzing your data
            1. Indexing and querying
          4. Lucene query language
            1. Understanding the basics
            2. Querying fields
            3. Term modifiers
            4. Handling special characters
        2. Introducing Elasticsearch
          1. Basic concepts
            1. Index
            2. Document
            3. Type
            4. Mapping
            5. Node
            6. Cluster
            7. Shard
            8. Replica
          2. Key concepts behind Elasticsearch architecture
          3. Workings of Elasticsearch
            1. The startup process
            2. Failure detection
          4. Communicating with Elasticsearch
            1. Indexing data
            2. Querying data
        3. The story
        4. Summary
      12. 2. Power User Query DSL
        1. Default Apache Lucene scoring explained
          1. When a document is matched
          2. TF/IDF scoring formula
            1. Lucene conceptual scoring formula
            2. Lucene practical scoring formula
          3. Elasticsearch point of view
          4. An example
        2. Query rewrite explained
          1. Prefix query as an example
          2. Getting back to Apache Lucene
          3. Query rewrite properties
        3. Query templates
          1. Introducing query templates
            1. Templates as strings
          2. The Mustache template engine
            1. Conditional expressions
            2. Loops
            3. Default values
          3. Storing templates in files
        4. Handling filters and why it matters
          1. Filters and query relevance
          2. How filters work
            1. Bool or and/or/not filters
          3. Performance considerations
          4. Post filtering and filtered query
          5. Choosing the right filtering method
        5. Choosing the right query for the job
          1. Query categorization
            1. Basic queries
            2. Compound queries
            3. Not analyzed queries
            4. Full text search queries
            5. Pattern queries
            6. Similarity supporting queries
            7. Score altering queries
            8. Position aware queries
            9. Structure aware queries
          2. The use cases
            1. Example data
            2. Basic queries use cases
              1. Searching for values in range
              2. Simplified query for multiple terms
            3. Compound queries use cases
              1. Boosting some of the matched documents
              2. Ignoring lower scoring partial queries
            4. Not analyzed queries use cases
              1. Limiting results to given tags
              2. Efficient query time stopwords handling
            5. Full text search queries use cases
              1. Using Lucene query syntax in queries
              2. Handling user queries without errors
            6. Pattern queries use cases
              1. Autocomplete using prefixes
              2. Pattern matching
            7. Similarity supporting queries use cases
              1. Finding terms similar to a given one
              2. Finding documents with similar field values
            8. Score altering queries use cases
              1. Favoring newer books
              2. Decreasing importance of books with certain value
            9. Position aware queries use cases
              1. Matching phrases
              2. Spans, spans everywhere
            10. Structure aware queries use cases
              1. Returning parent documents having a certain nested document
              2. Affecting parent document score with the score of nested documents
        6. Summary
      13. 3. Not Only Full Text Search
        1. Query rescoring
          1. What is query rescoring?
          2. An example query
          3. Structure of the rescore query
          4. Rescore parameters
            1. Choosing the scoring mode
          5. To sum up
        2. Controlling multimatching
          1. Multimatch types
            1. Best fields matching
            2. Cross fields matching
            3. Most fields matching
            4. Phrase matching
            5. Phrase with prefixes matching
        3. Significant terms aggregation
          1. An example
          2. Choosing significant terms
          3. Multiple values analysis
            1. Significant terms aggregation and full text search fields
          4. Additional configuration options
            1. Controlling the number of returned buckets
            2. Background set filtering
            3. Minimum document count
            4. Execution hint
            5. More options
          5. There are limits
            1. Memory consumption
            2. Shouldn't be used as top-level aggregation
            3. Counts are approximated
            4. Floating point fields are not allowed
        4. Documents grouping
          1. Top hits aggregation
          2. An example
            1. Additional parameters
        5. Relations between documents
          1. The object type
          2. The nested documents
          3. Parent–child relationship
            1. Parent–child relationship in the cluster
          4. A few words about alternatives
        6. Scripting changes between Elasticsearch versions
          1. Scripting changes
            1. Security issues
            2. Groovy – the new default scripting language
            3. Removal of MVEL language
          2. Short Groovy introduction
            1. Using Groovy as your scripting language
            2. Variable definition in scripts
            3. Conditionals
            4. Loops
            5. An example
            6. There is more
          3. Scripting in full text context
            1. Field-related information
            2. Shard level information
            3. Term level information
              1. More advanced term information
          4. Lucene expressions explained
            1. The basics
            2. An example
            3. There is more
        7. Summary
      14. 4. Improving the User Search Experience
        1. Correcting user spelling mistakes
          1. Testing data
          2. Getting into technical details
          3. Suggesters
            1. Using the _suggest REST endpoint
            2. Understanding the REST endpoint suggester response
            3. Including suggestion requests in query
            4. The term suggester
              1. Configuration
              2. Common term suggester options
              3. Additional term suggester options
            5. The phrase suggester
              1. Usage example
              2. Configuration
              3. Basic configuration
              4. Configuring smoothing models
              5. Configuring candidate generators
              6. Configuring direct generators
            6. The completion suggester
              1. The logic behind the completion suggester
              2. Using the completion suggester
              3. Indexing data
              4. Querying data
              5. Custom weights
              6. Additional parameters
        2. Improving the query relevance
          1. Data
          2. The quest for relevance improvement
            1. The standard query
            2. The multi match query
            3. Phrases come into play
            4. Let's throw the garbage away
            5. Now, we boost
            6. Performing a misspelling-proof search
            7. Drill downs with faceting
        3. Summary
      15. 5. The Index Distribution Architecture
        1. Choosing the right amount of shards and replicas
          1. Sharding and overallocation
          2. A positive example of overallocation
          3. Multiple shards versus multiple indices
          4. Replicas
        2. Routing explained
          1. Shards and data
          2. Let's test routing
            1. Indexing with routing
          3. Routing in practice
            1. Querying
          4. Aliases
          5. Multiple routing values
        3. Altering the default shard allocation behavior
          1. Allocation awareness
            1. Forcing allocation awareness
          2. Filtering
            1. What include, exclude, and require mean
          3. Runtime allocation updating
            1. Index level updates
            2. Cluster level updates
          4. Defining total shards allowed per node
          5. Defining total shards allowed per physical server
            1. Inclusion
            2. Requirement
            3. Exclusion
            4. Disk-based allocation
        4. Query execution preference
          1. Introducing the preference parameter
        5. Summary
      16. 6. Low-level Index Control
        1. Altering Apache Lucene scoring
          1. Available similarity models
          2. Setting a per-field similarity
          3. Similarity model configuration
          4. Choosing the default similarity model
            1. Configuring the chosen similarity model
              1. Configuring the TF/IDF similarity
              2. Configuring the Okapi BM25 similarity
              3. Configuring the DFR similarity
              4. Configuring the IB similarity
              5. Configuring the LM Dirichlet similarity
              6. Configuring the LM Jelinek Mercer similarity
        2. Choosing the right directory implementation – the store module
          1. The store type
            1. The simple filesystem store
            2. The new I/O filesystem store
            3. The MMap filesystem store
            4. The hybrid filesystem store
            5. The memory store
              1. Additional properties
            6. The default store type
            7. The default store type for Elasticsearch 1.3.0 and higher
            8. The default store type for Elasticsearch versions older than 1.3.0
        3. NRT, flush, refresh, and transaction log
          1. Updating the index and committing changes
            1. Changing the default refresh time
          2. The transaction log
            1. The transaction log configuration
          3. Near real-time GET
        4. Segment merging under control
          1. Choosing the right merge policy
            1. The tiered merge policy
            2. The log byte size merge policy
            3. The log doc merge policy
          2. Merge policies' configuration
            1. The tiered merge policy
            2. The log byte size merge policy
            3. The log doc merge policy
          3. Scheduling
            1. The concurrent merge scheduler
            2. The serial merge scheduler
            3. Setting the desired merge scheduler
        5. When it is too much for I/O – throttling explained
          1. Controlling I/O throttling
          2. Configuration
            1. The throttling type
            2. Maximum throughput per second
            3. Node throttling defaults
            4. Performance considerations
            5. The configuration example
        6. Understanding Elasticsearch caching
          1. The filter cache
            1. Filter cache types
            2. Node-level filter cache configuration
            3. Index-level filter cache configuration
          2. The field data cache
            1. Field data or doc values
            2. Node-level field data cache configuration
            3. Index-level field data cache configuration
            4. The field data cache filtering
              1. Adding field data filtering information
              2. Filtering by term frequency
              3. Filtering by regex
              4. Filtering by regex and term frequency
              5. The filtering example
            5. Field data formats
              1. String-based fields
              2. Numeric fields
              3. Geographical-based fields
            6. Field data loading
          3. The shard query cache
            1. Setting up the shard query cache
          4. Using circuit breakers
            1. The field data circuit breaker
            2. The request circuit breaker
            3. The total circuit breaker
          5. Clearing the caches
          6. Index, indices, and all caches clearing
            1. Clearing specific caches
        7. Summary
      17. 7. Elasticsearch Administration
        1. Discovery and recovery modules
          1. Discovery configuration
            1. Zen discovery
              1. Multicast Zen discovery configuration
              2. The unicast Zen discovery configuration
          2. Master node
            1. Configuring master and data nodes
              1. Configuring data-only nodes
              2. Configuring master-only nodes
              3. Configuring the query processing-only nodes
            2. The master election configuration
              1. Zen discovery fault detection and configuration
            3. The Amazon EC2 discovery
              1. The EC2 plugin installation
              2. The EC2 plugin's generic configuration
              3. Optional EC2 discovery configuration options
              4. The EC2 nodes scanning configuration
            4. Other discovery implementations
          3. The gateway and recovery configuration
            1. The gateway recovery process
            2. Configuration properties
            3. Expectations on nodes
            4. The local gateway
            5. Low-level recovery configuration
              1. Cluster-level recovery configuration
              2. Index-level recovery settings
          4. The indices recovery API
        2. The human-friendly status API – using the Cat API
          1. The basics
          2. Using the Cat API
            1. Common arguments
          3. The examples
            1. Getting information about the master node
            2. Getting information about the nodes
        3. Backing up
          1. Saving backups in the cloud
            1. The S3 repository
            2. The HDFS repository
            3. The Azure repository
        4. Federated search
          1. The test clusters
          2. Creating the tribe node
            1. Using the unicast discovery for tribes
          3. Reading data with the tribe node
            1. Master-level read operations
          4. Writing data with the tribe node
            1. Master-level write operations
          5. Handling indices conflicts
          6. Blocking write operations
        5. Summary
      18. 8. Improving Performance
        1. Using doc values to optimize your queries
          1. The problem with field data cache
          2. The example of doc values usage
        2. Knowing about garbage collector
          1. Java memory
            1. The life cycle of Java objects and garbage collections
          2. Dealing with garbage collection problems
            1. Turning on logging of garbage collection work
            2. Using JStat
            3. Creating memory dumps
            4. More information on the garbage collector work
            5. Adjusting the garbage collector work in Elasticsearch
              1. Using a standard start up script
              2. Service wrapper
          3. Avoid swapping on Unix-like systems
        3. Benchmarking queries
          1. Preparing your cluster configuration for benchmarking
          2. Running benchmarks
          3. Controlling currently run benchmarks
        4. Very hot threads
          1. Usage clarification for the Hot Threads API
          2. The Hot Threads API response
        5. Scaling Elasticsearch
          1. Vertical scaling
          2. Horizontal scaling
            1. Automatically creating replicas
            2. Redundancy and high availability
            3. Cost and performance flexibility
            4. Continuous upgrades
            5. Multiple Elasticsearch instances on a single physical machine
              1. Preventing the shard and its replicas from being on the same node
            6. Designated nodes' roles for larger clusters
              1. Query aggregator nodes
              2. Data nodes
              3. Master eligible nodes
          3. Using Elasticsearch for high load scenarios
            1. General Elasticsearch-tuning advices
              1. Choosing the right store
              2. The index refresh rate
              3. Thread pools tuning
              4. Adjusting the merge process
              5. Data distribution
            2. Advices for high query rate scenarios
              1. Filter caches and shard query caches
              2. Think about the queries
              3. Using routing
              4. Parallelize your queries
              5. Field data cache and breaking the circuit
              6. Keeping size and shard_size under control
            3. High indexing throughput scenarios and Elasticsearch
              1. Bulk indexing
              2. Doc values versus indexing speed
              3. Keep your document fields under control
              4. The index architecture and replication
              5. Tuning write-ahead log
              6. Think about storage
              7. RAM buffer for indexing
        6. Summary
      19. 9. Developing Elasticsearch Plugins
        1. Creating the Apache Maven project structure
        2. Understanding the basics
          1. The structure of the Maven Java project
          2. The idea of POM
          3. Running the build process
          4. Introducing the assembly Maven plugin
        3. Creating custom REST action
          1. The assumptions
          2. Implementation details
            1. Using the REST action class
              1. The constructor
              2. Handling requests
              3. Writing response
            2. The plugin class
            3. Informing Elasticsearch about our REST action
            4. Time for testing
            5. Building the REST action plugin
            6. Installing the REST action plugin
            7. Checking whether the REST action plugin works
        4. Creating the custom analysis plugin
          1. Implementation details
            1. Implementing TokenFilter
            2. Implementing the TokenFilter factory
            3. Implementing the class custom analyzer
            4. Implementing the analyzer provider
            5. Implementing the analysis binder
            6. Implementing the analyzer indices component
            7. Implementing the analyzer module
            8. Implementing the analyzer plugin
            9. Informing Elasticsearch about our custom analyzer
          2. Testing our custom analysis plugin
            1. Building our custom analysis plugin
            2. Installing the custom analysis plugin
            3. Checking whether our analysis plugin works
        5. Summary
      20. Index