Mastering Elasticsearch 5.x - Third Edition

Book Description

Master the intricacies of Elasticsearch 5 and use it to create flexible and scalable search solutions

About This Book

  • Master the searching, indexing, and aggregation features in Elasticsearch
  • Improve users' search experience with Elasticsearch's functionalities and develop your own Elasticsearch plugins
  • A comprehensive, step-by-step guide to mastering the intricacies of Elasticsearch with ease

Who This Book Is For

If you have some prior working experience with Elasticsearch and want to take your knowledge to the next level, this book will be the perfect resource for you. If you are a developer who wants to implement scalable search solutions with Elasticsearch, this book will also help you. Some basic knowledge of the query DSL and data indexing is required to make the best use of this book.

What You Will Learn

  • Understand Apache Lucene and Elasticsearch 5's design and architecture
  • Use and configure the new and improved default text scoring mechanism in Apache Lucene 6
  • Know how to overcome the pitfalls while handling relational data in Elasticsearch
  • Learn to choose the right queries for your use cases and master the scripting module, including the new default scripting language, Painless
  • Explore the right way of scaling production clusters to improve the performance of Elasticsearch
  • Master the searching, indexing, and aggregation features in Elasticsearch
  • Develop your own Elasticsearch plugins to extend the functionalities of Elasticsearch

In Detail

Elasticsearch is a modern, fast, distributed, scalable, fault-tolerant, open source search and analytics engine. It builds on the capabilities of Apache Lucene and gives you a new level of control over how you index and search even huge sets of data. With this book, you will finally be able to fully utilize the power that Elasticsearch provides.

This book will give you a brief recap of the basics and also introduce you to the new features of Elasticsearch 5. We will guide you through the intermediate and advanced functionalities of Elasticsearch, such as querying, indexing, searching, and modifying data. We'll also explore advanced concepts, including aggregation, index control, sharding, replication, and clustering.

We'll show you the monitoring and administration modules available in Elasticsearch, and will also cover backup and recovery. You will get an understanding of how you can scale your Elasticsearch cluster to contextualize it and improve its performance. We'll also show you how you can create your own analysis plugin in Elasticsearch.

By the end of the book, you will have all the knowledge necessary to master Elasticsearch and put it to efficient use.

Style and approach

This comprehensive guide covers intermediate and advanced concepts in Elasticsearch as well as their implementation. An easy-to-follow approach means you'll be able to master even advanced querying, searching, and administration tasks with ease.

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

  1. Mastering Elasticsearch 5.x - Third Edition
    1. Mastering Elasticsearch 5.x - Third Edition
    2. Credits
    3. About the Author
    4. Acknowledgements
    5. About the Reviewer
    6. www.PacktPub.com
      1. Why subscribe?
    7. Customer Feedback
    8. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    9. 1. Revisiting Elasticsearch and the Changes
      1. An overview of Lucene
        1. Getting deeper into the Lucene index
          1. Inverted index
          2. Segments
          3. Norms
          4. Term vectors
          5. Posting formats
          6. Doc values
          7. Document analysis
          8. Basics of the Lucene query language
          9. Querying fields
          10. Term modifiers
          11. Handling special characters
        2. An overview of Elasticsearch
          1. The key concepts
          2. Working of Elasticsearch
      2. Introducing Elasticsearch 5.x
        1. Introducing new features in Elasticsearch
          1. New features in Elasticsearch 5.x
          2. New features in Elasticsearch 2.x
        2. The changes in Elasticsearch
          1. Changes between 1.x to 2.x
          2. Mapping changes
          3. Query and filter changes
          4. Security, reliability, and networking changes
          5. Monitoring parameter changes
        3. Changes between 2.x to 5.x
          1. Mapping changes
            1. No more string fields
            2. Floats are default
          2. Changes in numeric fields
          3. Changes in geo_point fields
          4. Some more changes
      3. Summary
    10. 2. The Improved Query DSL
      1. The changed default text scoring in Lucene - BM25
        1. Precision versus recall
        2. Recalling TF-IDF
          1. Introducing BM25 scoring
          2. BM25 scoring formula
          3. Example - tuning BM25 with custom similarity
        3. How BM25 differs from TF-IDF
          1. Saturation point
          2. Average document length
      2. Re-factored Query DSL
      3. Choosing the right query for the job
        1. Query categorization
          1. Basic queries
          2. Compound queries
          3. Understanding bool queries
          4. Non-analyzed queries
          5. Full text search queries
          6. Pattern queries
          7. Similarity supporting queries
          8. Score altering queries
          9. Position aware queries
          10. Structure aware queries
        2. The use cases
          1. Example data
          2. Basic queries use cases
            1. Searching for values in range
          3. Compound queries use cases
            1. A Boolean query for multiple terms
            2. Boosting some of the matched documents
            3. Ignoring lower scoring partial queries
          4. Not analyzed queries use cases
            1. Limiting results to given tags
          5. Full text search queries use cases
            1. Using Lucene query syntax in queries
            2. Handling user queries without errors
          6. Pattern queries use cases
            1. Autocomplete using prefixes
            2. Pattern matching
          7. Similarity supporting queries use cases
            1. Finding terms similar to a given one
          8. Score altering query use cases
            1. Decreasing importance of books with a certain value
          9. Pattern query use cases
            1. Matching phrases
            2. Spans, spans everywhere
        3. Some more important changes in Query DSL
      4. Query rewrite explained
        1. Prefix query as an example
        2. Getting back to Apache Lucene
        3. Query rewrite properties
          1. An example
      5. Query templates
        1. Introducing search templates
        2. The Mustache template engine
          1. Conditional expressions
          2. Loops
          3. Default values
          4. Storing templates in files
          5. Storing templates in a cluster
      6. Summary
    11. 3. Beyond Full Text Search
      1. Controlling multimatching
      2. Multimatch types
        1. Best fields matching
        2. Cross fields matching
        3. Most fields matching
        4. Phrase matching
        5. Phrase with prefixes matching
      3. Controlling scores using the function score query
      4. Built-in functions under the function score query
        1. The weight function
        2. The field value factor function
        3. The script score function
        4. Decay functions - linear, exp, and gauss
      5. Query rescoring
        1. What is query rescoring?
      6. Structure of the rescore query
        1. Rescore parameters
          1. To sum up
      7. Elasticsearch scripting
        1. The syntax
        2. Scripting changes across different versions
      8. Painless - the new default scripting language
        1. Using Painless as your scripting language
          1. Variable definition in scripts
          2. Conditionals
          3. Loops
        2. An example
        3. Sorting results based on scripts
        4. Sorting based on multiple fields
      9. Lucene expressions
        1. The basics
        2. An example
      10. Summary
    12. 4. Data Modeling and Analytics
      1. Data modeling techniques in Elasticsearch
      2. Managing relational data in Elasticsearch
        1. The object type
        2. The nested documents
        3. Parent - child relationship
          1. Parent-child relationship in the cluster
          2. Finding child documents with a parent ID query
        4. A few words about alternatives
        5. An example of data denormalization
      3. Data analytics using aggregations
        1. Instant aggregations in Elasticsearch 5.0
        2. Revisiting aggregations
          1. Metric aggregations
          2. Bucket aggregations
          3. Pipeline aggregations
            1. Calculating average monthly sales using avg_bucket aggregation
            2. Calculating the derivative for the sum of the monthly sale
        3. The new aggregation category - Matrix aggregation
          1. Understanding matrix stats
          2. Dealing with missing values
      4. Summary
    13. 5. Improving the User Search Experience
      1. Correcting user spelling mistakes
        1. Testing data
        2. Getting into technical details
      2. Suggesters
        1. Using a suggester under the _search endpoint
          1. Understanding the suggester response
          2. Multiple suggestion types for the same suggestion text
        2. The term suggester
          1. Configuring the Elasticsearch term suggester
            1. Common term suggester options
            2. Additional term suggester options
        3. The phrase suggester
          1. Usage example
          2. Configuring the phrase suggester
            1. Basic configuration
            2. Configuring smoothing models
            3. Configuring candidate generators
        4. The completion suggester
          1. The logic behind the completion suggester
            1. Using the completion suggester
          2. Indexing data
          3. Querying data
            1. Custom weights
          4. Using fuzziness with the completion suggester
      3. Implementing your own auto-completion
        1. Creating an index
          1. Understanding the parameters
            1. Configuring settings
            2. Configuring mappings
          2. Indexing documents
          3. Querying documents for auto-completion
      4. Working with synonyms
        1. Preparing settings for synonym search
        2. Formatting synonyms
        3. Synonym expansion versus contraction
      5. Summary
    14. 6. The Index Distribution Architecture
      1. Configuring an example multi-node cluster
      2. Choosing the right amount of shards and replicas
        1. Sharding and overallocation
        2. A positive example of overallocation
        3. Multiple shards versus multiple indices
          1. Replicas
      3. Routing explained
        1. Shards and data
        2. Let's test routing
        3. Indexing with routing
        4. Routing in practice
        5. Querying
        6. Aliases
        7. Multiple routing values
      4. Shard allocation control
        1. Allocation awareness
          1. Forcing allocation awareness
          2. Shard allocation filtering
            1. What include, exclude, and require mean
          3. Runtime allocation updating
            1. Index level updates
            2. Cluster level updates
        2. Defining total shards allowed per node
        3. Defining total shards allowed per physical server
          1. Inclusion
          2. Requirement
          3. Exclusion
          4. Disk-based allocation
      5. Query execution preference
        1. Introducing the preference parameter
        2. An example of using query execution preference
      6. Stripping data on multiple paths
      7. Index versus type - a revised approach for creating indices
      8. Summary
    15. 7. Low-Level Index Control
      1. Altering Apache Lucene scoring
      2. Available similarity models
      3. Setting a per-field similarity
      4. Similarity model configuration
      5. Choosing the default similarity model
        1. Configuring the chosen similarity model
          1. Configuring the TF-IDF similarity
          2. Configuring the BM25 similarity
          3. Configuring the DFR similarity
          4. Configuring the IB similarity
          5. Configuring the LM Dirichlet similarity
          6. Configuring the LM Jelinek Mercer similarity
      6. Choosing the right directory implementation - the store module
      7. The store type
        1. The simple file system store - simplefs
          1. The new I/O filesystem store - niofs
          2. The mmap filesystem store - mmapfs
          3. The default store type - fs
      8. NRT, flush, refresh, and transaction log
        1. Updating the index and committing changes
        2. Changing the default refresh time
        3. The transaction log
          1. The transaction log configuration
          2. Handling corrupted translogs
        4. Near real-time GET
      9. Segment merging under control
        1. Merge policy changes in Elasticsearch
        2. Configuring the tiered merge policy
        3. Merge scheduling
          1. The concurrent merge scheduler
        4. Force merging
      10. Understanding Elasticsearch caching
        1. Node query cache
          1. Configuring node query cache
        2. Shard request cache
          1. Enabling and disabling the shard request cache
          2. Request cache settings
          3. Cache invalidation
        3. The field data cache
          1. Field data or doc values
        4. Using circuit breakers
          1. The parent circuit breaker
          2. The field data circuit breaker
          3. The request circuit breaker
          4. In flight requests circuit breaker
          5. Script compilation circuit breaker
      11. Summary
    16. 8. Elasticsearch Administration
      1. Node types in Elasticsearch
        1. Data node
        2. Master node
        3. Ingest node
        4. Tribe node
        5. Coordinating nodes/Client nodes
      2. Discovery and recovery modules
        1. Discovery configuration
          1. Zen discovery
            1. The unicast Zen discovery configuration
            2. The master election configuration
            3. Zen discovery fault detection and configuration
            4. No Master Block
          2. The Amazon EC2 discovery
            1. The EC2 plugin installation
            2. The EC2 plugin's generic configuration
            3. Optional EC2 discovery configuration options
            4. The EC2 nodes scanning configuration
          3. Other discovery implementations
        2. The gateway and recovery configuration
          1. The gateway recovery process
          2. Configuration properties
          3. The local gateway
          4. Low-level recovery configuration
            1. Cluster-level recovery configuration
        3. The indices recovery API
      3. The human-friendly status API - using the cat API
        1. The basics of cat API
        2. Using the cat API
          1. Cat API common arguments
          2. The examples of cat API
            1. Getting information about the master node
            2. Getting information about the nodes
          3. Changes in cat API - Elasticsearch 5.0
            1. Host field removed from the cat nodes API
            2. Changes to cat recovery API
            3. Changes to cat nodes API
            4. Changes to cat field data API
      4. Backing up
        1. The snapshot API
        2. Saving backups on a filesystem
          1. Creating snapshot
            1. Registering repository path
            2. Registering shared file system repository in Elasticsearch
            3. Creating snapshots
            4. Getting snapshot information
            5. Deleting snapshots
        3. Saving backups in the cloud
          1. The S3 repository
          2. The HDFS repository
          3. The Azure repository
          4. The Google cloud storage repository
      5. Restoring snapshots
        1. Example - restoring a snapshot
          1. Restoring multiple indices
          2. Renaming indices
          3. Partial restore
          4. Changing index settings during restore
          5. Restoring to different cluster
      6. Summary
    17. 9. Data Transformation and Federated Search
      1. Preprocessing data within Elasticsearch with ingest nodes
        1. Working with ingest pipeline
          1. The ingest APIs
            1. Creating a pipeline
            2. Getting pipeline details
            3. Deleting a pipeline
            4. Simulating pipelines for debugging purposes
        2. Handling errors in pipelines
          1. Tagging errors within the same document and index
          2. Indexing error prone documents in a different index
          3. Ignoring errors altogether
        3. Working with ingest processors
          1. Append processor
          2. Convert processor
          3. Grok processor
      2. Federated search
        1. The test clusters
        2. Creating the tribe node
        3. Reading data with the tribe node
        4. Master-level read operations
        5. Writing data with the tribe node
        6. Master-level write operations
        7. Handling indices conflicts
        8. Blocking write operations
      3. Summary
    18. 10. Improving Performance
      1. Query validation and profiling
        1. Validating expensive queries before execution
        2. Query profiling for detailed query execution reports
          1. Understanding the profile API response
        3. Consideration for profiling usage
      2. Very hot threads
        1. Usage clarification for the hot threads API
        2. The hot threads API response
      3. Scaling Elasticsearch
        1. Vertical scaling
        2. Horizontal scaling
          1. Automatically creating replicas
          2. Redundancy and high availability
          3. Cost and performance flexibility
          4. Continuous upgrades
          5. Multiple Elasticsearch instances on a single physical machine
            1. Preventing the shard and its replicas from being on the same node
          6. Designated nodes' roles for larger clusters
            1. Query aggregator nodes
            2. Data nodes
            3. Master eligible nodes
        3. Using Elasticsearch for high load scenarios
          1. General Elasticsearch-tuning advice
            1. The index refresh rate
            2. Thread pools tuning
            3. Data distribution
          2. Advice for high query rate scenarios
            1. Node query cache and shard query cache
            2. Think about the queries
            3. Using routing
            4. Parallelize your queries
            5. Keeping size and shard_size under control
          3. High indexing throughput scenarios and Elasticsearch
            1. Bulk indexing
            2. Keeping your document fields under control
            3. The index architecture and replication
            4. Tuning the write-ahead log
            5. Thinking about storage
            6. RAM buffer for indexing
      4. Managing time-based indices efficiently using shrink and rollover APIs
        1. The shrink API
          1. Requirements for indices to be shrunk
          2. Shrinking an index
        2. Rollover API
          1. Using the rollover API
            1. Passing additional settings with a rollover request
            2. Pattern for creating new index name
      5. Summary
    19. 11. Developing Elasticsearch Plugins
      1. Creating the Apache Maven project structure
        1. Understanding the basics
        2. The structure of the Maven Java project
          1. The idea of POM
          2. Running the build process
          3. Introducing the assembly Maven plugin
            1. Understanding the plugin descriptor file
      2. Creating a custom REST action
        1. The assumptions
        2. Implementation details
          1. Using the REST action class
            1. The constructor
            2. Handling requests
            3. Writing responses
          2. The plugin class
          3. Informing Elasticsearch about our REST action
        3. Time for testing
          1. Building the REST action plugin
          2. Installing the REST action plugin
        4. Checking whether the REST action plugin works
      3. Creating the custom analysis plugin
        1. Implementation details
          1. Implementing TokenFilter
          2. Implementing the TokenFilter factory
          3. Implementing the class custom analyzer
          4. Implementing the analyzer provider
          5. Implementing the analyzer plugin
          6. Informing Elasticsearch about our custom analyzer
        2. Testing our custom analysis plugin
          1. Building our custom analysis plugin
          2. Installing the custom analysis plugin
          3. Checking whether our analysis plugin works
      4. Summary
    20. 12. Introducing Elastic Stack 5.0
      1. Overview of Elastic Stack 5.0
      2. Introducing Logstash, Beats, and Kibana
        1. Working with Logstash
          1. Logstash architecture
          2. Installing Logstash
            1. Installing Logstash from binaries
            2. Installing Logstash from APT repositories
            3. Installing Logstash from YUM repositories
          3. Configuring Logstash
          4. Example - shipping system logs using Logstash
          5. Starting Logstash
        2. Introducing Beats as data shippers
          1. Working with Metricbeat
            1. Installing Metricbeat
            2. Configuring Metricbeat
            3. Running Metricbeat
            4. Loading a sample Kibana dashboard into Elasticsearch
        3. Working with Kibana
          1. Installing Kibana
          2. Kibana configuration
          3. Starting Kibana
          4. Exploring and visualizing data on Kibana
            1. Understanding the Kibana Management screen
            2. Discovering data on Kibana
            3. Using the Dashboard screen to create/load dashboards
          5. Using Sense
      3. Summary