Elasticsearch Server - Third Edition

Book Description

Leverage Elasticsearch to create a robust, fast, and flexible search solution with ease

About This Book

  • Boost the searching capabilities of your system through synonyms, multilingual data handling, nested objects and parent-child documents

  • Dive deep into the world of data aggregation and data analysis with ElasticSearch

  • Explore a wide range of ElasticSearch modules that define the behavior of a cluster

Who This Book Is For

If you are a competent developer and want to learn about the great and exciting world of ElasticSearch, then this book is for you. No prior knowledge of Java or Apache Lucene is needed.

What You Will Learn

  • Configure, create, and retrieve data from your indices

  • Use the ElasticSearch query DSL to create a wide range of queries

  • Discover the highlighting and geographical search features offered by ElasticSearch

  • Find out how to index data that is not flat or that has relationships between documents

  • Exploit prospective search to find the queries that match a document, rather than the documents that match a query

  • Use the aggregations framework to get more from your data and improve your client’s search experience

  • Monitor your cluster state and health using the ElasticSearch API as well as third-party monitoring solutions

  • Discover how to properly set up ElasticSearch for various use cases

In Detail

ElasticSearch is a very fast and scalable open source search engine, designed with distribution and cloud in mind, complete with all the goodies that Apache Lucene has to offer. ElasticSearch’s schema-free architecture allows developers to index and search unstructured content, making it well suited to both small projects and large data warehouses, even those with petabytes of unstructured data.

This book will guide you through the world of the most commonly used ElasticSearch server functionalities. You’ll start off by getting an understanding of the basics of ElasticSearch and its data indexing functionality. Next, you will see the querying capabilities of ElasticSearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of ElasticSearch and will learn how cluster administration and scaling can be used to boost your application performance. You’ll find out how to use the friendly REST APIs and how to tune ElasticSearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project’s specifications.

Style and approach

This step-by-step guide is full of screenshots and real-world examples to take you on a journey through the wonderful world of full-text search provided by ElasticSearch.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Elasticsearch Server Third Edition
      1. Table of Contents
      2. Elasticsearch Server Third Edition
      3. Credits
      4. About the Authors
      5. About the Reviewer
      6. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Getting Started with Elasticsearch Cluster
        1. Full text searching
          1. The Lucene glossary and architecture
          2. Input data analysis
          3. Indexing and querying
          4. Scoring and query relevance
        2. The basics of Elasticsearch
          1. Key concepts of Elasticsearch
            1. Index
            2. Document
            3. Document type
            4. Mapping
          2. Key concepts of the Elasticsearch infrastructure
            1. Nodes and clusters
            2. Shards
            3. Replicas
            4. Gateway
          3. Indexing and searching
        3. Installing and configuring your cluster
          1. Installing Java
          2. Installing Elasticsearch
          3. Running Elasticsearch
          4. Shutting down Elasticsearch
          5. The directory layout
          6. Configuring Elasticsearch
          7. The system-specific installation and configuration
            1. Installing Elasticsearch on Linux
              1. Installing Elasticsearch using RPM packages
              2. Installing Elasticsearch using the DEB package
              3. Elasticsearch configuration file localization
            2. Configuring Elasticsearch as a system service on Linux
            3. Elasticsearch as a system service on Windows
        4. Manipulating data with the REST API
          1. Understanding the REST API
          2. Storing data in Elasticsearch
            1. Creating a new document
              1. Automatic identifier creation
          3. Retrieving documents
          4. Updating documents
            1. Dealing with non-existing documents
            2. Adding partial documents
          5. Deleting documents
          6. Versioning
            1. Usage example
            2. Versioning from external systems
        5. Searching with the URI request query
          1. Sample data
          2. URI search
            1. Elasticsearch query response
          3. Query analysis
          4. URI query string parameters
            1. The query
            2. The default search field
            3. Analyzer
            4. The default operator property
            5. Query explanation
            6. The fields returned
            7. Sorting the results
            8. The search timeout
            9. The results window
            10. Limiting per-shard results
            11. Ignoring unavailable indices
            12. The search type
            13. Lowercasing term expansion
            14. Wildcard and prefix analysis
          5. Lucene query syntax
        6. Summary
      9. 2. Indexing Your Data
        1. Elasticsearch indexing
          1. Shards and replicas
            1. Write consistency
          2. Creating indices
            1. Altering automatic index creation
            2. Settings for a newly created index
            3. Index deletion
        2. Mappings configuration
          1. Type determining mechanism
            1. Disabling the type determining mechanism
            2. Tuning the type determining mechanism for numeric types
            3. Tuning the type determining mechanism for dates
          2. Index structure mapping
            1. Type and types definition
            2. Fields
            3. Core types
              1. Common attributes
              2. String
              3. Number
              4. Boolean
              5. Binary
              6. Date
            4. Multi fields
            5. The IP address type
            6. Token count type
          3. Using analyzers
            1. Out-of-the-box analyzers
            2. Defining your own analyzers
            3. Default analyzers
          4. Different similarity models
            1. Setting per-field similarity
            2. Available similarity models
              1. Configuring default similarity
              2. Configuring BM25 similarity
              3. Configuring DFR similarity
              4. Configuring IB similarity
        3. Batch indexing to speed up your indexing process
          1. Preparing data for bulk indexing
          2. Indexing the data
          3. The _all field
          4. The _source field
          5. Additional internal fields
        4. Introduction to segment merging
          1. Segment merging
          2. The need for segment merging
          3. The merge policy
          4. The merge scheduler
          5. Throttling
        5. Introduction to routing
          1. Default indexing
          2. Default searching
          3. Routing
          4. The routing parameters
          5. Routing fields
        6. Summary
      10. 3. Searching Your Data
        1. Querying Elasticsearch
          1. The example data
          2. A simple query
          3. Paging and result size
          4. Returning the version value
          5. Limiting the score
          6. Choosing the fields that we want to return
          7. Source filtering
          8. Using the script fields
            1. Passing parameters to the script fields
        2. Understanding the querying process
          1. Query logic
          2. Search type
          3. Search execution preference
          4. Search shards API
        3. Basic queries
          1. The term query
          2. The terms query
          3. The match all query
          4. The type query
          5. The exists query
          6. The missing query
          7. The common terms query
          8. The match query
            1. The Boolean match query
            2. The phrase match query
            3. The match phrase prefix query
          9. The multi match query
          10. The query string query
            1. Running the query string query against multiple fields
          11. The simple query string query
          12. The identifiers query
          13. The prefix query
          14. The fuzzy query
          15. The wildcard query
          16. The range query
          17. Regular expression query
          18. The more like this query
        4. Compound queries
          1. The bool query
          2. The dis_max query
          3. The boosting query
          4. The constant_score query
          5. The indices query
        5. Using span queries
          1. A span
          2. Span term query
          3. Span first query
          4. Span near query
          5. Span or query
          6. Span not query
          7. Span within query
          8. Span containing query
          9. Span multi query
          10. Performance considerations
        6. Choosing the right query
          1. The use cases
            1. Limiting results to given tags
          2. Searching for values in a range
            1. Boosting some of the matched documents
            2. Ignoring lower scoring partial queries
            3. Using Lucene query syntax in queries
            4. Handling user queries without errors
            5. Autocomplete using prefixes
            6. Finding terms similar to a given one
            7. Matching phrases
            8. Spans, spans everywhere
        7. Summary
      11. 4. Extending Your Querying Knowledge
        1. Filtering your results
          1. The context is the key
          2. Explicit filtering with bool query
        2. Highlighting
          1. Getting started with highlighting
          2. Field configuration
          3. Under the hood
            1. Forcing highlighter type
          4. Configuring HTML tags
          5. Controlling highlighted fragments
          6. Global and local settings
          7. Require matching
          8. Custom highlighting query
          9. The Postings highlighter
        3. Validating your queries
          1. Using the Validate API
        4. Sorting data
          1. Default sorting
          2. Selecting fields used for sorting
            1. Sorting mode
          3. Specifying behavior for missing fields
          4. Dynamic criteria
          5. Calculate scoring when sorting
        5. Query rewrite
          1. Prefix query as an example
          2. Getting back to Apache Lucene
          3. Query rewrite properties
        6. Summary
      12. 5. Extending Your Index Structure
        1. Indexing tree-like structures
          1. Data structure
          2. Analysis
        2. Indexing data that is not flat
          1. Data
          2. Objects
          3. Arrays
          4. Mappings
            1. Final mappings
            2. Sending the mappings to Elasticsearch
          5. To be or not to be dynamic
          6. Disabling object indexing
        3. Using nested objects
          1. Scoring and nested queries
        4. Using the parent-child relationship
          1. Index structure and data indexing
            1. Child mappings
            2. Parent mappings
            3. The parent document
            4. Child documents
          2. Querying
            1. Querying data in the child documents
            2. Querying data in the parent documents
          3. Performance considerations
        5. Modifying your index structure with the update API
          1. The mappings
            1. Adding a new field to the existing index
            2. Modifying fields of an existing index
        6. Summary
      13. 6. Make Your Search Better
        1. Introduction to Apache Lucene scoring
          1. When a document is matched
          2. Default scoring formula
          3. Relevancy matters
        2. Scripting capabilities of Elasticsearch
          1. Objects available during script execution
          2. Script types
            1. In file scripts
            2. Inline scripts
            3. Indexed scripts
          3. Querying with scripts
          4. Scripting with parameters
          5. Script languages
          6. Using other than embedded languages
          7. Using native code
            1. The factory implementation
            2. Implementing the native script
            3. The plugin definition
            4. Installing the plugin
            5. Running the script
        3. Searching content in different languages
          1. Handling languages differently
          2. Handling multiple languages
          3. Detecting the language of the document
          4. Sample document
          5. The mappings
          6. Querying
            1. Queries with an identified language
            2. Queries with an unknown language
          7. Combining queries
        4. Influencing scores with query boosts
          1. The boost
          2. Adding the boost to queries
          3. Modifying the score
            1. Constant score query
            2. Boosting query
            3. The function score query
              1. Structure of the function query
              2. The weight factor function
              3. Field value factor function
              4. The script score function
              5. The random score function
              6. Decay functions
        5. When does index-time boosting make sense?
          1. Defining boosting in the mappings
        6. Words with the same meaning
          1. Synonym filter
            1. Synonyms in the mappings
            2. Synonyms stored on the file system
          2. Defining synonym rules
            1. Using Apache Solr synonyms
              1. Explicit synonyms
              2. Equivalent synonyms
              3. Expanding synonyms
            2. Using WordNet synonyms
          3. Query or index-time synonym expansion
        7. Understanding the explain information
          1. Understanding field analysis
          2. Explaining the query
        8. Summary
      14. 7. Aggregations for Data Analysis
        1. Aggregations
          1. General query structure
          2. Inside the aggregations engine
        2. Aggregation types
          1. Metrics aggregations
            1. Minimum, maximum, average, and sum
              1. Missing values
              2. Using scripts
            2. Field value statistics and extended statistics
            3. Value count
            4. Field cardinality
            5. Percentiles
            6. Percentile ranks
            7. Top hits aggregation
              1. Additional parameters
            8. Geo bounds aggregation
            9. Scripted metrics aggregation
          2. Buckets aggregations
            1. Filter aggregation
            2. Filters aggregation
            3. Terms aggregation
              1. Counts are approximate
              2. Minimum document count
            4. Range aggregation
              1. Keyed buckets
            5. Date range aggregation
            6. IPv4 range aggregation
            7. Missing aggregation
            8. Histogram aggregation
          3. Date histogram aggregation
            1. Time zones
          4. Geo distance aggregations
          5. Geohash grid aggregation
          6. Global aggregation
          7. Significant terms aggregation
            1. Choosing significant terms
            2. Multiple value analysis
          8. Sampler aggregation
          9. Children aggregation
          10. Nested aggregation
          11. Reverse nested aggregation
          12. Nesting aggregations and ordering buckets
            1. Buckets ordering
        3. Pipeline aggregations
          1. Available types
          2. Referencing other aggregations
          3. Gaps in the data
          4. Pipeline aggregation types
            1. Min, max, sum, and average bucket aggregations
            2. Cumulative sum aggregation
            3. Bucket selector aggregation
            4. Bucket script aggregation
            5. Serial differencing aggregation
            6. Derivative aggregation
            7. Moving avg aggregation
              1. Predicting future buckets
              2. The models
        4. Summary
      15. 8. Beyond Full-text Searching
        1. Percolator
          1. The index
          2. Percolator preparation
          3. Getting deeper
            1. Controlling the size of returned results
            2. Percolator and score calculation
            3. Combining percolators with other functionalities
          4. Getting the number of matching queries
          5. Indexed document percolation
        2. Elasticsearch spatial capabilities
          1. Mapping preparation for spatial searches
          2. Example data
            1. Additional geo_field properties
          3. Sample queries
            1. Distance-based sorting
            2. Bounding box filtering
            3. Limiting the distance
          4. Arbitrary geo shapes
            1. Point
            2. Envelope
            3. Polygon
            4. Multipolygon
            5. An example usage
            6. Storing shapes in the index
        3. Using suggesters
          1. Available suggester types
          2. Including suggestions
            1. Suggester response
          3. Term suggester
            1. Term suggester configuration options
            2. Additional term suggester options
          4. Phrase suggester
            1. Configuration
          5. Completion suggester
            1. Indexing data
            2. Querying indexed completion suggester data
            3. Custom weights
            4. Context suggester
              1. Context types
              2. Using context
              3. Using the geo location context
        4. The Scroll API
          1. Problem definition
          2. Scrolling to the rescue
        5. Summary
      16. 9. Elasticsearch Cluster in Detail
        1. Understanding node discovery
          1. Discovery types
          2. Node roles
            1. Master node
            2. Data node
            3. Client node
            4. Configuring node roles
          3. Setting the cluster's name
          4. Zen discovery
            1. Master election configuration
            2. Configuring unicast
            3. Fault detection ping settings
            4. Cluster state updates control
            5. Dealing with master unavailability
          5. Adjusting HTTP transport settings
            1. Disabling HTTP
            2. HTTP port
            3. HTTP host
        2. The gateway and recovery modules
          1. The gateway
          2. Recovery control
            1. Additional gateway recovery options
            2. Indices recovery API
            3. Delayed allocation
            4. Index recovery prioritization
        3. Templates and dynamic templates
          1. Templates
            1. An example of a template
          2. Dynamic templates
            1. The matching pattern
            2. Field definitions
        4. Elasticsearch plugins
          1. The basics
          2. Installing plugins
          3. Removing plugins
        5. Elasticsearch caches
          1. Fielddata cache
            1. Fielddata size
            2. Circuit breakers
          2. Fielddata and doc values
          3. Shard request cache
            1. Enabling and configuring the shard request cache
            2. Per request shard request cache disabling
            3. Shard request cache usage monitoring
          4. Node query cache
          5. Indexing buffers
          6. When caches should be avoided
        6. The update settings API
          1. The cluster settings API
          2. The indices settings API
        7. Summary
      17. 10. Administrating Your Cluster
        1. Elasticsearch time machine
          1. Creating a snapshot repository
          2. Creating snapshots
            1. Additional parameters
          3. Restoring a snapshot
          4. Cleaning up – deleting old snapshots
        2. Monitoring your cluster's state and health
          1. Cluster health API
            1. Controlling information details
            2. Additional parameters
          2. Indices stats API
            1. Docs
            2. Store
            3. Indexing, get, and search
            4. Additional information
          3. Nodes info API
            1. Returned information
          4. Nodes stats API
          5. Cluster state API
          6. Cluster stats API
          7. Pending tasks API
          8. Indices recovery API
          9. Indices shard stores API
          10. Indices segments API
        3. Controlling the shard and replica allocation
          1. Explicitly controlling allocation
            1. Specifying node parameters
            2. Configuration
            3. Index creation
            4. Excluding nodes from allocation
            5. Requiring node attributes
            6. Using the IP address for shard allocation
            7. Disk-based shard allocation
              1. Configuring disk based shard allocation
              2. Disabling disk based shard allocation
          2. The number of shards and replicas per node
          3. Allocation throttling
          4. Cluster-wide allocation
            1. Allocation awareness
            2. Forcing allocation awareness
            3. Filtering
              1. What do include, exclude, and require mean
          5. Manually moving shards and replicas
            1. Moving shards
            2. Canceling shard allocation
            3. Forcing shard allocation
            4. Multiple commands per HTTP request
            5. Allowing operations on primary shards
          6. Handling rolling restarts
        4. Controlling cluster rebalancing
          1. Understanding rebalance
          2. Cluster being ready
          3. The cluster rebalance settings
            1. Controlling when rebalancing will be allowed
            2. Controlling the number of shards being moved between nodes concurrently
            3. Controlling which shards may be rebalanced
        5. The Cat API
          1. The basics
          2. Using Cat API
            1. Common arguments
          3. The examples
            1. Getting information about the master node
            2. Getting information about the nodes
            3. Retrieving recovery information for an index
        6. Warming up
          1. Defining a new warming query
          2. Retrieving the defined warming queries
          3. Deleting a warming query
          4. Disabling the warming up functionality
          5. Choosing queries for warming
        7. Index aliasing and using it to simplify your everyday work
          1. An alias
          2. Creating an alias
          3. Modifying aliases
          4. Combining commands
          5. Retrieving aliases
          6. Removing aliases
          7. Filtering aliases
          8. Aliases and routing
          9. Zero downtime reindexing and aliases
        8. Summary
      18. 11. Scaling by Example
        1. Hardware
          1. Physical servers or a cloud
          2. CPU
          3. RAM memory
          4. Mass storage
          5. The network
          6. How many servers
          7. Cost cutting
        2. Preparing a single Elasticsearch node
          1. The general preparations
            1. Avoiding swapping
            2. File descriptors
            3. Virtual memory
          2. The memory
          3. Field data cache and breaking the circuit
          4. Use doc values
          5. RAM buffer for indexing
          6. Index refresh rate
          7. Thread pools
        3. Horizontal expansion
          1. Automatically creating the replicas
          2. Redundancy and high availability
          3. Cost and performance flexibility
          4. Continuous upgrades
          5. Multiple Elasticsearch instances on a single physical machine
            1. Preventing a shard and its replicas from being on the same node
          6. Designated node roles for larger clusters
            1. Query aggregator nodes
            2. Data nodes
            3. Master eligible nodes
        4. Preparing the cluster for high indexing and querying throughput
          1. Indexing related advice
            1. Index refresh rate
            2. Thread pools tuning
            3. Automatic store throttling
            4. Handling time-based data
            5. Multiple data paths
            6. Data distribution
            7. Bulk indexing
            8. RAM buffer for indexing
          2. Advice for high query rate scenarios
            1. Shard request cache
            2. Think about the queries
            3. Parallelize your queries
            4. Field data cache and breaking the circuit
            5. Keep size and shard size under control
        5. Monitoring
          1. Elasticsearch HQ
          2. Marvel
          3. SPM for Elasticsearch
        6. Summary
      19. Index