Elasticsearch Server(Third Edition)

Introduction to segment merging

In the Full text searching section of Chapter 1, Getting Started with Elasticsearch Cluster, we mentioned segments and their immutability. We wrote that the Lucene library, and thus Elasticsearch, writes data to structures that are written once and never changed. This allows for some simplifications, but also introduces the need for additional work. One such example is deletion. Because segments cannot be altered, information about deletions must be stored alongside them and dynamically applied during search. This is done by filtering deleted documents out of the returned result set. The other example is the inability to modify documents (however, some modifications are possible, such as modifying numeric doc values). Of course, one can say that Elasticsearch supports document updates (refer to the Manipulating data with the REST API section of Chapter 1, Getting Started with Elasticsearch Cluster). However, under the hood, the old document is marked as deleted and a new one with the updated contents is indexed.

As time passes and you continue to index or delete your data, more and more segments are created. Depending on how often you modify the index, Lucene creates segments with various numbers of documents, so segments differ in size. Because of that, search performance may be lower and your index may be larger than it should be, as it still contains the deleted documents. The equation is simple: the more segments your index has, the slower the search speed is. This is when segment merging comes into play. We don't want to describe this process in detail; in the current Elasticsearch version this part of the engine was simplified, but it is still a rather advanced topic. We decided to mention merging because it is handy to know where to look for the cause of trouble connected with too many open files, suspicious CPU usage, ever-expanding indices, or searching and indexing speed degrading over time.

Segment merging

Segment merging is the process during which the underlying Lucene library takes several segments and creates a new segment based on the information found in them. The resulting segment has all the documents stored in the original segments except the ones that were marked for deletion. After the merge operation, the source segments are deleted from the disk. Because segment merging is rather costly in terms of CPU and I/O usage, it is crucial to appropriately control when and how often this process is invoked.

The need for segment merging

You may ask yourself why you have to bother with segment merging. First of all, the more segments the index is built from, the slower the search will be and the more memory Lucene will use. The second reason is the disk space and resources, such as file descriptors, used by the index. If you delete many documents from your index, then until the merge happens those documents are only marked as deleted, not physically removed. So, it may happen that most of the documents that use our CPU and memory don't exist anymore! Fortunately, Elasticsearch uses reasonable defaults for segment merging and it is very probable that no changes are necessary.
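If you suspect that segment count is the source of such problems, you can inspect the segments of an index with the cat segments API. This is just a diagnostic sketch; the index name essb is an example and the request requires a running cluster:

```shell
# List the segments of the essb index, one row per segment and shard.
# The docs.deleted column shows how many documents are merely marked
# as deleted and still occupy disk space until a merge removes them.
curl -XGET 'localhost:9200/_cat/segments/essb?v'
```

Comparing the docs.count and docs.deleted columns tells you how much of the index is wasted on documents that will only disappear after a merge.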

The merge policy

The merge policy defines when the merging process should be performed. Elasticsearch merges segments of approximately similar sizes, taking into account the maximum number of segments allowed per tier. The merge algorithm tries to find the set of segments whose merge has the lowest cost and the biggest impact on the resulting index.

The basic properties of the tiered merge policy are as follows:

  • index.merge.policy.expunge_deletes_allowed: This property tells Elasticsearch to merge only segments in which the percentage of deleted documents is higher than this value. It defaults to 10.
  • index.merge.policy.floor_segment: This property defaults to 2mb and tells Elasticsearch to treat smaller segments as if their size were equal to the value of this property. It prevents a high number of tiny segments by making them eligible for merging.
  • index.merge.policy.max_merge_at_once: This property specifies the maximum number of segments that will be merged at once during normal merging. It defaults to 10.
  • index.merge.policy.max_merge_at_once_explicit: This property specifies the maximum number of segments that will be merged at once during expunge deletes or optimize operations. It defaults to 30.
  • index.merge.policy.max_merged_segment: This property specifies the maximum size of a segment that can be produced during normal merging. It defaults to 5gb.
  • index.merge.policy.segments_per_tier: This property defaults to 10 and roughly defines the allowed number of segments per tier. Smaller values mean more merging and fewer segments, which results in higher search speed but lower indexing speed and more I/O pressure. Higher values result in a higher segment count and thus slower search speed, but higher indexing speed.
  • index.merge.policy.reclaim_deletes_weight: This property tells Elasticsearch how important it is to choose segments with many deleted documents for merging. It defaults to 2.0.

    For example, to update the merge policy settings of an already created index, we could run a command like this:

    curl -XPUT 'localhost:9200/essb/_settings' -d '{
      "index.merge.policy.max_merged_segment" : "10gb"
    }'
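To see where max_merge_at_once_explicit and expunge_deletes_allowed come into play, you can explicitly ask Elasticsearch to merge away segments containing deleted documents. The index name essb is again only an example, and note that starting with Elasticsearch 2.1 the _optimize endpoint was renamed to _forcemerge:

```shell
# Request a merge that only expunges segments holding deleted documents,
# instead of fully merging the index down to a given number of segments.
curl -XPOST 'localhost:9200/essb/_optimize?only_expunge_deletes=true'
```

Such explicit merges are expensive, so they are best run during low-traffic periods rather than as routine maintenance.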

To get deeper into segment merging, refer to our book Mastering Elasticsearch Second Edition, published by Packt Publishing. You can also find more information about the tiered merge policy at https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-merge.html.

Note

Up to version 2.0 of Elasticsearch, we were able to choose between three merge policies: tiered, log_byte_size, and log_doc. The currently used merge policy is based on the tiered merge policy, and it is the only one available.

The merge scheduler

The merge scheduler tells Elasticsearch how the merge process should occur. The current implementation is based on the concurrent merge scheduler, which performs merges in separate threads, running up to the defined number of merges in parallel. Elasticsearch allows you to set the number of threads that can be used for simultaneous merging by using the index.merge.scheduler.max_thread_count property.
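For example, on machines with spinning disks, where concurrent merges compete for disk seeks, it is common to lower the thread count to 1. This is a sketch of what could be put in the elasticsearch.yml file; the value shown is an illustration, not a recommendation for every deployment:

```yaml
# Limit merging to a single thread per shard - a typical choice for
# spinning disks, where parallel merges cause seek contention.
# SSD-backed nodes usually keep the higher default.
index.merge.scheduler.max_thread_count: 1
```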

Throttling

As we have already mentioned, merging may be expensive when it comes to server resources. The merge process usually works in parallel to other operations, so theoretically it shouldn't have too much influence. In practice, though, the number of disk input/output operations can be large enough to significantly affect overall performance. In such cases, throttling may help. This feature can be used to limit the speed of merging, but it may also be applied to all operations using the data store. Throttling can be set in the Elasticsearch configuration file (the elasticsearch.yml file) or dynamically by using the settings API (refer to the The update settings API section of Chapter 9, Elasticsearch Cluster, for details). There are two settings that adjust throttling: type and value.

To set the throttling type, set the indices.store.throttle.type property, which allows us to use the following values:

  • none: This value defines that no throttling is on
  • merge: This value defines that throttling affects only the merge process
  • all: This value defines that throttling is used for all the data store activities

The second property, indices.store.throttle.max_bytes_per_sec, describes how much the throttling limits the I/O operations. As its name suggests, it tells us how many bytes can be processed per second. For example, let's look at the following configuration:

indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 10mb

In this example, we limit the merge operations to 10 megabytes per second. By default, Elasticsearch uses the merge throttling type with the max_bytes_per_sec property set to 20mb. This means that all the merge operations are limited to 20 megabytes per second.
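Because these are cluster-wide settings, they can also be changed at runtime through the cluster update settings API instead of editing elasticsearch.yml and restarting nodes. The 100mb value below is just an illustration for a cluster with fast disks:

```shell
# Dynamically raise the store-level throttling limit. A transient
# setting is lost after a full cluster restart; use "persistent"
# instead if the change should survive restarts.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "indices.store.throttle.type" : "merge",
    "indices.store.throttle.max_bytes_per_sec" : "100mb"
  }
}'
```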