The “force merge” process makes indexes searchable in Elasticsearch

It is well known that once Elasticsearch (ES) indexes a document, the document does not become searchable right away. For that to happen, a refresh must run: internally, it copies the documents to a new segment in the OS cache and opens that segment for searches. Once the index has been refreshed, all documents within it become searchable.

According to the ES documentation, the refresh process runs every second by default. It can also be triggered manually, or disabled entirely by setting the index’s “refresh_interval” setting to -1.
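For reference, a manual refresh is a single API call (shown here against the index_custom index that the experiment below creates):

curl -X POST "localhost:9200/index_custom/_refresh?pretty"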

A simple experiment reveals that there is another, undocumented, operation that can make an index searchable: the force merge.

Merging runs in the background and periodically combines smaller segments into larger ones. It is necessary because ES does not implement updates and deletes in place. For example, when a document is updated, ES creates a new version of the document with a higher version number and keeps the old one around; the next merge that runs keeps only the version with the highest version number out of the several versions of the same document.
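To see this bookkeeping in action (a quick aside, using a hypothetical index my_index that already holds a document with id 1), re-indexing the same id leaves the old version marked as deleted until a merge discards it:

curl -X PUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d'
{
    "name": "updated"
}
'
# docs.deleted in this output grows after updates/deletes and
# drops back once a merge discards the superseded versions
curl 'localhost:9200/_cat/indices?v'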

Let’s set up an experiment and check how the merge process affects the searchability of the index.

First, let’s add an index:

curl -X PUT "localhost:9200/index_custom?pretty"

Let’s then make sure that the refresh is disabled by setting the value of “refresh_interval” to -1:

curl -X PUT "localhost:9200/index_custom/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "refresh_interval": "-1"
  }
}
'
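To double-check that the setting took effect, one can read it back (the response should show "refresh_interval" : "-1"):

curl -X GET "localhost:9200/index_custom/_settings?pretty"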

Now, let’s add a couple of documents to the index (with ids 1 and 2 respectively):

curl -X PUT "localhost:9200/index_custom/_doc/1?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{     
    "id": "kimchy",
    "name": "kimchy1"
}
'
curl -X PUT "localhost:9200/index_custom/_doc/2?timeout=5m&pretty" -H 'Content-Type: application/json' -d'
{     
    "id": "kimchy",
    "name": "kimchy2"
}
'

Let’s now flush the index to disk to ensure that segments have been created:

curl -X POST "localhost:9200/index_custom/_flush?pretty"

Confirm the number of documents in the index (docs.count):

curl 'localhost:9200/_cat/indices?v'
health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index_custom       jqTFVQm0R_u9bL2kIGQrUA   1   1          2            0      4.8kb           225b

Let’s now search for a document

curl -X GET "localhost:9200/index_custom/_search?pretty&pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
        "term" : { "id" : "kimchy" }
    }
}
'

and check the response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

As expected, no document is returned since the index has not been refreshed yet.

Let’s confirm this by checking the status of the index’s segments. Notice that the value of the “searchable” column is false for both segments.

curl -X GET "localhost:9200/_cat/segments?v=true&pretty"
index              shard prirep ip        segment generation docs.count docs.deleted  size size.memory committed searchable version compound
index_custom       0     r      localhost _0               0          2            0 4.3kb           0 true      false      9.0.0   true
index_custom       0     p      localhost _0               0          2            0 4.3kb           0 true      false      9.0.0   true

Finally, let’s manually invoke the merge process

curl -X POST "localhost:9200/index_custom/_forcemerge?pretty"

and rerun the query. This time the documents are returned, which means that the index has become searchable:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18232156,
    "hits" : [
      {
        "_index" : "index_custom",
        "_id" : "1",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy1"
        }
      },
      {
        "_index" : "index_custom",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "id" : "kimchy",
          "name" : "kimchy2"
        }
      }
    ]
  }
}

This can also be confirmed by double-checking the value of the “searchable” column in the list of segments:
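curl -X GET "localhost:9200/_cat/segments?v=true&pretty"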

Conclusion: The refresh process is not the only one that makes an index searchable. An index can also be made searchable by a force merge. This behavior is not documented.

On Node Addition in Apache Cassandra

The purpose of this post is not to reiterate the documentation but to show how simple adding a node to a running Cassandra cluster is. I will not pretend that this is everything that needs to be done: for example, the cluster will start streaming data to the newly added node, which is not always cheap and which must be taken care of once the node has joined the cluster. Nevertheless, the procedure itself is pretty straightforward, and the following is the minimum one must do to add a node to a running Cassandra cluster.
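As an aside, both of those follow-ups can be handled with standard nodetool commands (a sketch; the exact output varies by Cassandra version):

# on the new node, while it is joining: watch the incoming streams
nodetool netstats

# on each pre-existing node, once the new node is up and Normal (UN):
# reclaim the data those nodes no longer own
nodetool cleanup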

Everything below runs on my laptop using VirtualBox VMs. Here is a cluster of two nodes running on CentOS 7.9:

[centos@centos7 cassandra]$ nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.107  292.99 KiB  1            100.0%            c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  192.168.1.108  218.35 KiB  1            100.0%            de35416a-256a-445b-ab8d-c6a03da29eee  rack1

The cluster is up and running. What we want to do now is add a third node. Assuming the machine has been prepared, the steps are:
1. Modify the cassandra.yaml file.
2. Start the instance.

That’s it! As simple as that. The minimum that must be changed in the cassandra.yaml file is the seeds and listen_address sections. A seed node is not a special node: it can be any node(s) from the cluster which the joining node contacts in order to learn cluster information. listen_address is the address that nodes use to communicate with each other.

The new node’s IP address is 192.168.1.109, and 192.168.1.108 is specified as its seed. Here is how the seeds section looks after the change:

# Addresses of hosts that are deemed contact points.
# Database nodes use this list of hosts to find each other and learn
# the topology of the ring. You _must_ change this if you are running
# multiple nodes!
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.1.108"

and here is how I changed the listen_address section:

# Address or interface to bind to and tell other nodes to connect to.
# You _must_ change this address or interface to enable multiple nodes to communicate!
#
# Set listen_address OR listen_interface, not both.
#
# When not set (blank), InetAddress.getLocalHost() is used. This
# will always do the Right Thing _if_ the node is properly configured
# (hostname, name resolution, etc), and the Right Thing is to use the
# address associated with the hostname (it might not be).
#
# Setting listen_address to 0.0.0.0 is always wrong.
#
listen_address: 192.168.1.109
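With the configuration in place, the instance can be started. On a package-based CentOS install that is typically done via systemd (the exact service name and method depend on how Cassandra was installed):

sudo systemctl start cassandra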

Once the node has started and joined the ring, nodetool status shows that the cluster now comprises three nodes:

[centos@centos7 ~]$ nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  192.168.1.107  306.67 KiB  1            35.7%             c9504a89-653b-4e9d-a289-3444fc608a3b  rack1
UN  192.168.1.108  281.72 KiB  1            85.4%             de35416a-256a-445b-ab8d-c6a03da29eee  rack1
UN  192.168.1.109  326 KiB    1            78.9%             88965b8c-1ba0-4d70-a034-616001795044  rack1

Node addition is complete.