Elasticsearch is the most popular, open-source, cross-platform, distributed and scalable search-engine based on Lucene. It is written in Java and released under the terms of the Apache License. Elasticsearch is developed alongside Logstash and Kibana. Logstash is a data-collection and log-parsing engine while Kibana is an analytics and visualization platform. The three products combined are referred to as the Elastic Stack (formerly known as the ELK stack). They are developed to provide integrated solutions and are designed to be used together. The data stored in Elasticsearch is in the form schema-less JSON documents; similar to NO-SQL databases. You can communicate with the Elasticsearch server through an HTTP REST API; the response generated will be in the form of JSON object. Elasticsearch is designed to take chunks of big-data from different sources, analyze it and search through it. It is optimized to work well with huge data set; the searches happen very quickly i.e., almost near real-time!
To understand Elasticsearch in detail; we need to understand its core concepts and terminologies. We will go through each one of them in brief:
In Elasticsearch the data is distributed _and stored in different clusters. So when a change is made on an _index it may not be readily available; a latency of 1-2 seconds is expected! Contrary to this, when a change is made, it is propagated instantly in a relational database as they are deployed on a single machine. We can live with this slight delay as it is due to it’s distributed architecture; it is required to make it scalable and robust. At the end of this post; you will get the clear picture of what happens internally and why this latency is expected!
A cluster is a group of Elasticsearch servers. These servers are called nodes. Depending upon the use-case and scalability preferences; a cluster can have any number of nodes. Each node is identified by a unique name; all the data is distributed amongst these nodes which in turn is grouped into different clusters. A cluster allows you to index and search the stored data.
A node is a server; it is a single unit of the cluster. If it is a single node cluster; all data will be stored in that single node; else the data will be distributed amongst n nodes which are the part of that cluster. Nodes participate in a cluster’s search and indexing capabilities. Depending upon the type of query fired; they will collaborate and will return the matching response.
An index is a collection or grouping of documents; the index has a property called type. In relational database terms; an index is something like database and type is something like a table. This comparison may not be always true because it very much depends on how you design your cluster, but in most cases, it will hold true. Any number of indexes can be defined; just like node and cluster, an index is identified by a unique name; this name should be in lower-case.
Type is a category or a class of similar documents; as explained in the above paragraph; it comes close to a table in relational database terms. It consists of a unique name and mapping. Whenever you query an index; Elasticsearch reads _type from a metadata file and applies a filter on this field; Lucene internally has no idea of what the type is! An index can have any number of types, and types can have their own mapping!
Mapping is somewhat similar to the schema of a relational database. It is not mandatory to define mapping explicitly; if a mapping is not provided Elasticsearch will add it dynamically based on its data when the document is added. Mapping generally describes the field and its datatype in a document. It also includes information on how to index and store fields by Lucene.
A document is the smallest and most basic unit of information that can be indexed. It consists of key-value pairs; the values can of datatype string, number, date, object etc. An index can have any number of documents stored within it; in object-oriented terms, a document is something like an object. It is in the form of JSON. In relational database terms; a document can be thought of as a single row of a table.
An index can be divided into multiple independent sub-indexes; these sub-indexes are fully functional on their own and are called shards. They are useful when an index needs to have more data than the hardware capability of a node(server) supports; for example, 800 GB data on 500 GB disk! Sharding allows you to horizontally scale by volume and space; it enhances the performance of a cluster by running parallel operations and distributing loads across different shards. By default; Elasticsearch adds 5 primary shards for an index. This can be manually configured to suit your requirements.
A replica is a copy of an individual shard. Elasticsearch creates a replica for each shard; the replica and the original shard never reside on the same node.
This image is downloaded from google; copyright infringement is not intended.
Replica comes into picture when nodes in a cluster fail, shards in a node fail or a spike in read-throughput _is encountered; replica promises the _high availability of the data in such situations. When a write query is fired; the original shard is updated first and then the replicas are updated with some overlying latency. But read queries can run in parallel across replicas; this will improve the performance of read operations overall. By default; a single copy of each primary shard is created, but a shard can have more than one replicas in some special cases.
It is a unique type of data structure that Lucene uses to make huge dataset readily searchable. Inverted-Index is a set of words, phrases or tokens associated with different documents to allow full-text-search. In simple terms, an inverted index is something like an appendix page at the end of the book; it will have mappings of words to documents.
This image is downloaded from google; copyright infringement is not intended.
Each shard consists of multiple segments; these segments are nothing but inverted-indexes; which will search in parallel, get results and combine them in the final output for that particular shard.
Visual representation of Internal ES Architecture
As and when the documents are indexed; Elasticsearch writes it to new segments, refreshes the search data and updates transaction logs. This happens very frequently to make data in new segment visible to all queries. Elasticsearch is not meant for updates and delete, so if data needs to be deleted or updated it actually just marks the old document as deleted, and indexes a new document. The merge process also expunges these old deleted documents. Elasticsearch constantly merges similar segments into common big segments in the background; querying too many small segments is not very optimum. After the bigger segment is written; the smaller segments are dropped and log files are again updated to reflect new changes.
It may seem complicated.. But
You don’t have to deal with the internal working of Lucene and ElasticSearch as it is abstracted; you just have to configure clusters with the right number of nodes and create indexes with appropriate mappings! Everything else is done internally. Several organizations like IBM, AWS, Searchly, Elastic Cloud etc., offer Elasticsearch as a managed service; so you don’t have to worry about managing servers, doing deployments, taking backups etc. It will take care of these things for you to save your time and effort to operate these servers. This post was meant to cover basics of ElasticSearch and a brief idea of how it works internally. I hope that I have done justice to it. In my next post; I aim to cover ‘How to query on Elasticsearch index using Kibana?’.