Introduction to Elasticsearch Database and Key Terminologies

Published at

1/16/2025

What is Elasticsearch?

Elasticsearch is a distributed database that stores data as JSON documents. A distributed database is one that can run on multiple nodes at the same time.

For example, suppose you have an Elasticsearch cluster running on four nodes. That means Elasticsearch is running on four servers, and the data is distributed across them. You can replicate data between nodes, and you can add more nodes to the cluster later. In other words, the database is horizontally scalable.

JSON Document Example:

{
  "name": "John Doe",
  "location": "New York",
  "age": 30,
  "email": "[email protected]"
}

As mentioned above, Elasticsearch stores data as JSON documents. The JSON document is the fundamental unit of data in Elasticsearch, much like a row in a database table. For example, a JSON document could hold a person's biodata: name, location, age, and email.
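To make the idea of a document concrete, here is a minimal Python sketch that builds the same kind of record as a dict and round-trips it through JSON text. It uses only the standard library and does not talk to Elasticsearch; the field values are illustrative placeholders.

```python
import json

# A document is a JSON object; in Python it maps naturally to a dict.
# The field values here are illustrative placeholders.
person = {
    "name": "John Doe",
    "location": "New York",
    "age": 30,
}

# Serialize to the JSON text that a client would send over HTTP.
payload = json.dumps(person)

# Parsing the JSON text gives back an equal dict, with types preserved.
restored = json.loads(payload)
print(restored["name"], restored["age"])
```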

Elasticsearch vs Relational Database

To compare Elasticsearch with a relational database such as Postgres or Oracle: in a relational database, you have tables made up of rows. For example, take an employees table with the columns name, location, age, and email. To store the data of three employees, you insert three rows into this table.

Relational Database Example:

CREATE TABLE employees (
    name VARCHAR(50),
    location VARCHAR(50),
    age INT,
    email VARCHAR(100)
);
INSERT INTO employees (name, location, age, email) VALUES 
('Alice', 'London', 25, '[email protected]'),
('Bob', 'Paris', 30, '[email protected]'),
('Charlie', 'Berlin', 35, '[email protected]');

Now let's do the same in Elasticsearch. Elasticsearch has something called an index. An index is a group of documents and plays roughly the role of a database table, while a document plays roughly the role of a row. To implement the employees table in Elasticsearch, you create an employees index and add each employee's data as a JSON document.

The first employee's data is stored in the first JSON document, the second employee's in the second, and the third employee's in the third. Together, these three JSON documents form an index, which is similar to a database table. This is how data is stored in Elasticsearch.

Elasticsearch Index Example:

POST /employees/_doc/1
{
  "name": "Alice",
  "location": "London",
  "age": 25,
  "email": "[email protected]"
}

POST /employees/_doc/2
{
  "name": "Bob",
  "location": "Paris",
  "age": 30,
  "email": "[email protected]"
}

POST /employees/_doc/3
{
  "name": "Charlie",
  "location": "Berlin",
  "age": 35,
  "email": "[email protected]"
}
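The table-to-index analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relational side uses an in-memory SQLite table, and the "index" is a plain dict keyed by document ID, standing in for Elasticsearch rather than calling it. The email column is left out to keep the sketch short.

```python
import json
import sqlite3

# Relational side: an in-memory table with three rows.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute("CREATE TABLE employees (name TEXT, location TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employees (name, location, age) VALUES (?, ?, ?)",
    [("Alice", "London", 25), ("Bob", "Paris", 30), ("Charlie", "Berlin", 35)],
)

# Document side: a toy "index" mapping document ID -> JSON-like dict.
# This dict is a stand-in for an Elasticsearch index, not a client call.
employees_index = {}
rows = conn.execute(
    "SELECT rowid AS id, name, location, age FROM employees ORDER BY rowid"
)
for row in rows:
    doc = {"name": row["name"], "location": row["location"], "age": row["age"]}
    employees_index[str(row["id"])] = doc

print(json.dumps(employees_index["1"]))
```

Each row becomes one document, and the set of documents as a whole plays the role of the table.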

Why Elasticsearch?

Why should we use Elasticsearch? One reason is that it’s a distributed database built to scale horizontally, which is usually much harder to achieve with a traditional relational database. Another is that Elasticsearch is designed for fast queries. The data is stored in a structure that is already organized for search, which is why queries stay fast even with huge amounts of data.

Inverted Index Example:

Suppose you have the following documents:

Document 1: "Welcome to Europe."
Document 2: "Paris is in France."
Document 3: "Europe includes France."

{
  "Europe": [1, 3],
  "France": [2, 3],
  "Paris": [2]
}

When searching for "Europe," Elasticsearch instantly retrieves document IDs [1, 3].
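As a rough illustration of how such a mapping could be built, the following Python sketch tokenizes the three example documents and records which document IDs contain each term. The lowercasing and the regex tokenizer are simplifying assumptions of this sketch, not a description of Elasticsearch's actual analyzers.

```python
import re
from collections import defaultdict

documents = {
    1: "Welcome to Europe.",
    2: "Paris is in France.",
    3: "Europe includes France.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

inverted = build_inverted_index(documents)
print(inverted["europe"])  # [1, 3]
print(inverted["france"])  # [2, 3]
print(inverted["paris"])   # [2]
```

A search for a term is then a single dictionary lookup instead of a scan over every document.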

Elasticsearch uses a data structure known as an inverted index. Let’s demonstrate with a simple example. Suppose you have a table of eight documents, each with an attribute called GeoScope ID. In a conventional approach, to search for “Europe” you would either scan all the documents or build some kind of index and search it, eventually getting 1, 2, and 7 as the IDs of the documents containing the word “Europe.”

Elasticsearch stores the data the other way around: each term is stored along with the list of documents that contain it. The index records, for example, that “Europe” is present in three particular documents and that “France” is present in one. When a search query for “Europe” comes in, Elasticsearch can look up the matching documents directly.

Let’s try another example. Suppose an index contains three documents made up of words. First, Elasticsearch tokenizes them: it splits the text into individual words (terms) and stores each unique term in the index. For the term “B,” for instance, the index records which documents contain it, how often it occurs in each, and at which positions. In the first document, “B” occurs twice, at the second and sixth positions. In the second document, it occurs only once, as the second word.

In this way, the data is stored in a form that is ready to be searched. If you want to find the occurrences of the word “B” in these documents, the answer is already stored in the index, and that is why queries are fast in Elasticsearch. In short, Elasticsearch uses a data structure called an inverted index, which is a major reason why it is fast.
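The frequency and position bookkeeping described above can be sketched as a small positional index in Python. The three documents below are hypothetical, chosen only so that “B” lands at the positions mentioned in the text; the whitespace tokenizer is also an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical documents chosen to match the description above.
documents = {
    1: "A B C D E B",
    2: "C B A",
    3: "B D",
}

def build_positional_index(docs):
    """Map term -> {doc_id: [1-based positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

positional = build_positional_index(documents)
postings_for_b = positional["B"]
frequencies = {doc_id: len(pos) for doc_id, pos in postings_for_b.items()}

print(postings_for_b)  # {1: [2, 6], 2: [2], 3: [1]}
print(frequencies)     # {1: 2, 2: 1, 3: 1}
```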

Terminology

Node

A server running an Elasticsearch instance is called a node. For example, a cluster with four nodes has four servers running Elasticsearch.

Index

An index is a group of documents in Elasticsearch. For example, an index called “employee_index” contains a group of employee documents.

POST /employee_index/_doc
{
  "name": "Diana",
  "location": "Rome",
  "age": 28,
  "email": "[email protected]"
}

Shard

A shard is a unit of an index. An index is really a logical concept; physically, it is implemented as a group of one or more shards. Each shard behaves like a self-contained index of its own. Shards, together with their replicas, are what allow Elasticsearch to provide high availability and redundancy.

PUT /employee_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Let’s say you configure an index with three shards and you have two nodes in the Elasticsearch cluster. Elasticsearch will distribute the shards among the nodes. When a query is made to the cluster, both nodes can work on it and return data. With multiple shards on multiple nodes, queries can run in parallel. The number of shards in an index is configurable, and an index can even have a single shard.

There are two types of shards: primary and replica. A replica shard is a copy of a primary shard. Replicas can serve search requests, which helps with parallel queries, and they provide redundancy if the node holding the primary fails.
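The following Python sketch gives a feel for how documents might be spread over shards and how shards might be spread over nodes. It is a deliberately simplified model: the node names are hypothetical, CRC32 is only a stand-in hash, and the round-robin placement is not Elasticsearch's real allocation logic. The one rule it does mirror is that a replica is kept on a different node from its primary.

```python
import zlib

NUMBER_OF_SHARDS = 3
NODES = ["node-1", "node-2"]  # hypothetical node names

def route_to_shard(doc_id, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a primary shard for a document ID.

    CRC32 is a stand-in hash for illustration; it is not the hash
    function Elasticsearch uses internally.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards

def place_shards(number_of_shards, nodes):
    """Round-robin primaries over nodes; put each replica on the next node."""
    placement = {}
    for shard in range(number_of_shards):
        primary = nodes[shard % len(nodes)]
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = {"primary": primary, "replica": replica}
    return placement

placement = place_shards(NUMBER_OF_SHARDS, NODES)
for doc_id in ["1", "2", "3"]:
    shard = route_to_shard(doc_id)
    print(doc_id, "-> shard", shard, "on", placement[shard]["primary"])
```

Because the shard is derived from the document ID, the same document always lands on the same shard, and a search can be fanned out to all shards in parallel.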

Index Template

An index template is like a blueprint for creating an index. When an index is created from an index template, the template supplies its settings, such as the number of shards, the data mapping, the priority of the index, and so on.

PUT /_index_template/employee_template
{
  "index_patterns": ["employee_*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "name": { "type": "text" },
        "location": { "type": "text" },
        "age": { "type": "integer" },
        "email": { "type": "keyword" }
      }
    }
  }
}
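How a template gets chosen for a new index name can be sketched in Python with simple wildcard matching. This is an in-memory illustration only: the second template and both priority values are assumptions added for the example, and `pick_template` is a hypothetical helper, not part of any Elasticsearch client.

```python
from fnmatch import fnmatchcase

# Hypothetical templates: name -> patterns, priority, and settings.
templates = {
    "employee_template": {
        "index_patterns": ["employee_*"],
        "priority": 100,
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    },
    "catch_all_template": {
        "index_patterns": ["*"],
        "priority": 0,
        "settings": {"number_of_shards": 2, "number_of_replicas": 0},
    },
}

def pick_template(index_name, all_templates):
    """Return the name of the highest-priority template matching index_name."""
    matches = [
        (template["priority"], name)
        for name, template in all_templates.items()
        if any(fnmatchcase(index_name, p) for p in template["index_patterns"])
    ]
    if not matches:
        return None
    return max(matches)[1]

print(pick_template("employee_2025", templates))  # employee_template
print(pick_template("orders", templates))         # catch_all_template
```

When several templates match the same name, the one with the highest priority wins, which is why the more specific template is given the larger number here.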

Index Alias

An alias is used to group Elasticsearch indices. For example, say you have three indices: “logs_dash1,” “logs_dash2,” and “logs_dash3.” You can create an alias that includes all indices matching a pattern like “logs_dash*.” You can then query the alias directly, and the query runs against all the indices in that group at once.

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "logs_dash1",
        "alias": "logs"
      }
    },
    {
      "add": {
        "index": "logs_dash2",
        "alias": "logs"
      }
    }
  ]
}
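The idea of an alias fanning out to several indices can be modeled in a few lines of Python. This is a simplified in-memory sketch: `resolve` is a hypothetical helper, and the index list and alias table are plain Python data rather than cluster state.

```python
from fnmatch import fnmatchcase

indices = ["logs_dash1", "logs_dash2", "logs_dash3", "employee_index"]

# Alias name -> index name patterns it covers (a simplified model).
aliases = {"logs": ["logs_dash*"]}

def resolve(target, all_indices, all_aliases):
    """Expand an alias into concrete index names; pass plain names through."""
    patterns = all_aliases.get(target, [target])
    return [
        name for name in all_indices
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]

print(resolve("logs", indices, aliases))
# ['logs_dash1', 'logs_dash2', 'logs_dash3']
print(resolve("employee_index", indices, aliases))
# ['employee_index']
```

A query sent to the alias would then run against every index in the resolved list.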

Data Stream

Data streams are designed for time-series data. For example, if you define a data stream called “logs,” you can insert data into it, and Elasticsearch will store the data in backing indices that it manages for you. Once a lifecycle threshold is reached, a new backing index is created. A data stream is an abstraction over a set of indices, which lets Elasticsearch split the data as needed and manage it efficiently. Note that creating a data stream requires a matching index template with data streams enabled.

PUT /_data_stream/logs

POST /logs/_doc
{
  "@timestamp": "2025-01-15T12:00:00Z",
  "message": "Log entry example"
}
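The rollover behaviour can be illustrated with a toy in-memory model in Python. Everything here is an assumption made for the sketch: the `ToyDataStream` class, the document-count threshold, and the backing index naming are illustrative and do not reproduce Elasticsearch's real lifecycle rules or index names.

```python
from datetime import datetime, timezone

class ToyDataStream:
    """In-memory model: writes go to the newest backing index,
    and a new backing index is started when the current one is full."""

    def __init__(self, name, max_docs_per_index):
        self.name = name
        self.max_docs = max_docs_per_index  # hypothetical rollover threshold
        self.backing_indices = []           # list of (index_name, docs)
        self._rollover()

    def _rollover(self):
        generation = len(self.backing_indices) + 1
        # Naming is illustrative only.
        index_name = f"{self.name}-backing-{generation:06d}"
        self.backing_indices.append((index_name, []))

    def append(self, message):
        """Store one timestamped document and return the index it went to."""
        index_name, docs = self.backing_indices[-1]
        if len(docs) >= self.max_docs:
            self._rollover()
            index_name, docs = self.backing_indices[-1]
        timestamp = datetime.now(timezone.utc).isoformat()
        docs.append({"@timestamp": timestamp, "message": message})
        return index_name

stream = ToyDataStream("logs", max_docs_per_index=2)
written_to = [stream.append(f"Log entry {i}") for i in range(5)]
print(written_to)
```

The caller only ever writes to the stream name; which backing index receives the document is decided behind the scenes.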

Conclusion

That’s an introduction to Elasticsearch and its key terminology. We covered what Elasticsearch is, the analogy to a relational database, the inverted index, and terms such as nodes, indices, shards, index templates, index aliases, and data streams.

Thank you for reading. Feel free to ask questions or post your valuable feedback in the comments section.
