ElasticSearch and Search Indexes🔍

Understand how search indexes improve search performance and scalability

Parvesh Saini

Introduction

In a large-scale system, designing efficient search functionality is crucial. Think about searching for the latest Iron Man action figure on Amazon. You type "Iron Man" into the search bar, and the system shows you many options. The search needs to be fast and accurate, returning every listing with "Iron Man" in the product description, no matter where the term appears.

The Limitations of Traditional Database Indexes

Databases traditionally struggle with text search. The first idea is usually to create an index on the text field being searched. A conventional index (typically a B-tree) keeps values in sorted order, so it can quickly find product descriptions that start with "Thor". However, "Thor" could appear anywhere in the description, not just at the start. A sorted index cannot help with that, so a substring match falls back to scanning every row.

The Power of Search Indexes

To overcome these limitations, we use search indexes, which are far better suited to text search. A search index works by tokenizing documents, that is, breaking text down into individual tokens. For example, a product titled "Avengers: Infinity War" might be tokenized into "avengers", "infinity", and "war". Tokenization typically involves converting text to lowercase, removing punctuation, and dropping common stop words such as "the" and "and". The index then maps each token to the documents that contain it, which is why it is called an inverted index.
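
The tokenization step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine implements its analyzers: the `tokenize` function and the tiny `STOP_WORDS` set are names made up for this example.

```python
import re

# Hypothetical stop-word list for illustration only;
# real analyzers use larger, configurable lists.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


print(tokenize("Avengers: Infinity War"))  # ['avengers', 'infinity', 'war']
```

Splitting on anything that is not a letter or digit handles the punctuation removal and the word splitting in one pass.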

Example:

Id	Movie
1	Thor Ragnarok
2	The Incredible Hulk
3	Thor Love and Thunder

The inverted index entry for the token "thor" maps it to the IDs of the movies that contain it:

thor: [1,3]
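
To make the mapping concrete, here is a minimal sketch that builds an inverted index for the three movies in the table. The `build_inverted_index` helper is a hypothetical name, and tokenization is simplified to lowercasing and splitting on whitespace (no stop-word removal).

```python
from collections import defaultdict

movies = {
    1: "Thor Ragnarok",
    2: "The Incredible Hulk",
    3: "Thor Love and Thunder",
}


def build_inverted_index(docs):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a token repeated within one title is recorded only once
        for token in set(docs[doc_id].lower().split()):
            index[token].append(doc_id)
    return dict(index)


index = build_inverted_index(movies)
print(index["thor"])  # [1, 3]
```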

Prefix Searching

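One advantage of inverted indexes is efficient prefix searching. When the tokens are kept in sorted order, all tokens that share a prefix sit next to each other. If we want to find documents with words starting with "spi", we can jump to the first token starting with "spi" and read forward, picking up tokens such as "spider-man" until the prefix no longer matches.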
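
A rough sketch of that idea, assuming the tokens are held in a plain sorted Python list (real engines use more compact structures for the term dictionary). The `prefix_search` function and the sample terms are invented for this example.

```python
from bisect import bisect_left


def prefix_search(sorted_terms, prefix):
    """Return every term in the sorted list that starts with the prefix."""
    # Binary search for the first term that is >= the prefix
    i = bisect_left(sorted_terms, prefix)
    matches = []
    # Matching terms are contiguous, so stop at the first mismatch
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


terms = sorted(["hulk", "spider-man", "spiral", "star", "thor"])
print(prefix_search(terms, "spi"))  # ['spider-man', 'spiral']
```

The binary search finds the starting point in logarithmic time, and the scan only touches terms that actually match.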

Suffix Searching

Suffix searching is another powerful feature. By creating a secondary inverted index with reversed tokens, we can search for terms by their suffix. For instance, to find all terms ending in "man," we reverse "man" to "nam" and search the reversed index.
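
The reversal trick can be sketched the same way. This assumes the reversed tokens are kept in a sorted list; `build_reversed_terms` and `suffix_search` are hypothetical helper names.

```python
from bisect import bisect_left


def build_reversed_terms(terms):
    """Secondary index: every term reversed, kept in sorted order."""
    return sorted(term[::-1] for term in terms)


def suffix_search(reversed_terms, suffix):
    """Find terms ending in `suffix` by prefix-searching the reversed list."""
    prefix = suffix[::-1]  # "man" becomes "nam"
    i = bisect_left(reversed_terms, prefix)
    matches = []
    while i < len(reversed_terms) and reversed_terms[i].startswith(prefix):
        matches.append(reversed_terms[i][::-1])  # flip back to the original term
        i += 1
    return matches


reversed_terms = build_reversed_terms(["iron-man", "spider-man", "thor", "hulk"])
print(suffix_search(reversed_terms, "man"))  # ['iron-man', 'spider-man']
```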

Leveraging Apache Lucene

Apache Lucene is a popular open-source library for building search features. It writes its index as immutable segments that are merged in the background, an approach similar to an LSM tree, which keeps both indexing and searching fast. Lucene handles prefix and suffix searches well, and it also supports more complex queries, such as fuzzy matching of words that are similar but not exactly the same. This makes Lucene super useful for creating powerful search tools that work well in all kinds of situations.

While Lucene is powerful, it is a library that runs on a single node. For larger-scale applications, distributing the search index across multiple nodes becomes essential.
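
To give a feel for what "similar but not exactly the same" means, here is a small sketch of matching by edit distance (the number of single-character insertions, deletions, and substitutions needed to turn one word into another). It is a brute-force illustration, not Lucene's actual implementation, and `edit_distance` and `fuzzy_search` are names made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_search(terms, query, max_edits=1):
    """Return the terms within max_edits of the query."""
    return [term for term in terms if edit_distance(term, query) <= max_edits]


print(fuzzy_search(["thor", "hulk", "thanos"], "thorr"))  # ['thor']
```

Comparing the query against every term is fine for a toy list, but a real engine needs something smarter to avoid scanning the whole term dictionary.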

What is ElasticSearch?

ElasticSearch is like a helpful assistant that makes using Lucene a lot easier. Lucene is great at complex search tasks, like finding things that start or end with certain words, or searching based on location. But Lucene can be a bit tricky to use on its own. That's where ElasticSearch comes in: it takes all the power of Lucene and wraps it up in a nice, easy-to-use package.

With ElasticSearch, you get a simple REST API that lets you easily interact with the search engine, manage your documents, and run all kinds of advanced queries. It handles all the complicated stuff behind the scenes, so you can focus on building your awesome search-powered application.
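
As a rough illustration of what talking to that REST API looks like, the sketch below builds (but does not send) a search request using only the Python standard library. The `_search` endpoint and the `match` query are part of ElasticSearch's API, while the host `http://localhost:9200`, the index name `movies`, the field name `title`, and the `build_search_request` helper are assumptions made for this example.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build a POST request for a full-text `match` query on one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local cluster, index name, and field name
request = build_search_request("http://localhost:9200", "movies", "title", "thor")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With a cluster actually running, the request could be sent with:
# response = urllib.request.urlopen(request)
```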

ElasticSearch distributes data across nodes and maintains a local index on each one. A local index means that each node indexes only its own subset of the documents, so adding or updating a document touches a single node instead of index entries spread across the cluster.

Local vs. Global Indexing

In a global index scenario, the index itself is partitioned by term: a single partition holds the complete list of matching documents for a given term, regardless of where those documents are stored. For example, if we were indexing Marvel movie titles, the entry for "Avengers" would live on one partition and reference every Avengers title in the cluster. Looking up a single term is fast, but writes become expensive as the data grows, because indexing one document may mean updating entries on many partitions.

Conversely, with a local index each partition indexes only the documents it stores, so a write touches a single partition. However, a search must query every partition and aggregate the results, which can increase latency.
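
The trade-off is easier to see in code. The sketch below is a toy model, not ElasticSearch's implementation: two in-memory partitions of made-up movie data, one set of indexes built locally per partition and one partitioned by term. The `owner` hash (CRC32 modulo the partition count) and all function names are assumptions for illustration.

```python
import zlib
from collections import defaultdict

# Hypothetical layout: documents split across two partitions
partitions = [
    {1: "Thor Ragnarok", 2: "The Incredible Hulk"},
    {3: "Thor Love and Thunder", 4: "Iron Man"},
]


def tokens(title):
    return set(title.lower().split())


# Local indexes: each partition indexes only the documents it stores
local_indexes = []
for docs in partitions:
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for token in tokens(title):
            index[token].add(doc_id)
    local_indexes.append(index)


def search_local(term):
    """Scatter the query to every partition, then gather the results."""
    hits = set()
    for index in local_indexes:
        hits |= index.get(term, set())
    return sorted(hits)


# Global index: each term's full posting list lives on one partition
def owner(term, num_partitions):
    return zlib.crc32(term.encode("utf-8")) % num_partitions


global_indexes = [defaultdict(set) for _ in partitions]
for docs in partitions:
    for doc_id, title in docs.items():
        for token in tokens(title):
            # One document may update index entries on several partitions
            global_indexes[owner(token, len(global_indexes))][token].add(doc_id)


def search_global(term):
    """A single lookup on the partition that owns the term."""
    index = global_indexes[owner(term, len(global_indexes))]
    return sorted(index.get(term, set()))


print(search_local("thor"))   # [1, 3]
print(search_global("thor"))  # [1, 3]
```

Both searches return the same answer. The difference is where the work lands: the local version visits every partition at read time, while the global version pays at write time by scattering each document's tokens across partitions.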

Note: ElasticSearch excels when searches can be confined to a single partition. For example, in a recommendation system for a streaming platform, each user's watch history can be stored in a single partition. Searches within a single user's watch history can then be handled efficiently on one node.
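
Here is a minimal sketch of that single-partition idea, assuming documents are routed by hashing the user ID. The CRC32 hash, the shard count, and the function names are stand-ins chosen for this example; ElasticSearch's real routing uses its own hash function and shard arithmetic.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count for this example


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Pick a shard deterministically from the routing key."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


# Each shard maps user_id -> list of watched titles
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def record_watch(user_id, title):
    shards[shard_for(user_id)][user_id].append(title)


def search_history(user_id, term):
    """Search one user's history; only the shard that owns the user is touched."""
    history = shards[shard_for(user_id)].get(user_id, [])
    return [title for title in history if term.lower() in title.lower().split()]


record_watch("user-42", "Thor Ragnarok")
record_watch("user-42", "Iron Man")
record_watch("user-7", "Thor Love and Thunder")

print(search_history("user-42", "thor"))  # ['Thor Ragnarok']
```

Because the same user ID always hashes to the same shard, the search never needs to fan out to the other shards.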

Concluding thoughts

In conclusion, ElasticSearch is a powerful tool that makes search features in large-scale systems much better. It builds on Lucene's inverted indexes and spreads them across nodes as local indexes, which helps it handle queries quickly and accurately. This makes it super useful for apps that need advanced search, like personalized recommendations in streaming or managing big product catalogs in e-commerce. As systems get bigger and more complex, being able to efficiently index and search data becomes crucial. That's why ElasticSearch is a key part of modern data architecture.

I hope you found this article useful. If you have any doubts or suggestions, feel free to ping me on LinkedIn. Your engagement is greatly appreciated. Happy coding :)