Semantic Search with Elasticsearch in .NET

Published at: 10/29/2024
Categories: dotnet, elasticsearch
Author: nikiforovall

TL;DR

In this post, we will explore how to perform Semantic Search in .NET.

Source code: https://github.com/NikiforovAll/elasticsearch-dotnet-playground/blob/main/src/elasticsearch-getting-started/00-quick-start.ipynb

Introduction

Semantic search is a technique that improves search accuracy by interpreting the meaning of a query rather than its literal wording. Unlike traditional keyword-based search, which matches exact words, semantic search aims to capture the intent and context behind them, which tends to surface more relevant results for the user.

Getting Started

I’ve prepared a Jupyter notebook that demonstrates how to perform a semantic search using the Elastic.Clients.Elasticsearch NuGet package. The source code is linked in the TL;DR above.

📝 Below, I will guide you through the main steps of the notebook:

  1. Initialize the Elasticsearch Client
  2. Generate Embeddings
  3. Index Data
  4. Making queries

Initialize the Elasticsearch Client

We can use Testcontainers to run Elasticsearch from the notebook. Here is how you can do it:

// Requires the Testcontainers.Elasticsearch NuGet package.
using Testcontainers.Elasticsearch;

var elasticsearchContainer = new ElasticsearchBuilder()
    .WithPortBinding(9200, 9200) // HTTP API
    .WithPortBinding(9300, 9300) // transport port
    .WithReuse(true)             // reuse an already running container across notebook runs
    .Build();
await elasticsearchContainer.StartAsync();
var connectionString = elasticsearchContainer.GetConnectionString(); // e.g. https://elastic:<password>@<host>:9200/

Now, we can initialize the Elasticsearch client:

using Elastic.Clients.Elasticsearch;
using Elastic.Transport;

var elasticSettings = new ElasticsearchClientSettings(new Uri(connectionString))
    .DisableDirectStreaming() // keep request/response bytes available for debugging
    .ServerCertificateValidationCallback(CertificateValidations.AllowAll); // local development only

var client = new ElasticsearchClient(elasticSettings);

Let’s see if it works:

var info = await client.InfoAsync();

DumpResponse(info);

And here is the output:

{
  "name": "35937efa7867",
  "cluster_name": "docker-cluster",
  "cluster_uuid": "IZOZjoDyRpKHFN1sNGjs1g",
  "version": {
    "number": "8.6.1",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "180c9830da956993e59e2cd70eb32b5e383ea42c",
    "build_date": "2023-01-24T21:35:11.506992272Z",
    "build_snapshot": false,
    "lucene_version": "9.4.2",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  },
  "tagline": "You Know, for Search"
}

🙌 Everything looks good so far. Let’s continue and see how to generate embeddings.

Generate Embeddings

Embeddings are a type of representation for text where words, phrases, or even entire documents are mapped to vectors of real numbers. These vectors capture the semantic meaning of the text, allowing for more nuanced and context-aware comparisons between different pieces of text.

Traditional keyword-based search might not recognize “car” and “automobile” as related, but a good embedding model maps these words to nearby vectors, so a search for one can surface documents about the other.
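
To make “similar vectors” concrete, here is a minimal sketch of cosine similarity, the measure used later in this post to compare embeddings. The three-dimensional vectors are made-up toy values chosen for illustration, not real model output; real embeddings have hundreds of dimensions.

```csharp
using System;

// Cosine similarity: dot product divided by the product of the vector lengths.
// The result is in [-1, 1]; values close to 1 mean the vectors point the same way.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical toy "embeddings" (assumption: invented for this example).
float[] car = { 0.9f, 0.1f, 0.0f };
float[] automobile = { 0.8f, 0.2f, 0.1f };
float[] banana = { 0.0f, 0.1f, 0.9f };

Console.WriteLine($"car vs automobile: {CosineSimilarity(car, automobile):F3}");
Console.WriteLine($"car vs banana:     {CosineSimilarity(car, banana):F3}");
```

With these toy values, the first pair scores close to 1 and the second close to 0, which is the behavior we hope a real embedding model shows for related and unrelated terms.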

We can use the Microsoft.Extensions.AI.OpenAI and Azure.AI.OpenAI NuGet packages to create an instance of IEmbeddingGenerator:

using System.ClientModel;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;

// Named openAIClient so it does not clash with the Elasticsearch `client` created earlier.
var openAIClient = new AzureOpenAIClient(new Uri(envs["AZURE_OPENAI_ENDPOINT"]), new ApiKeyCredential(envs["AZURE_OPENAI_APIKEY"]));

IEmbeddingGenerator<string, Embedding<float>> generator = openAIClient.AsEmbeddingGenerator(modelId: "text-embedding-3-small");

We can implement a ToEmbedding method to convert a string to an embedding:

// Vector size requested from the model; the index mapping below must use the same value.
var dimension = 384;

async Task<float[]> ToEmbedding(string text) {
    GeneratedEmbeddings<Embedding<float>> embeddings = await generator
        .GenerateAsync(text, new EmbeddingGenerationOptions{
            AdditionalProperties = new AdditionalPropertiesDictionary{
                {"dimensions", dimension}
            }
        });

    return embeddings.First().Vector.ToArray();
}

float[] embedding = await ToEmbedding("The quick brown fox jumps over the lazy dog");
display($"Dimensions length = {embedding.Length}");

Index Data

Assume we have a dataset with information about popular programming books. The data model can be defined as follows:

public class Book
{
    [JsonPropertyName("title")]
    public string Title { get; set; }

    [JsonPropertyName("summary")]
    public string Summary { get; set; }

    [JsonPropertyName("authors")]
    public List<string> Authors { get; set; }

    [JsonPropertyName("publish_date")]
    public DateTime publish_date { get; set; }

    [JsonPropertyName("num_reviews")]
    public int num_reviews { get; set; }

    [JsonPropertyName("publisher")]
    public string Publisher { get; set; }

    public float[] TitleVector { get; set; }
}

Now, we can create an index with the following mapping:

var indexDescriptor = new CreateIndexRequestDescriptor<Book>("book_index")
    .Mappings(m => m
        .Properties(pp => pp
            .Text(p => p.Title)
            .DenseVector(
                Infer.Property<Book>(p => p.TitleVector),
                d => d.Dims(dimension).Index(true).Similarity(DenseVectorSimilarity.Cosine))
            .Text(p => p.Summary)
            .Date(p => p.publish_date)
            .IntegerNumber(p => p.num_reviews)
            .Keyword(p => p.Publisher)
        )
    );

await client.Indices.CreateAsync<Book>(indexDescriptor);

Note that we are using the DenseVector type to store the embeddings, with Dims set to the same dimension value used when generating them. We also specify the Cosine similarity function to compare the vectors.

Let’s download the test data and calculate “Title” field embeddings:

using System.Net.Http.Json;

var http = new HttpClient();
var url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json";
var books = await http.GetFromJsonAsync<Book[]>(url);

// One embedding request per title; fine for a small dataset.
foreach (var book in books)
{
    book.TitleVector = await ToEmbedding(book.Title);
}
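
The loop above issues one embedding request per title. For larger datasets it may be worth sending several titles per request. Below is a hedged sketch of that batching idea. The `embedBatch` delegate and `FakeEmbedBatchAsync` are hypothetical stand-ins, not part of the notebook or any library. In the notebook, the natural thing to plug in would be the generator's `GenerateAsync` overload that accepts a collection of inputs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for a real batch embedding call (assumption):
// returns one 1-dimensional vector per input, holding the text length.
static Task<float[][]> FakeEmbedBatchAsync(IReadOnlyList<string> texts) =>
    Task.FromResult(texts.Select(t => new float[] { t.Length }).ToArray());

// Splits the inputs into batches and embeds each batch with a single call.
// The output order matches the input order.
static async Task<List<float[]>> EmbedInBatchesAsync(
    IEnumerable<string> texts,
    int batchSize,
    Func<IReadOnlyList<string>, Task<float[][]>> embedBatch)
{
    var result = new List<float[]>();
    foreach (var batch in texts.Chunk(batchSize)) // Enumerable.Chunk requires .NET 6+
    {
        result.AddRange(await embedBatch(batch));
    }
    return result;
}

var titles = new[] { "a", "bb", "ccc", "dddd", "eeeee" };
var vectors = await EmbedInBatchesAsync(titles, batchSize: 2, FakeEmbedBatchAsync);

Console.WriteLine($"Embedded {vectors.Count} titles in batches of 2");
```

The trade-off is fewer round trips at the cost of larger requests, so the batch size would need tuning against the provider's request limits.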

Now we can use the Bulk API to upload the data to Elasticsearch:

await client.BulkAsync("book_index", d => d.IndexMany<Book>(books, (bd, b) => bd.Index("book_index")));

Making queries

Let’s use the keyword search to see if we have relevant data indexed. For example, we can search for books that contain “JavaScript” in the title:

var searchResponse = await client.SearchAsync<Book>(s => s
    .Index("book_index")
    .Query(q => q.Match(m => m.Field(f => f.Title).Query("JavaScript")))
);

DumpRequest(searchResponse);
searchResponse.Documents.Select(x => x.Title).DisplayTable();

Semantic Search

🎯 We want to perform a semantic search for books that are similar to a given query. We embed the query and perform a search.

Let’s say we want to find “javascript books”. We can use kNN (k-nearest neighbors) search to find the 5 books whose title vectors are closest to the embedded searchQuery.

var searchQuery = "javascript books";
var queryEmbedding = await ToEmbedding(searchQuery);
var searchResponse = await client.SearchAsync<Book>(s => s
    .Index("book_index")
    .Knn(d => d
        .Field(f => f.TitleVector)
        .QueryVector(queryEmbedding)
        .k(5)
        .NumCandidates(100))
);

var threshold = 0.7;
searchResponse.Hits
    .Where(x => x.Score > threshold)
    .Select(x => new { x.Source.Title, x.Score })
    .DisplayTable();

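
A note on the 0.7 threshold used above: for kNN search on a dense_vector field with cosine similarity, Elasticsearch reports `_score` as `(1 + cosine) / 2`, so scores fall between 0 and 1 rather than between -1 and 1. The small sketch below converts between the two scales. The helper names are my own, introduced only for illustration.

```csharp
using System;

// Elasticsearch kNN score for cosine similarity: _score = (1 + cosine) / 2
static double CosineToScore(double cosine) => (1 + cosine) / 2;

// Inverse mapping: recover the raw cosine similarity from a hit's score.
static double ScoreToCosine(double score) => 2 * score - 1;

var threshold = 0.7;
Console.WriteLine($"score {threshold} corresponds to cosine ~{ScoreToCosine(threshold):F2}");
Console.WriteLine($"cosine 1.0 corresponds to score {CosineToScore(1.0):F2}");
```

In other words, keeping hits with a score above 0.7 keeps documents whose title vector has a cosine similarity of roughly 0.4 or more with the query vector. A suitable cut-off depends on the embedding model and the data, so treat 0.7 as a starting point.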

Semantic Search and Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:

  • Does this timestamp fall into the range 2015 to 2016?
  • Is the status field set to “published”?

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters in a bool query.

Learn more about filter context in the Elasticsearch docs.

The example below retrieves the top books that are similar to “javascript books” based on their title vectors and that have Addison-Wesley as their publisher.

var searchQuery = "javascript books";
var queryEmbedding = await ToEmbedding(searchQuery);
var searchResponse = await client.SearchAsync<Book>(s => s
    .Index("book_index")
    .Knn(d => d
        .Field(f => f.TitleVector)
        .QueryVector(queryEmbedding)
        .k(5)
        .NumCandidates(100)
        .Filter(f => f.Term(t => t.Field(p => p.Publisher).Value("addison-wesley"))) 
    )
);

searchResponse.Hits
    .Select(x => new { x.Source.Title, x.Score })
    .DisplayTable(); 

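
Conceptually, the filtered query above restricts the candidates to documents that match the filter and then returns the k nearest among them. The sketch below reproduces that idea in memory with an exact, brute-force search. Elasticsearch itself uses an approximate index and applies the filter during the search, so this is only an illustration of the semantics. All titles, publishers, and vectors are invented toy data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static double Cosine(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical documents with 2-dimensional toy "embeddings" (assumption).
var docs = new List<(string Title, string Publisher, float[] Vector)>
{
    ("Book A", "oreilly",         new[] { 0.9f, 0.1f }),
    ("Book B", "no starch press", new[] { 0.8f, 0.3f }),
    ("Book C", "addison-wesley",  new[] { 0.5f, 0.5f }),
    ("Book D", "addison-wesley",  new[] { 0.1f, 0.9f }),
};

float[] query = { 1.0f, 0.0f };
var k = 1;

var hits = docs
    .Where(d => d.Publisher == "addison-wesley")                       // filter: yes/no match, no effect on score
    .Select(d => (d.Title, Score: (1 + Cosine(query, d.Vector)) / 2))  // same score scale as Elasticsearch cosine kNN
    .OrderByDescending(h => h.Score)
    .Take(k)
    .ToList();

foreach (var hit in hits)
    Console.WriteLine($"{hit.Title}: {hit.Score:F3}");
```

Without the filter, "Book A" would be the nearest neighbor. With it, only the two Addison-Wesley entries compete and "Book C" is returned, which mirrors why the filtered search can return different books than the unfiltered one.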

Conclusion

🙌 I hope you found it helpful. If you have any questions, please feel free to reach out. If you’d like to support my work, a star on GitHub would be greatly appreciated! 🙏
