Building a vector-based search engine using Amazon Bedrock and Amazon OpenSearch Service
Hi everyone,
If you are wondering what a vector is, the definitions are below.
Note: The code used in this article is available at the link provided at the end.
Terminology:
Vector: In machine learning, a vector typically refers to an array of numbers or data points. Algorithms use these vectors for tasks such as classification and regression, which lets the machine process the data effectively.
Embedding: An embedding is also a type of vector. Embeddings are mostly used in Natural Language Processing, where words or phrases are converted into vectors in a way that captures their semantic meaning.
Problem: Suppose you want a search engine that returns content similar or related to the search query the user entered, or you want to recommend content based on what the user is currently watching. Embeddings make both possible, and we can build such a search engine with Amazon Bedrock and Amazon OpenSearch Service.
Amazon Bedrock: This service provides machine learning models built by AWS and other leading AI companies such as AI21 Labs, Anthropic, Cohere, and Meta. With these models we can build ML applications without much machine learning expertise.
Amazon OpenSearch Service: OpenSearch is an open-source search engine that AWS forked from Elasticsearch in 2021. Amazon OpenSearch Service is the managed AWS search service built on it. It works almost the same way as Elasticsearch.
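To make the idea of "similar" vectors concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their names are made up for illustration; real Titan embeddings are much longer, and cosine similarity is only one common way to compare them.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only).
crime_drama = [0.9, 0.1, 0.0]
detective_story = [0.8, 0.2, 0.1]
cooking_show = [0.0, 0.1, 0.9]

print(cosine_similarity(crime_drama, detective_story))  # close to 1
print(cosine_similarity(crime_drama, cooking_show))     # close to 0
```

Vectors that point in nearly the same direction score close to 1, which is how related pieces of content end up near each other.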
Let’s build the search engine
Process:
Step 1: Create an OpenSearch cluster from the AWS console
Step 2: Generate embeddings for the content using the **amazon.titan-embed-text-v1** model
Step 3: Store all the embeddings in an OpenSearch index
Step 4: Query the vector data using OpenSearch queries
Create an OpenSearch Cluster:
- Open the OpenSearch Service page from the AWS console search bar
- Click on Create domain
- Give the domain a name, select Standard create, and select the Dev/test template (for production usage, use the Production template)
- For the deployment options, select Domain without standby for now (for production, deploy across multiple AZs for higher availability)
- Select t3.small.search as the instance type for testing. For production, choose larger instances. One node is enough for us right now, and 10 GB of EBS storage, the minimum, is sufficient
- For the network, deploy with public access. If we deploy inside a VPC, we need an EC2 instance to access the Kibana dashboard (called OpenSearch Dashboards on OpenSearch engine versions). For production, deploying inside a VPC is a good choice
- For fine-grained access control, create a master user
- In the access policy, change Deny to Allow for now. Tighten it later to suit your needs
- Review the final summary of the cluster configuration
- Click on Create and wait for the cluster to come online. Once the cluster is ready, the domain page shows the URLs for the Kibana dashboard and for the domain endpoint
- Open the Kibana dashboard using that URL and log in with the master user credentials created during setup
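For reference, the access policy step above ends up with a resource-based policy roughly like the one this sketch prints. The helper name `build_open_access_policy` and the account ID are placeholders I made up; the domain name `contents` is taken from the endpoint used later in this article. Treat it as an illustration of the policy shape rather than the exact text the console generates.

```python
import json

def build_open_access_policy(region, account_id, domain_name):
    """Build a resource-based policy that allows every principal to call the domain."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:*",
                "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
            }
        ],
    }

# 123456789012 is a placeholder account ID, not a real one.
policy = build_open_access_policy("us-east-1", "123456789012", "contents")
print(json.dumps(policy, indent=2))
```

With a policy this open, the master user from fine-grained access control is the only thing protecting the domain, so restrict the principal or add IP conditions before using it for anything beyond testing.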
Creating an index for storing embeddings:
- Open Dev Tools from the dashboard and run the query below to create the index
PUT contents
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "show_id": {
        "type": "text"
      },
      "title": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "description": {
        "type": "knn_vector",
        "dimension": 1536
      }
    }
  }
}
This request creates an index with three fields: show_id, title, and description. The dimension of 1536 matches the length of the vectors returned by amazon.titan-embed-text-v1.
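If you prefer to create the index from Python instead of Dev Tools, a sketch along these lines should work. The helper names `build_index_body` and `create_index` are my own, and `client` is assumed to be an opensearch-py `OpenSearch` client like the one created in the scripts later in this article.

```python
def build_index_body(vector_fields, dimension):
    """Return settings and mappings for a k-NN index.

    vector_fields: names of the fields that will hold embeddings.
    dimension: length of each embedding vector.
    """
    properties = {"show_id": {"type": "text"}}
    for field in vector_fields:
        properties[field] = {"type": "knn_vector", "dimension": dimension}
    return {
        "settings": {"index.knn": True},
        "mappings": {"properties": properties},
    }

def create_index(client, index_name, body):
    """Create the index unless it already exists (client is an opensearch-py client)."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=body)

# Same shape as the Dev Tools request: two 1536-dimensional vector fields.
body = build_index_body(["title", "description"], 1536)
print(body["mappings"]["properties"]["title"])
```

Calling `create_index(client, "contents", body)` would then send the same request as the Dev Tools query.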
Generating and Storing Embeddings:
The script below calls the Titan model to generate embeddings and then stores them in the index we created above.
For the data, I downloaded the Netflix titles dataset from Kaggle using the link below:
https://www.kaggle.com/datasets/padmapriyatr/netflix-titles
import json

import boto3
import botocore
import pandas as pd
from opensearchpy import OpenSearch
##opensearch configs
host = 'search-contents-oflzhkvsjgukdwvszyd5erztza.us-east-1.es.amazonaws.com'
port = 443
auth = ('admin', '*****')
index_name = "contents"
##creating opensearch client
client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True
)
##reading titles using pandas
df = pd.read_csv('netflix_titles.csv')
selected_columns = df[['show_id','title','description']]
refined_df = selected_columns.head(100)
##connecting to bedrock runtime
session = boto3.Session(region_name='us-east-1')
bedrock_client = session.client('bedrock-runtime')
##generate embedding using titan model
def generate_embedding(value):
    try:
        body = json.dumps({"inputText": value})
        modelId = "amazon.titan-embed-text-v1"
        accept = "application/json"
        contentType = "application/json"
        response = bedrock_client.invoke_model(
            body=body, modelId=modelId, accept=accept, contentType=contentType
        )
        ##the response body is a stream, so read it before parsing the JSON
        response_body = json.loads(response.get("body").read())
        return response_body
    except botocore.exceptions.ClientError as error:
        print(error)
        ##signal the failure to the caller explicitly
        return None
##creating a document to insert
def create_document(show_id,title,description):
    document = {
        'show_id':show_id,
        'title':title['embedding'],
        'description':description['embedding']
    }
    insert_document(document)
##inserting document into opensearch
def insert_document(document):
    client.index(index=index_name, body=document)
##iterating through each row in the data frame created through pandas and requesting embeddings
for index, row in refined_df.iterrows():
    show_id = row['show_id']
    title = row['title']
    description = row['description']
    title_embedding = generate_embedding(title)
    description_embedding = generate_embedding(description)
    ##generate_embedding returns None when the Bedrock call fails, so skip that row
    if title_embedding is None or description_embedding is None:
        print(f"skipped:{index}")
        continue
    create_document(show_id,title_embedding,description_embedding)
    print(f"inserted:{index}")
- Open the Kibana dashboard and click on Query Workbench in the side panel
- Run a SELECT query against the contents index to confirm that the embedding data was stored
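The ingestion script sends one request per document, which is fine for 100 rows. For larger datasets, batching the writes with the `helpers.bulk` function from opensearch-py is a common alternative. The sketch below is not part of my original script: `to_bulk_actions` and `bulk_insert` are hypothetical helper names, and the sample documents use tiny made-up vectors just to show the action shape.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk "index" action per document dict."""
    for document in documents:
        yield {"_index": index_name, "_source": document}

def bulk_insert(client, documents, index_name):
    """Send the documents in batched bulk requests; returns (success_count, errors)."""
    # Imported here so the rest of the sketch runs without opensearch-py installed.
    from opensearchpy import helpers
    return helpers.bulk(client, to_bulk_actions(documents, index_name))

# Tiny made-up documents with 3-dimensional vectors, only to show the action shape.
sample_docs = [
    {"show_id": "s1", "title": [0.1, 0.2, 0.3], "description": [0.3, 0.2, 0.1]},
    {"show_id": "s2", "title": [0.0, 0.5, 0.5], "description": [0.5, 0.5, 0.0]},
]
actions = list(to_bulk_actions(sample_docs, "contents"))
print(actions[0]["_index"], actions[0]["_source"]["show_id"])  # prints: contents s1
```

In the real script, the documents would be the dictionaries built in `create_document`, collected in a list and passed to `bulk_insert(client, documents, index_name)` once at the end.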
Querying to find similar content:
Here we write another Python script that runs a basic k-NN query against the stored embeddings.
import json

import boto3
import botocore
from opensearchpy import OpenSearch
##open search configs
host = 'search-contents-oflzhkvsjgukdwvszyd5erztza.us-east-1.es.amazonaws.com'
port = 443
auth = ('admin', '*******')
index_name = "contents"
##bedrock connection
session = boto3.Session(region_name='us-east-1')
bedrock_client = session.client('bedrock-runtime')
##creating opensearch client
client = OpenSearch(
    hosts=[{'host': host, 'port': port}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True
)
#requesting user for input query
input_query = input("Enter search string:")
#generating embedding for user input and running the query with it
def generate_embedding(value):
    try:
        body = json.dumps({"inputText": value})
        modelId = "amazon.titan-embed-text-v1"
        accept = "application/json"
        contentType = "application/json"
        response = bedrock_client.invoke_model(
            body=body, modelId=modelId, accept=accept, contentType=contentType
        )
        response_body = json.loads(response.get("body").read())
        #Titan returns the vector under the "embedding" key
        run_query(response_body['embedding'])
    except botocore.exceptions.ClientError as error:
        print(error)
#running a k-NN query with the user input embedding against the description field in the contents index
def run_query(query_embedding):
    query = {
        "size": 5,
        "_source": "show_id",
        "query": {
            "knn": {
                "description": {
                    "vector": query_embedding,
                    "k": 5
                }
            }
        }
    }
    response = client.search(index=index_name, body=query)
    print(response)
generate_embedding(input_query)
This program asks the user for a search string. Once the user enters it, we call the Titan model again to generate an embedding for that string.
The query in the script then compares the embedding of the user input with the embeddings stored in the index and returns the five closest matches.
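Conceptually, that comparison is a nearest-neighbor search. The sketch below shows a brute-force version in plain Python with made-up two-dimensional vectors. It is only an illustration: as far as I know, the OpenSearch k-NN plugin uses an approximate index instead of scanning every vector, and it defaults to Euclidean (L2) distance when the mapping sets no space type, which is the distance I assume here.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, stored, k):
    """Return the ids of the k stored vectors closest to the query."""
    ranked = sorted(stored.items(), key=lambda item: euclidean_distance(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

# Made-up 2-dimensional vectors keyed by show_id.
stored = {
    "s1": [0.9, 0.1],
    "s2": [0.8, 0.3],
    "s3": [0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.0], stored, k=2))  # ['s1', 's2']
```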
For now, I am querying only on the description column. You can modify the query to get better results.
- I entered "investigate" as the input, and it returned records that are closely related. You can see the result in the image above. Most of the results were crime and investigative thriller titles.
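One simple way to experiment with the query is to make the vector field a parameter, so you can search on title instead of description, and to print only the ids and scores instead of the raw response. This is a sketch under my own naming: `build_knn_query` and `extract_hits` are hypothetical helpers, and the sample response is hand-written with made-up ids and scores, shaped like the result of `client.search`.

```python
def build_knn_query(field, query_embedding, k=5):
    """Build the same k-NN search body as the script, for any vector field."""
    return {
        "size": k,
        "_source": "show_id",
        "query": {"knn": {field: {"vector": query_embedding, "k": k}}},
    }

def extract_hits(response):
    """Return (show_id, score) pairs from an OpenSearch search response."""
    return [
        (hit["_source"]["show_id"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]

# Hand-written response with made-up ids and scores, shaped like a search result.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 0.71, "_source": {"show_id": "s12"}},
            {"_score": 0.64, "_source": {"show_id": "s47"}},
        ]
    }
}
query = build_knn_query("title", [0.1, 0.2, 0.3])
print(list(query["query"]["knn"]))    # ['title']
print(extract_hits(sample_response))  # [('s12', 0.71), ('s47', 0.64)]
```

In the query script, `run_query` could call `client.search(index=index_name, body=build_knn_query("title", query_embedding))` and pass the response to `extract_hits`.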
That's it. We have built a vector search engine using Amazon Bedrock and Amazon OpenSearch Service.
If you have questions about the implementation, need help, or want to report a mistake, feel free to reach me through the comments. I am happy to help and open to suggestions.
Scripts code repo: https://github.com/shaiksalam9182/bedrock-scripts
Thanks :)