
Text-to-SQL: Creating Embeddings with Nebius AI Studio (part 1)

Published at
12/6/2024
Categories
nebius
rag
text2sql
llm
Author
Sophia Parafina

Nebius AI Studio is an inference service offering state-of-the-art open-source models, including text, embedding, and vision models. This is the first article in a series demonstrating how to build a text-to-SQL Retrieval Augmented Generation (RAG) system.

Embeddings are an essential part of machine learning and artificial intelligence systems. They convert high-dimensional data such as text, sound, and images into numerical vectors that capture the meaning and relationships of the data.
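To make this concrete, here's a toy illustration. The three-dimensional vectors below are invented purely for demonstration (real embeddings have thousands of dimensions), but they show the key property: vectors for related text end up close together, which is what makes similarity search possible.

import math

def cosine_similarity(a, b):
    # cosine similarity: close to 1.0 means similar direction/meaning
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# made-up 3-dimensional "embeddings" for illustration only
v_customers = [0.8, 0.1, 0.3]   # "customer contact information"
v_clients   = [0.7, 0.2, 0.3]   # "client details"
v_orders    = [0.1, 0.9, 0.2]   # "order line items"

print(cosine_similarity(v_customers, v_clients))  # high: similar meaning
print(cosine_similarity(v_customers, v_orders))   # lower: different meaning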

As a first step, we will use the Nebius AI Studio API to create embeddings for a database. We can use the Northwind Traders database, a sample database of a fictional company's transactions. This project uses PostgreSQL, and there's a GitHub repository with a script to create the database. The Data Definition Language (DDL) in that SQL script will populate the RAG database, and we can instantiate the database itself to test the SQL generated by the LLM.
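Once the script has run, a quick check confirms the tables exist. This is a sketch: it assumes the database is named northwind and connects as a local postgres user.

import psycopg

with psycopg.connect(dbname='northwind', user='postgres') as conn:
    rows = conn.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public' ORDER BY table_name"
    ).fetchall()
    for (table_name,) in rows:
        print(table_name)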

Converting SQL to Text

The first step is to create the text that will populate the vector database. We could embed the raw SQL directly, but adding context about the tables and columns improves retrieval.

Here is the DDL for the customers table.

CREATE TABLE customers (
    customer_id character varying(5) NOT NULL,
    company_name character varying(40) NOT NULL,
    contact_name character varying(30),
    contact_title character varying(30),
    address character varying(60),
    city character varying(15),
    region character varying(15),
    postal_code character varying(10),
    country character varying(15),
    phone character varying(24),
    fax character varying(24)
);

We can rewrite the SQL in markdown, providing context about the table and the columns.

Table 1: DB.NORTHWIND.CUSTOMERS (Stores customer information)

This table contains customer information.

- customer_id character varying(5) [Primary Key, NOT NULL] - Unique identifier for customers
- company_name character varying(40) [NOT NULL] - Company where the contact works
- contact_name character varying(30) - Name of the customer
- contact_title character varying(30) - Title of the customer
- address character varying(60) - Physical address of the customer
- city character varying(15) - City where customer is located
- region character varying(15) - Region where customer is located
- postal_code character varying(10) - Customer's postal code
- country character varying(15) - Country where customer is located
- phone character varying(24) - Customer's phone number
- fax character varying(24) - Customer's fax number

There are 11 Northwind tables annotated and converted to markdown this way.

Creating Embeddings

With the markdown data prepped, we can create vectors using the Nebius embedding models. You'll need an API key from Nebius: select API Keys in the menu bar and follow the instructions.

Nebius menu bar

Set the API key as an environment variable in Linux/macOS:

export NEBIUS_API_KEY="xxxxxxxxxxxxxxxx...xxxxxxxxx"

Set the API key as an environment variable in a Windows PowerShell session:

$env:NEBIUS_API_KEY="xxxxxxxxxxxxxxxx...xxxxxxxxx"

Let's dive into the code.

Create the database

Assuming that PostgreSQL is running, this function creates a new database called rag, adds the pgvector extension if needed, and creates a table called items.

If you use Postgres.app on macOS or a hosted provider, pgvector is pre-installed. See your database's installation instructions if needed.
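If you're not sure whether pgvector is available on your server, here's a quick check (a sketch, assuming a local postgres user):

import psycopg

conn = psycopg.connect(dbname='postgres', user='postgres')
row = conn.execute(
    "SELECT default_version FROM pg_available_extensions WHERE name = 'vector'"
).fetchone()
print('pgvector available:', row is not None)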

import psycopg
from pgvector.psycopg import register_vector

dimensions = 4096

def database_config():
    # create the rag database and enable the pgvector extension
    conn = psycopg.connect(dbname='postgres', user='postgres', autocommit=True)
    conn.execute('CREATE DATABASE rag')
    conn = psycopg.connect(dbname='rag', user='postgres', autocommit=True)
    conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
    register_vector(conn)

    # create the items table with a vector column sized to the model output
    conn.execute('DROP TABLE IF EXISTS items')
    conn.execute(f'CREATE TABLE items (id bigserial, chunk text, embedding vector({dimensions}))')
    return conn

Create vectors and database records

The embedding process takes the markdown documents and splits the text into smaller chunks. Each chunk is passed to the embedding service, which converts it into a vector. The vector and its chunk text are then combined into a record.

import os
from openai import OpenAI

API_KEY = os.environ.get('NEBIUS_API_KEY')

client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=API_KEY,
)

def create_embeddings(files, docs_dir):
    data = []
    for filename in files:
        file_path = os.path.join(docs_dir, filename)
        with open(file_path, 'r', errors="replace") as file:
            document_text = file.read()
            # split the document into chunks (helper sketched below)
            chunks = recursive_character_splitter_chunking(document_text)

            for chunk in chunks:
                # strip newlines and null bytes before embedding
                chunk = chunk.replace('\n', ' ')
                chunk = chunk.replace("\x00", "")
                embedding = create_vectors(chunk)
                data.append([chunk, embedding])

    return data

def create_vectors(text):
    embedding = client.embeddings.create(
        model="BAAI/bge-en-icl",
        input=text,
        dimensions=1024,  # the service returns the model's full 4096 dimensions regardless (see below)
        encoding_format="float",
    ).data[0].embedding
    return str(embedding)
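The chunking helper referenced above isn't shown in this article; here is a minimal sketch of one possible implementation (the chunk size and overlap defaults are assumptions, and a library splitter such as LangChain's RecursiveCharacterTextSplitter would work just as well):

def recursive_character_splitter_chunking(text, chunk_size=1000, overlap=100):
    # split on paragraph boundaries first, then greedily pack paragraphs
    paragraphs = text.split('\n\n')
    chunks, current = [], ''
    for para in paragraphs:
        if len(current) + len(para) + 2 <= chunk_size:
            current = f'{current}\n\n{para}' if current else para
        else:
            if current:
                chunks.append(current)
            # hard-split any paragraph longer than chunk_size, with overlap
            while len(para) > chunk_size:
                chunks.append(para[:chunk_size])
                para = para[chunk_size - overlap:]
            current = para
    if current:
        chunks.append(current)
    return chunks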

Insert the embeddings

This function adds the embeddings to the items table.

def upsert(embeddings, conn):
    # insert each chunk and its vector into the items table
    for record in embeddings:
        chunk = str(record[0])
        vector = record[1]
        conn.execute('INSERT INTO items (chunk, embedding) VALUES (%s, %s)', (chunk, vector))
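Putting the pieces together, a driver might look like this (a sketch; the markdown directory name is an assumption):

if __name__ == '__main__':
    docs_dir = 'northwind_markdown'  # hypothetical folder of annotated .md files
    files = [f for f in os.listdir(docs_dir) if f.endswith('.md')]

    conn = database_config()
    embeddings = create_embeddings(files, docs_dir)
    upsert(embeddings, conn)
    print(f'Inserted {len(embeddings)} chunks')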

Typically, we would create an index to improve retrieval. However, the Nebius embedding service returns vectors at the model's full dimensionality, e.g., 4096 dimensions, and pgvector's HNSW index supports at most 2,000 dimensions for vector columns; larger vectors can only be indexed as bit vectors. While you can create a binary-quantized index, recall can be negatively impacted. I've added sample code to create the index, but given the size of the data, an index is not necessary.

# create index
# database_conn.execute('CREATE INDEX ON items USING hnsw ((binary_quantize(embedding)::bit(4096)) bit_hamming_ops);')
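Even without an index, we can run a quick sanity check by retrieving the chunks nearest to a test question using pgvector's <-> (L2 distance) operator. This sketch reuses conn and create_vectors from above; the question is just an example.

query_vec = create_vectors('Which table stores customer contact names?')
rows = conn.execute(
    'SELECT chunk FROM items ORDER BY embedding <-> %s::vector LIMIT 3',
    (query_vec,),
).fetchall()
for (chunk,) in rows:
    print(chunk[:80])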

Next steps

This article demonstrates how to create embeddings from the Northwind database DDL and insert them into PostgreSQL with the pgvector extension. The following article shows how to build the client that generates SQL from a prompt.
