dev-resources.site
for different kinds of informations.
Quick tip: Using R, OpenAI and SingleStore Notebooks
Update: As of September 2024, running R is no longer possible on the SingleStore Portal. I will find another way to show the integration between SingleStore and R.
Abstract
In this short article, we'll see how to create vector embeddings using R and OpenAI. We'll also store the embeddings in SingleStore. R is not officially supported by OpenAI, but a community library provides access to the API.
The notebook file used in this article is available on GitHub.
Create a SingleStore Cloud account
A previous article showed the steps to create a free SingleStore Cloud account. We'll use the following settings:
- Workspace Group Name: R Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: r-demo
- Size: S-00
Create a new notebook
From the left navigation pane in the cloud portal, we'll select DEVELOP > Data Studio.
In the top right of the web page, we'll select New Notebook > New Notebook, as shown in Figure 1.
We'll call the notebook r_openai_demo, select a Blank notebook template from the available options, and save it in the Personal location.
Create a database and table
In our SingleStore Cloud account, let's use the SQL Editor to create a new database and table. We'll call these r_demo
and articles
, respectively, as follows:
DROP DATABASE IF EXISTS r_demo;
CREATE DATABASE IF NOT EXISTS r_demo;
USE r_demo;
CREATE TABLE IF NOT EXISTS articles (
id SMALLINT,
text TEXT,
vector VECTOR(1536) NOT NULL
);
Fill out the notebook
First, let's install the R kernel and some other packages we need for this article:
!conda install -y --quiet -c conda-forge r-irkernel r-rjava r-rjdbc r-remotes
Next, we need to change the kernel. Refreshing the page will help the notebook detect any changes, including the installation of a new kernel.
In the top right, we can see that Python is currently selected, as shown in Figure 2.
Selecting Python 3 will present a box with a pull-down as shown in Figure 3.
Clicking the pull-down will show some options and R should be one of the options. We'll choose R, as shown in Figure 4.
Next, we'll click the Select button.
To connect to SingleStore, we'll use JDBC, as follows:
library(RJDBC)
This will also load DBI
and rJava
:
Loading required package: DBI
Loading required package: rJava
Next, we'll download the SingleStore JDBC Client and save it in a jars
directory:
# URL of the JDBC driver file
driver_url <- "https://repo1.maven.org/maven2/com/singlestore/singlestore-jdbc-client/1.2.1/singlestore-jdbc-client-1.2.1.jar"
# Set the JDBC driver class name
driver <- "com.singlestore.jdbc.Driver"
# Local directory to save the driver file
local_dir <- "jars"
dir.create(local_dir, showWarnings = FALSE, recursive = TRUE)
# Check if the driver file already exists
driver_file <- file.path(
local_dir,
"singlestore-jdbc-client-1.2.1.jar"
)
if (!file.exists(driver_file)) {
# Download the JDBC driver file if it doesn't exist
download.file(
driver_url,
destfile = driver_file,
mode = "wb",
quiet = TRUE
)
}
# Check if the driver file has been downloaded successfully
if (file.exists(driver_file)) {
print("Driver file downloaded successfully")
} else {
print("Failed to download the driver file")
}
Now we'll create the connection details to SingleStore:
host <- "<HOST>"
port <- 3306
database <- "r_demo"
user <- "admin"
password <- "<PASSWORD>"
url <- paste0("jdbc:singlestore://", host, ":", port, "/", database)
Replace <HOST>
and <PASSWORD>
with the values for your environment. These values can be obtained from the workspace using Connect > SQL IDE.
We'll now prepare the connection to SingleStore.
# Establish the JDBC connection
conn <- dbConnect(
drv = JDBC(driver, driver_file),
url = url,
user = user,
password = password
)
Next, we'll install rgpt3:
remotes::install_github("ben-aaron188/rgpt3", force = TRUE)
and import some libraries:
library(jsonlite)
library(rgpt3)
We'll now enter and save our OpenAI Key
:
# Prompt for the OpenAI Key
openai_key <- readline(prompt = "OpenAI Key:")
# Specify the file path
file_path <- "access_key.txt"
# Write the string to the file with a newline
writeLines(openai_key, file_path)
and check that it was saved correctly:
rgpt_authenticate("access_key.txt")
Next, we'll load a built-in dataset:
data("travel_blog_data")
We'll select some text from the dataset and create OpenAI embeddings:
my_text = travel_blog_data$gpt_content[1]
my_embeddings = rgpt_single_embedding(
input = my_text,
model = "text-embedding-3-small"
)
Next, we'll create a new DataFrame:
# Create a DataFrame with one row
article <- data.frame(
id = travel_blog_data$n[1],
text = my_text,
vector = as.character(toJSON(my_embeddings))
)
and then write the DataFrame to SingleStore:
# Write the DataFrame to the database
dbWriteTable(conn, "articles", article, append = TRUE)
We'll read the data back and check it, as follows:
# Read the data from the database
article_from_db <- dbReadTable(conn, "articles")
# Print the summary of the DataFrame
str(article_from_db)
Example output:
'data.frame': 1 obs. of 3 variables:
$ id : num 1
$ text : chr "Title: Paws Across the UK: A Tail-Wagging Journey Through Britain\n\nAs someone who firmly believes that advent"| __truncated__
$ vector: chr "[0.0215000007,0.0126999998,0.0524000004,0.0340999998,-0.0328000002,-0.0188999996,-0.0214000009,0.0250000004,-0."| __truncated__
We can also create embeddings for multiple rows, as follows:
multiple_embeddings = rgpt_embeddings(
input_var = travel_blog_data$gpt_content,
id_var = travel_blog_data$n,
param_model = "text-embedding-3-small"
)
Next, we'll create a new DataFrame:
# Convert each row of embeddings to a JSON array
embeddings_json <- apply(multiple_embeddings[, 1:1536], 1, function(row) {
toJSON((row))
})
# Create the DataFrame with id and vector columns
articles <- data.frame(
id = multiple_embeddings$id,
vector = embeddings_json
)
# Merge articles with travel_blog_data based on id and n columns
merged_data <- merge(articles, travel_blog_data[, c("n", "gpt_content")], by.x = "id", by.y = "n", all.x = TRUE)
# Update the articles DataFrame with the merged gpt_content
articles$text <- merged_data$gpt_content
# Reorder the columns to match the database schema
articles <- articles[, c("id", "text", "vector")]
and then write the DataFrame to SingleStore:
# Write the DataFrame to the database
dbWriteTable(conn, "articles", articles, append = TRUE)
We'll read the data back and check it, as follows:
# Read the data from the database
articles_from_db <- dbReadTable(conn, "articles")
# Print the summary of the DataFrame
str(articles_from_db)
Example output:
'data.frame': 11 obs. of 3 variables:
$ id : num 10 7 4 6 2 1 9 8 3 5 ...
$ text : chr "# A Tail-Wagging Tour: Roaming the UK on Four Paws\n\nAs a devoted dog parent, planning a trip that's both exci"| __truncated__ "### Paws Across the UK: A Tail-Wagging Adventure\n\n#### Introduction\nHello fellow adventurers and animal love"| __truncated__ "**Paws Across the Pond: A Dog's Journey Through the UK**\n\nEmbarking on an adventure across the United Kingdom"| __truncated__ "# A Tail-Wagging Tour: A Dog's Journey Through the UK\n\n### Day 1: London Calling\n\nMy journey began in the b"| __truncated__ ...
$ vector: chr "[0.0399999991,0.0186000001,0.0647,0.0353000015,-0.0469999984,-0.0313999988,-0.0147000002,0.0162000004,-0.012299"| __truncated__ "[0.0282000005,0.0137,0.0423999988,0.0421000011,-0.0549000017,-0.00639999984,-0.0324000008,0.0272000004,-0.00680"| __truncated__ "[0.0339000002,0.00939999986,0.0269000009,0.0445999987,-0.0397000015,-0.00779999979,-0.0337000005,0.0299999993,-"| __truncated__ "[0.0245999992,0.00980000012,0.0460000001,0.0223999992,-0.0593000017,-0.0110999998,-0.0113000004,-0.00400000019,"| __truncated__ ...
There are 11 rows. One row was added earlier and 10 rows were added later.
Finally, we'll close the JDBC connection:
# Close the JDBC connection
dbDisconnect(conn)
Summary
In this short article, we've seen how to create vector embeddings using R and OpenAI. We've also seen how we can store and retrieve the vector embeddings using SingleStore.
Featured ones: