dev-resources.site
for different kinds of informations.
Quick tip: Cosine Similarity revisited in SingleStore
Abstract
In a previous article, we saw how to represent Cosine Similarity in SingleStore by combining the DOT_PRODUCT
and SQRT
functions. The SingleStore documentation also provides a way to implement a COSINE_SIMILARITY
function. In this article, we'll see how.
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: Iris Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: iris-demo
- Size: S-00
Create a Database and Table
In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db
, as follows:
CREATE DATABASE IF NOT EXISTS iris_db;
We'll also create the iris
table using the new VECTOR
data type, as follows:
USE iris_db;
CREATE TABLE IF NOT EXISTS iris (
vector VECTOR(4),
species VARCHAR(20)
);
The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length
, sepal_width
, petal_length
and petal_width
. We can store these four column values together, as follows:
INSERT INTO iris VALUES
('[5.1,3.5,1.4,0.2]','Iris-setosa'),
('[4.9,3,1.4,0.2]','Iris-setosa'),
('[4.7,3.2,1.3,0.2]','Iris-setosa'),
('[4.6,3.1,1.5,0.2]','Iris-setosa'),
('[5,3.6,1.4,0.2]','Iris-setosa'),
...
('[6.7,3,5.2,2.3]','Iris-virginica'),
('[6.3,2.5,5,1.9]','Iris-virginica'),
('[6.5,3,5.2,2]','Iris-virginica'),
('[6.2,3.4,5.4,2.3]','Iris-virginica'),
('[5.9,3,5.1,1.8]','Iris-virginica');
Only the first five and last five rows are shown above. The complete INSERT
code listing is available in a GitHub Gist.
Create a Cosine Similarity Function
From the product documentation, we'll now define two functions that we can use:
DELIMITER //
CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
squares VECTOR(4) = VECTOR_MUL(v, v);
length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
RETURN SCALAR_VECTOR_MUL(1/length, v);
END //
DELIMITER ;
and
DELIMITER //
CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END //
DELIMITER ;
In the SingleStore notebook environment, the code would be as follows:
CREATE OR REPLACE FUNCTION NORMALIZE(v VECTOR(4)) RETURNS VECTOR(4) AS
DECLARE
squares VECTOR(4) = VECTOR_MUL(v, v);
length FLOAT = SQRT(VECTOR_ELEMENTS_SUM(squares));
BEGIN
RETURN SCALAR_VECTOR_MUL(1/length, v);
END;
and
CREATE OR REPLACE FUNCTION COSINE_SIMILARITY(v1 VECTOR(4), v2 VECTOR(4)) RETURNS FLOAT AS
BEGIN
RETURN DOT_PRODUCT(NORMALIZE(v1), NORMALIZE(v2));
END;
In other words, remove the references to DELIMITER
and //
.
Query 1
First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length
, sepal_width
, petal_length
and petal_width
. We'll use the values [5.9,3,5.1,1.8]
from the last row of the iris
table, shown above.
SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.9,3,5.1,1.8]') DESC
LIMIT 1;
The result should be:
+----------------+
| species |
+----------------+
| Iris-virginica |
+----------------+
Query 2
Now, let's use some fictitious data values [5.2,3.6,1.5,0.3]
to make a prediction.
SELECT species
FROM iris
ORDER BY COSINE_SIMILARITY(vector, '[5.2,3.6,1.5,0.3]') DESC
LIMIT 1;
The result should be:
+-------------+
| species |
+-------------+
| Iris-setosa |
+-------------+
Cleanup:
DROP TABLE IF EXISTS iris;
DROP DATABASE IF EXISTS iris_db;
Summary
SingleStore provides direct support for the DOT_PRODUCT
and EUCLIDEAN_DISTANCE
functions. We can easily implement our own COSINE_SIMILARITY
function as shown in this example.
Featured ones: