dev-resources.site
for different kinds of informations.
Quick tip: Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB
Abstract
SingleStoreDB has supported vector functions since 2017. In this article, we'll see examples of using Dot Product, Euclidean Distance and Cosine Similarity in SingleStoreDB to assess vector relationships. New vector capabilities will be available in the next major release of SingleStoreDB - stay tuned.
Introduction
Dot Product
The Dot Product is a mathematical procedure that combines two lists of numbers into a single value. This is achieved by multiplying corresponding numbers from each list and then adding up the products. This operation is useful for tasks like measuring the commonality between two sets of values or assessing the influence of one set on another.
Euclidean Distance
Euclidean Distance is a mathematical metric that measures the straight-line distance between two points in a Cartesian coordinate system. To compute it, square the differences between the x and y coordinates of the two points, sum these squared differences, and then take the square root of the result. The final output is a single numerical representation of the spatial separation between the points, like the direct walking distance between them on a graph.
Cosine Similarity
Cosine Similarity is a mathematical measure that gauges the similarity between two sets, such as two lists of words. Instead of focusing on the specific words, it considers the angle between the sets, irrespective of their sizes. Visualise these sets as vectors in a multi-dimensional space. Cosine Similarity essentially assesses how well the directions of these vectors align. High similarity arises when the vectors point in the same direction, while low similarity occurs when they are perpendicular. This method is commonly employed in comparing documents or texts, providing insights into whether they discuss similar topics, regardless of the specific wording.
SingleStoreDB
SingleStoreDB provides direct support for Dot Product and Euclidean Distance using the vector functions DOT_PRODUCT
and EUCLIDEAN_DISTANCE
, respectively. Cosine Similarity is supported by combining the DOT_PRODUCT
and SQRT
functions. In this article, we'll see some simple examples of each function.
For further details, see:
Create a SingleStoreDB Cloud account
A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:
- Workspace Group Name: Iris Demo Group
- Cloud Provider: AWS
- Region: US East 1 (N. Virginia)
- Workspace Name: iris-demo
- Size: S-00
Create a Database and Table
In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db
, as follows:
CREATE DATABASE IF NOT EXISTS iris_db;
We'll also create the iris
table, as follows:
USE iris_db;
CREATE TABLE IF NOT EXISTS iris (
vector BLOB,
species VARCHAR(20)
);
The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length
, sepal_width
, petal_length
and petal_width
. We can store these four column values together using JSON_ARRAY_PACK
in a BLOB format, as follows:
INSERT INTO iris VALUES
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.1,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.6,1.4,0.2]'),'Iris-setosa'),
...
(JSON_ARRAY_PACK('[6.7,3,5.2,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.5,5,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.2,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,3.4,5.4,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'),'Iris-virginica');
Only the first five and last five rows are shown above. The complete INSERT
code listing is available in a GitHub Gist.
Query 1
First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length
, sepal_width
, petal_length
and petal_width
. We'll use the values [5.9,3,5.1,1.8]
from the last row of the iris
table, shown above.
Dot Product
SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = DOT_PRODUCT(vector, vector);
This SQL query retrieves the species
from the iris
table for rows where the Dot Product of the vector in the vector
column with the specified vector [5.9,3,5.1,1.8]
is equal to the Dot Product of the vector
column with itself. Since there's only one row with the specified values in the vector
column, the query essentially identifies the species
for that particular set of values in the iris
table.
Euclidean Distance
SELECT species
FROM iris
WHERE EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = 0;
This SQL query retrieves the species
from the iris
table for rows where the Euclidean Distance between the vector in the vector
column and the specified vector [5.9,3,5.1,1.8]
is equal to 0. In other words, they are one and the same.
Cosine Similarity
SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'), JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'))) = 1;
This SQL query retrieves the species
values from the iris
table for rows where the normalised Dot Product of the vector in the vector
column with the specified vector [5.9,3,5.1,1.8]
equals 1. The condition checks for parallel vectors, indicating a high similarity or identical direction between the two vectors.
The result in each case should be:
+----------------+
| species |
+----------------+
| Iris-virginica |
+----------------+
Query 2
Now, let's use some fictitious data values [5.2,3.6,1.5,0.3]
to make a prediction.
Dot Product
SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) DESC
LIMIT 1;
This SQL query retrieves the species
from the iris
table and orders the results in descending order based on the Dot Product between the vectors in the vector
column and the specified vector [5.2,3.6,1.5,0.3]
. The LIMIT 1 ensures that only the top result is returned.
The result should be:
+----------------+
| species |
+----------------+
| Iris-virginica |
+----------------+
Euclidean Distance
SELECT species
FROM iris
ORDER BY EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))
LIMIT 1;
This SQL query retrieves the species
from the iris
table and orders the results based on the Euclidean Distance between the vectors in the vector
column and the specified vector [5.2,3.6,1.5,0.3]
. The LIMIT 1 ensures that only the top result, closest in Euclidean Distance, is returned.
The result should be:
+-------------+
| species |
+-------------+
| Iris-setosa |
+-------------+
Cosine Similarity
If we want to use DOT_PRODUCT
and achieve a similar result to EUCLIDEAN_DISTANCE
, we can use the following approach. The DOT_PRODUCT
doesn't directly represent distance, so we need to modify the query. One possible adjustment is to consider the angle between vectors.
SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'), JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))) DESC
LIMIT 1;
This SQL query normalises the DOT_PRODUCT
by dividing it by the product of the magnitudes of the vectors. This adjustment helps to account for differences in vector magnitudes and makes the DOT_PRODUCT
more comparable to a Cosine Similarity, which considers the angle between vectors. The ordering is done in descending order to prioritise higher similarity.
The result should be:
+-------------+
| species |
+-------------+
| Iris-setosa |
+-------------+
Cleanup:
DROP TABLE IF EXISTS iris;
DROP DATABASE IF EXISTS iris_db;
Summary
In this short article, we've seen several examples of how to use SingleStoreDB's built-in DOT_PRODUCT
and EUCLIDEAN_DISTANCE
vector functions. We've also seen how we can easily represent Cosine Similarity by combining the DOT_PRODUCT
and SQRT
functions.
We've used the functions to find an exact match and made predictions based on new, previously unknown, values.
Featured ones: