dev-resources.site
for different kinds of informations.
1 billion rows challenge in MySQL
Earlier this month I wrote a piece on solving Gunnar Morling interesting 1 billion rows challenge in PostgreSQL and ClickHouse. Since Aiven provides also MySQL, I wanted to give it a try. TLDR; The results are much slower than PG and ClickHouse, do you have any suggestion on how to improve?
Alert: the following is NOT a benchmark! The test is done with default installations of both databases and NO optimization. The blog only shows the technical viability of a solution.
If you need a FREE MySQL database? 🦀 Check Aiven's FREE plans! 🦀
If you need to optimize your SQL query? 🐧 Check EverSQL! 🐧
Install MySQL on Mac
The first step is to have an handy MySQL, I could use the Aiven FREE tier, but, as for the previous PostgreSQL and ClickHouse example, decided to install it locally. To do it, you can execute:
brew install mysql
then we can start with
brew services start mysql
Connect to MySQL and setup a database
Once the installation process is completed, we can connect to the local MySQL with the following command:
mysql -u root
For this test we're going to use the
root
user. However is highly suggested to create a non root user for any sensible work with a database.
We can now create a database with:
CREATE DATABASE 1brow;
And start using it with:
USE 1brow;
Loading the data in MySQL
In order to properly solve the challenge we need to load the data into MySQL. We can create a table called TEST
with the needed columns with:
CREATE TABLE TEST (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
city VARCHAR(100),
temperature FLOAT
);
And populate it with:
LOAD DATA LOCAL INFILE "measurements.txt.new" IGNORE
INTO TABLE TEST
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
If you receive the error ERROR 3948 (42000): Loading local data is disabled; this must be enabled on both the client and server sides
, we can enable it with:
SET GLOBAL local_infile=1;
Compared to PostgreSQL and ClickHouse, MySQL doesn't seem to have a native option to read from an external CSV
file. The MySQL CSV Engine is available, but it requires you to move the CSV within MySQL's data folder.
MySQL Results
Once the data is loaded, we can use a similar query to the PostgreSQL and ClickHouse ones to perform the correct ordering.
WITH AGG AS(
SELECT city,
MIN(temperature) min_measure,
cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
MAX(temperature) max_measure
FROM test
GROUP BY city
)
SELECT GROUP_CONCAT(CITY, '=', CONCAT(min_measure,'/', mean_measure,'/', max_measure) ORDER BY CITY SEPARATOR ', ')
FROM AGG
;
The above query is using GROUP_CONCAT
allows to return the concatenated list of cities once min/mean/max measures have been calculated.
MySQL Timing
To get the timing I created a file called test.sql
with the entire set of commands:
CREATE TABLE TEST (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
city VARCHAR(100),
temperature float
) ENGINE=MEMORY;
SET GLOBAL local_infile=1;
LOAD DATA LOCAL INFILE "measurements.txt" IGNORE
INTO TABLE TEST
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(city, temperature);
WITH AGG AS(
SELECT city,
MIN(temperature) min_measure,
cast(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
MAX(temperature) max_measure
FROM test
GROUP BY city
)
SELECT GROUP_CONCAT(CITY, '=', CONCAT(min_measure,'/', mean_measure,'/', max_measure) ORDER BY CITY SEPARATOR ', ')
FROM AGG
;
And a then timed with:
time mysql --local-infile=1 -u root -vvv 1brow < test.sql
The timing is 43 min 57.99 sec
with over 33 min 32.15 sec
spent on the ingestion phase and 10 min 25.77 sec
on the query. The file test.ibd
being over 40GB
almost 4x times the original size of the measurements.txt
file (13GB
). You can check the size of the table file with the following command:
ls -lh /usr/local/var/mysql/1brow/test.idb
Alert: the following is NOT a benchmark! The test is done with default installations of both databases and NO optimization. The blog only shows the technical viability of a solution.
You might need to tweak the connect_timeout
parameter in my.cnf
to raise the connection timeout to 100 minutes in order to wait for the loading and query process to finish.
Speed up MySQL timing by using the MEMORY Engine
MySQL has a Memory storage engine which will avoid writing to disk and store the entire table in RAM. As per documentation, this is not a safe option for production workloads since:
Because the data is vulnerable to crashes, hardware issues, or power outages, only use these tables as temporary work areas or read-only caches for data pulled from other tables.
The only change we have to perform, in order to use the MEMORY
storage engine is to redefine our test
table as:
CREATE TABLE TEST (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
city VARCHAR(100),
temperature FLOAT
) ENGINE = MEMORY;
But, when making the change and retrying the script you might hit the ERROR 1114 (HY000) at line 9: The table 'test' is full
since the data doesn't fit the memory allocated to MySQL. To avoid this we can set the following flags to raise the allocated memory to 64GB
:
SET max_heap_table_size = 1024 * 1024 * 1024 * 64;
However, also this didn't work, and stopped my test after waiting for over 40 minutes
for just the loading time.
Do you have ideas on how to speed up the process in MySQL? I'm all 👂 on X (ex Twitter) @ftisiot!
Featured ones: