
1 billion rows challenge in MySQL

Published at
1/29/2024
Categories
mysql
sorting
1brc
Author
ftisiot

Earlier this month I wrote a piece on solving Gunnar Morling's interesting 1 billion row challenge in PostgreSQL and ClickHouse. Since Aiven also provides MySQL, I wanted to give it a try. TL;DR: the results are much slower than both PostgreSQL and ClickHouse; do you have any suggestions on how to improve them?

Alert: the following is NOT a benchmark! The test was done with a default MySQL installation and NO optimization. This blog only shows the technical viability of a solution.

Need a FREE MySQL database? 🦀 Check Aiven's FREE plans! 🦀
Need to optimize your SQL query? 🐧 Check EverSQL! 🐧

Install MySQL on Mac

The first step is to have a handy MySQL instance. I could have used the Aiven FREE tier but, as in the previous PostgreSQL and ClickHouse examples, I decided to install it locally. To do so, you can execute:

brew install mysql

then we can start the service with:

brew services start mysql

Connect to MySQL and setup a database

Once the installation process is completed, we can connect to the local MySQL with the following command:

mysql -u root

For this test we're going to use the root user. However, it is highly recommended to create a non-root user for any serious work with a database.
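For example, a minimal sketch of creating a dedicated user (the user name, password, and grant scope below are illustrative assumptions, not from the original post):

-- Hypothetical non-root user for the challenge database
CREATE USER 'brc_user'@'localhost' IDENTIFIED BY 'choose-a-strong-password';
-- Limit the privileges to the challenge database only
GRANT ALL PRIVILEGES ON `1brow`.* TO 'brc_user'@'localhost';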

We can now create a database with:

CREATE DATABASE 1brow;

And start using it with:

USE 1brow;

Loading the data in MySQL

In order to properly solve the challenge we need to load the data into MySQL. We can create a table called TEST with the needed columns with:

CREATE TABLE TEST (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  city VARCHAR(100),
  temperature FLOAT
);

And populate it with:

LOAD DATA LOCAL INFILE "measurements.txt.new" IGNORE
INTO TABLE TEST
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(city, temperature);

If you receive the error ERROR 3948 (42000): Loading local data is disabled; this must be enabled on both the client and server sides, you can enable it on the server side with:

SET GLOBAL local_infile=1;
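Note that local data loading must be enabled on the client side too; this is why the timing command later in the post passes the --local-infile=1 flag.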

Compared to PostgreSQL and ClickHouse, MySQL doesn't seem to have a native option to read from an external CSV file. The MySQL CSV engine is available, but it requires you to move the CSV file into MySQL's data folder.
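If you wanted to try the CSV engine route, a quick sketch to find out where MySQL's data folder actually is:

-- Show the server's data directory, where the CSV engine expects its files
SELECT @@datadir;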

MySQL Results

Once the data is loaded, we can use a query similar to the PostgreSQL and ClickHouse ones to produce the correctly ordered output.

WITH AGG AS (
    SELECT city,
           MIN(temperature) min_measure,
           CAST(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           MAX(temperature) max_measure
    FROM TEST
    GROUP BY city
)
SELECT GROUP_CONCAT(city, '=', CONCAT(min_measure, '/', mean_measure, '/', max_measure) ORDER BY city SEPARATOR ', ')
FROM AGG
;

The above query uses GROUP_CONCAT, which returns the concatenated list of cities once the min/mean/max measures have been calculated.
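One caveat: GROUP_CONCAT truncates its result at group_concat_max_len bytes (1024 by default), which is far too small for the concatenated list of all the stations. If the output looks cut off, a sketch of raising the limit for the current session before running the query (the 1MB value is an illustrative choice):

-- Allow up to 1MB of concatenated output
SET SESSION group_concat_max_len = 1024 * 1024;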

MySQL Timing

To get the timing I created a file called test.sql with the entire set of commands:

CREATE TABLE TEST (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  city VARCHAR(100),
  temperature FLOAT
);

SET GLOBAL local_infile=1;

LOAD DATA LOCAL INFILE "measurements.txt" IGNORE 
INTO TABLE TEST
COLUMNS TERMINATED BY ';'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(city, temperature);

WITH AGG AS (
    SELECT city,
           MIN(temperature) min_measure,
           CAST(AVG(temperature) AS DECIMAL(8,1)) mean_measure,
           MAX(temperature) max_measure
    FROM TEST
    GROUP BY city
)
SELECT GROUP_CONCAT(city, '=', CONCAT(min_measure, '/', mean_measure, '/', max_measure) ORDER BY city SEPARATOR ', ')
FROM AGG
;

And then timed it with:

time mysql --local-infile=1 -u root -vvv 1brow < test.sql
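The -vvv flag makes the mysql client echo each statement and report its execution time, which is how the per-phase timings below were collected.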

The timing is 43 min 57.99 sec, with over 33 min 32.15 sec spent on the ingestion phase and 10 min 25.77 sec on the query. The file test.ibd ends up being over 40GB, almost 4x the original size of the measurements.txt file (13GB). You can check the size of the table file with the following command:

ls -lh /usr/local/var/mysql/1brow/test.ibd
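Alternatively, a sketch of getting the same information from within MySQL (sizes reported by information_schema are approximations):

-- Approximate on-disk size of the tables in the 1brow schema, in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = '1brow';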


You might need to tweak the connect_timeout parameter in my.cnf, raising the connection timeout to 100 minutes, in order to wait for the loading and query process to finish.
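As a sketch, the same parameter the post mentions can also be raised at runtime, assuming 6000 seconds covers the 100 minutes (new connections pick the value up):

-- 100 minutes expressed in seconds
SET GLOBAL connect_timeout = 6000;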

Speed up MySQL timing by using the MEMORY Engine

MySQL has a MEMORY storage engine which avoids writing to disk and stores the entire table in RAM. As per the documentation, this is not a safe option for production workloads since:

Because the data is vulnerable to crashes, hardware issues, or power outages, only use these tables as temporary work areas or read-only caches for data pulled from other tables.

The only change we have to perform, in order to use the MEMORY storage engine, is to redefine our TEST table as:

CREATE TABLE TEST (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  city VARCHAR(100),
  temperature FLOAT
) ENGINE = MEMORY;

But, when making the change and retrying the script, you might hit the error ERROR 1114 (HY000) at line 9: The table 'test' is full, since the data doesn't fit in the memory allocated to MySQL. To avoid this we can set the following flag to raise the allocated memory to 64GB:

SET max_heap_table_size = 1024 * 1024 * 1024 * 64;
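Note that max_heap_table_size is read when the MEMORY table is created, so the value needs to be raised before running the CREATE TABLE statement (or applied to an existing table with ALTER TABLE).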

However, this didn't work either, and I stopped my test after waiting for over 40 minutes for just the loading phase.

Do you have ideas on how to speed up the process in MySQL? I'm all 👂 on X (ex Twitter) @ftisiot!
