
Fast PITR and MVCC reads with Key-Value LSM Tree

Published at: 9/9/2024
Categories: yugabytedb, mvcc, database, pitr
Author: franckpachot

YugabyteDB uses PostgreSQL code to handle SQL queries without the challenges associated with PostgreSQL storage, such as bloat, vacuum, and transaction ID wraparound. It also avoids the complexities of a separate undo segment, which can lead to lengthy rollbacks and recovery. The storage system in YugabyteDB, built on the RocksDB LSM tree, is designed for distribution, replication, and auto-sharding. It leverages modern shared-nothing infrastructure with solid-state disks (SSD), common in public and private clouds, where random reads are fast and sequential writes perform better than random writes.

Unlike traditional databases that store tuples in fixed blocks, LSM Trees store them in a key-value format, allowing new versions to be added to a key. To version the new values, YugabyteDB includes the commit time's Hybrid Logical Clock timestamp in the key. This timestamp determines the visibility of Multi-Version Concurrency Control (MVCC) for other transactions. The background compaction process removes intermediate MVCC versions after the specified retention period. When querying, the MVCC snapshot can be built without additional random reads to rollback segments or new tuple copies.
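As a rough analogy in plain SQL (a conceptual model only, not the actual DocDB key encoding; the kv relation, its columns, and the timestamps are invented for illustration): each write adds a new (key, commit time) entry, and an MVCC read at a given read time picks, for each key, the newest entry at or before that time.

with kv(k, commit_hlc, v) as ( values
  (42, 101, 'first value'),
  (42, 102, 'second value'),
  (42, 103, 'third value')
)
select distinct on (k) k, commit_hlc, v
from kv
where commit_hlc <= 102          -- the transaction's read time
order by k, commit_hlc desc;     -- versions of a key are scanned newest-first

This returns (42, 102, 'second value'), the newest version committed at or before the read time, which is the visibility rule that the HLC timestamp stored in the key implements.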


Consider this extreme scenario: I execute a query on the same row twice within a repeatable read transaction while multiple concurrent transactions carry out updates on this row a hundred thousand times. Block-based databases must review all the intermediate versions with random reads to locate the correct one. However, YugabyteDB can quickly access any version since the commit timestamp is part of the key.

I create a one-row table:

create extension orafce;
create table speaking_clock (id bigint, primary key(id asc), message text);
insert into speaking_clock values ( 42 );

I use the following script, run with pgbench -t 10000 -c 10, to update that same row a hundred thousand times from concurrent transactions:

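-- update.sql: each execution writes a new version of the same single row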
update speaking_clock set message = (
 select to_char(now(),'"At the third stroke, it will be "HH24SP hours MISP "minutes and" SSSP "seconds"')
);

I use explain analyze to show the time for two selects:

yugabyte=# begin transaction isolation level repeatable read;
BEGIN

yugabyte=*# explain (analyze, dist, debug, costs off, summary off)
yugabyte-*# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=1.163..1.165 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.019 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 2.000
   Metric rocksdb_number_db_next: 3.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000
   Metric rocksdb_iter_bytes_read: 354.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 65.000, count: 1.000
(12 rows)

yugabyte=*# \! pgbench -nt10000 -c10 -f update.sql --max-tries=10

scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 10
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
number of transactions retried: 5 (0.005%)
total number of retries: 5
latency average = 105.273 ms
initial connection time = 380.216 ms
tps = 94.991185 (without initial connection time)

yugabyte=*# explain (analyze, dist, debug, costs off, summary off)
yugabyte-*# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=3.648..3.650 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.913 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 3.000
   Metric rocksdb_number_db_next: 6.000
   Metric rocksdb_number_db_seek_found: 2.000
   Metric rocksdb_number_db_next_found: 6.000
   Metric rocksdb_iter_bytes_read: 782.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 118.000, count: 1.000
(12 rows)

yugabyte=*# commit;
COMMIT


The first statement, which reads the table's current value, takes 1.165 milliseconds (2 seeks and 3 nexts in the LSM tree): it immediately finds the key's current version.
The second statement, executed after 100,000 concurrent updates, has to find the version that existed before those updates, so that it reads at the same point in time as the first statement of the repeatable read transaction. It took 3.650 milliseconds (3 seeks and 6 nexts in the LSM tree).
The 100,000 intermediate MVCC versions only increased the response time from about 1 ms to about 3.7 ms.

After the COMMIT, and before any garbage collection, I run the same query again; it now reads the current state:

yugabyte=# explain (analyze, dist, debug, costs off, summary off)
yugabyte-# select * from speaking_clock where id=42;

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Index Scan using speaking_clock_pkey on speaking_clock (actual time=1.005..1.018 rows=1 loops=1)
   Index Cond: (id = 42)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.885 ms
   Storage Table Rows Scanned: 1
   Metric rocksdb_number_db_seek: 2.000
   Metric rocksdb_number_db_next: 3.000
   Metric rocksdb_number_db_seek_found: 1.000
   Metric rocksdb_number_db_next_found: 3.000
   Metric rocksdb_iter_bytes_read: 354.000
   Metric docdb_keys_found: 1.000
   Metric ql_read_latency: sum: 53.000, count: 1.000
(12 rows)

The versions of each key are sorted by MVCC timestamp in descending order, so the latest version is the first one read.
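These intermediate versions remain readable only within the MVCC history retention window mentioned earlier; once it expires, compactions can garbage-collect them and a snapshot older than the window can no longer be read. As a sketch, assuming the usual yb-tserver flag and its typical default (verify both for your release), the window is set like this:

# sketch only: keep 15 minutes of MVCC history on each yb-tserver
# (other required yb-tserver flags omitted)
yb-tserver ... --timestamp_history_retention_interval_sec=900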


Another benefit of this storage method is that RocksDB stores the data in immutable SST files. This means that taking a snapshot for cloning or point-in-time recovery is a fast operation because the SST files serve as the snapshot. Moreover, incremental backups come for free, as the SST files are already incremental.
When combined with MVCC, which allows fast flashback queries to any point within the retention period, obtaining a consistent database state from snapshots taken on different nodes becomes easy. The Hybrid Logical Clock defines the cluster-wide point in time for the clone or recovery process.
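For example, point-in-time recovery is driven by snapshot schedules managed with yb-admin. The following is only a sketch: the master addresses, the interval and retention (in minutes), the database name, the schedule id, and the timestamp are placeholders to adapt to your cluster.

# create a schedule for the ysql.yugabyte database: a snapshot every 60 minutes, retained for 1440 minutes (1 day)
yb-admin --master_addresses $MASTER_ADDRESSES create_snapshot_schedule 60 1440 ysql.yugabyte

# restore the database to a cluster-wide consistent point in time within the retention window
yb-admin --master_addresses $MASTER_ADDRESSES restore_snapshot_schedule <schedule-id> "2024-09-09 10:00:00+00"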

Point-in-time recovery | YugabyteDB Docs (docs.yugabyte.com): restore data to a specific point in time in YugabyteDB.