dev-resources.site

RocksDB, Key-Value Storage, and Packed Rows: the backbone of YugabyteDB's distributed tablets flexibility

Published at
9/8/2024
Categories
yugabytedb
rocksdb
lsm
distributed
Author
franckpachot

The LSM tree implementation in YugabyteDB is based on RocksDB, a highly customizable data store used in many databases, either embedded or as an alternative storage backend. While others have written their own RocksDB-like engines in other programming languages, YugabyteDB uses the original C++ implementation, which remains the most efficient and integrates best with PostgreSQL's C code.

In an LSM tree, all writes (inserts, updates, and deletes) are appended to a memory table without attempting to update existing blocks. This differs from traditional databases, where each write must pin a shared buffer before updating it in place. Writing to an LSM tree is fast and works well with Multi-Version Concurrency Control (MVCC) databases. In MVCC databases, an update is similar to inserting new column values, and a delete is represented by inserting a marker for the end of the row's life. Intermediate versions must be kept for a while to serve MVCC reads, so it's best to defer the in-place update. In YugabyteDB, the memory table is periodically flushed to Sorted String Table (SST) files, which are read with a merge sort and compacted in the background.
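The write path described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not RocksDB's or YugabyteDB's code: the names `ToyLSM`, `TOMBSTONE`, and `memtable_limit` are invented for illustration, the "SST files" are plain in-memory sorted lists, and only the latest value per key is kept rather than MVCC versions.

```python
import heapq
from bisect import bisect_left

TOMBSTONE = object()  # hypothetical marker written by deletes


class ToyLSM:
    """Toy LSM tree: one mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}      # newest writes, key -> value
        self.sst_files = []     # sorted [(key, value)] runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable; no existing run is modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write: a marker that hides older values.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Persist the memtable as a new immutable sorted run.
        if self.memtable:
            self.sst_files.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Check the newest data first: memtable, then runs newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for run in reversed(self.sst_files):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is TOMBSTONE else value
        return None

    @staticmethod
    def _merged(runs):
        # Merge sorted runs (oldest first). Tagging entries with -age makes
        # the newest version of a key come first among equal keys.
        tagged = [[(k, -age, v) for k, v in run] for age, run in enumerate(runs)]
        last_key = object()
        for k, _, v in heapq.merge(*tagged):
            if k != last_key:
                last_key = k
                yield k, v

    def scan(self):
        # Ordered read across all runs and the memtable, hiding deleted keys.
        runs = self.sst_files + [sorted(self.memtable.items())]
        return [(k, v) for k, v in self._merged(runs) if v is not TOMBSTONE]

    def compact(self):
        # Full compaction: rewrite all runs as one, dropping shadowed values
        # and tombstones (safe here only because every run is merged).
        live = [(k, v) for k, v in self._merged(self.sst_files) if v is not TOMBSTONE]
        self.sst_files = [live]
```

After writing `a=1`, `b=2`, overwriting `a=10`, and deleting `b`, the toy store holds two runs that contain stale entries; `compact()` rewrites them as a single run with only the live value of `a`, while reads return the same results before and after.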

YugabyteDB has made various improvements to the RocksDB codebase. These include integrating data-model-aware bloom filters, optimizing range queries, and implementing a scan-resistant global cache, all of which significantly speed up read operations. Additionally, Write-Ahead Logging (WAL) and Multi-Version Concurrency Control (MVCC) do not use the RocksDB mechanisms: they are handled at a higher level, with the Raft log. Compaction has been improved to execute MVCC garbage collection at the storage level, which avoids the vacuum and undo issues found in other databases.

DocDB performance enhancements to RocksDB | YugabyteDB Docs

Learn how DocDB enhances RocksDB for scale and performance.

docs.yugabyte.com

RocksDB is a key-value data store. The key holds the primary key of a table row or the indexed columns of a secondary index entry. This key-value storage is not a NoSQL database: unlike NoSQL databases, it is transactional and operates beneath the SQL layer. In another article, I debunked the myth that distributed SQL databases run on top of NoSQL engines:

Distributed SQL architecture and what Oracle didn't grasp about it

It's impressive to see how Spanner innovated with its database architecture, especially when many traditional database vendors still struggle to understand it even after ten years. Some database service providers believe they can achieve Distributed SQL by inserting a sharding coordinator between th

linkedin.com

YugabyteDB stores tuples in RocksDB, unlike traditional databases that store them in fixed-size disk blocks. The key-value structure makes sharding and re-sharding easier by splitting ranges of keys. It also makes Multi-Version Concurrency Control (MVCC) more efficient, avoiding the need for rollback segments used by other databases to rebuild a previous image of an entire block.
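To illustrate why sorted key-value storage makes re-sharding simple, here is a minimal sketch of splitting one range of keys into two and routing keys to the resulting shards. The functions `split_tablet` and `route` are hypothetical names for this example only; it covers range splitting on plain sorted keys and says nothing about how YugabyteDB actually moves data or encodes keys.

```python
from bisect import bisect_right


def split_tablet(rows):
    """Split a sorted list of (key, value) pairs at its middle key.

    Returns (split_key, left_rows, right_rows), where left_rows holds keys
    below split_key and right_rows holds keys at or above it.
    """
    mid = len(rows) // 2
    split_key = rows[mid][0]
    return split_key, rows[:mid], rows[mid:]


def route(key, split_keys):
    """Return the index of the shard that owns `key`.

    `split_keys` is the sorted list of split points; shard i owns the keys
    from split_keys[i-1] (inclusive) up to split_keys[i] (exclusive).
    """
    return bisect_right(split_keys, key)
```

Because the entries are already ordered by key, the split is a single cut in the sorted sequence, and routing needs only the list of split points, with no knowledge of fixed-size blocks.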

With the 'packed row' format, an insert in YugabyteDB stores all column values of a row as a single key-value entry. When updates are made, only the new column values are appended, each under a different sub-key, which prevents the write amplification often seen in other databases, where the entire tuple must be copied. The background compaction process repacks them once the MVCC retention period allows intermediate versions to be eliminated. RocksDB has also been improved to eliminate redundant key prefixes while maintaining efficient forward scans and, to a lesser extent, backward scans using restart blocks.
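The interplay between packed rows, per-column updates, and repacking at compaction can be modeled with a small sketch. Everything here is an assumption made for illustration: the entry layout `(key, timestamp, value)`, the integer timestamps, and the function names are invented, and the real DocDB key encoding and timestamps are different.

```python
def insert_row(store, pk, row, ts):
    # An insert writes one entry holding every column value (the packed row).
    store.append(((pk,), ts, dict(row)))


def update_column(store, pk, column, value, ts):
    # An update appends only the changed column under a sub-key (pk, column).
    store.append(((pk, column), ts, value))


def read_row(store, pk, read_ts):
    # Rebuild the row as of read_ts: start from the packed row, then apply
    # the column updates visible at that time, in timestamp order.
    row = None
    for key, ts, value in sorted(store, key=lambda entry: entry[1]):
        if key[0] != pk or ts > read_ts:
            continue
        if len(key) == 1:
            row = dict(value)
        elif row is not None:
            row[key[1]] = value
    return row


def compact(store, pk, cutoff_ts):
    # Repack every version at or before the retention cutoff into a single
    # packed row; versions newer than the cutoff are kept untouched.
    old = [e for e in store if e[0][0] == pk and e[1] <= cutoff_ts]
    packed = read_row(old, pk, cutoff_ts)
    if packed is None:
        return list(store)
    latest_ts = max(ts for _, ts, _ in old)
    kept = [e for e in store if not (e[0][0] == pk and e[1] <= cutoff_ts)]
    return [((pk,), latest_ts, packed)] + kept
```

In this model, inserting a row at time 10 and updating one column at times 20 and 30 produces three entries. Compacting with a cutoff of 20 leaves two: a repacked row reflecting the update at 20, and the newer update at 30. Reads at or after the cutoff return the same result as before compaction, while reads before the cutoff can no longer be served.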


RocksDB is a high-performance, log-structured database engine written in C++. It uses arbitrarily sized byte streams for keys and values and is optimized for fast storage, such as flash drives and high-speed disk drives. RocksDB is adaptable to different workloads and has been significantly improved to provide a more flexible storage structure in YugabyteDB compared to traditional block storage in monolithic databases.

RocksDB, an excellent choice for modern SQL Databases (LSM Tree vs. B-Tree)

RocksDB is a high-performance embedded data store that powers many modern databases. It is highly customizable and an ideal storage structure for more complex databases.

linkedin.com