RocksDB, Key-Value Storage, and Packed Rows: the backbone of YugabyteDB's flexible distributed tablets
The LSM tree implementation in YugabyteDB is based on RocksDB, a highly customizable datastore widely used in various databases (embedded or as an alternative backend). While others have reimplemented RocksDB in different programming languages, the original C++ implementation used by YugabyteDB remains the most efficient and integrates best with the PostgreSQL code, which is written in C.
In an LSM tree, all writes, such as inserts, updates, or deletes, are appended to a memory table without attempting to update existing blocks. This differs from traditional databases, where each write must pin a shared buffer before updating it in place. Writing to an LSM tree is fast and works well with Multi-Version Concurrency Control (MVCC) databases. In MVCC databases, an update amounts to inserting new column values, and a delete is represented by inserting a marker for the end of the row's life. Intermediate versions must be kept for a while to serve MVCC reads, so it is best to defer the in-place update. In YugabyteDB, the memory table is periodically flushed to Sorted String Table (SST) files, which are read with a merge sort and compacted in the background.
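The write and read paths described above can be illustrated with a toy model. This is a minimal sketch, not RocksDB's or YugabyteDB's actual code: the class `TinyLSM`, its `memtable_limit` parameter, and the use of a simple counter as a version number are all assumptions made for illustration.

```python
import heapq
from itertools import groupby

TOMBSTONE = object()  # marker appended by deletes (end of the row's life)


class TinyLSM:
    """Toy LSM tree: an append-only memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = []    # appended entries: (key, -seq, value)
        self.sst_files = []   # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.seq = 0          # counter standing in for a version timestamp

    def _append(self, key, value):
        # Every write is an append; nothing already written is modified.
        self.seq += 1
        self.memtable.append((key, -self.seq, value))
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def put(self, key, value):
        self._append(key, value)

    def delete(self, key):
        self._append(key, TOMBSTONE)

    def flush(self):
        # Sort by key, newest version first, and freeze as a new run.
        if self.memtable:
            self.sst_files.append(sorted(self.memtable, key=lambda e: e[:2]))
            self.memtable = []

    def _merged(self):
        # Merge sort across all runs and the current memtable.
        runs = self.sst_files + [sorted(self.memtable, key=lambda e: e[:2])]
        return heapq.merge(*runs, key=lambda e: e[:2])

    def scan(self):
        # For each key, the first entry in merge order is the newest version.
        for key, versions in groupby(self._merged(), key=lambda e: e[0]):
            _, _, value = next(versions)
            if value is not TOMBSTONE:
                yield key, value

    def compact(self):
        # Rewrite all runs as one; this toy version keeps every version.
        self.flush()
        self.sst_files = [list(self._merged())]
```

In this sketch, reads pay the cost of merging several runs, which is why background compaction matters: it reduces the number of runs a read has to consult.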
YugabyteDB has made various improvements to the RocksDB codebase. These include integrating data-model-aware bloom filters, optimizing range queries, and implementing a scan-resistant global cache, all of which significantly speed up read operations. Additionally, Write-Ahead Logging (WAL) and Multi-Version Concurrency Control (MVCC) are handled at a higher level, through the Raft log, rather than by RocksDB's own mechanisms. Compaction has been improved to perform MVCC garbage collection at the storage level, which avoids the vacuum and undo issues found in other databases.
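To make the idea of garbage collection during compaction concrete, here is a small sketch of one common rule: keep every version newer than a retention cutoff, keep only the newest version at or before the cutoff, and drop that one too if it is a delete marker. The function name `compact`, the `history_cutoff` parameter, and the use of `None` as a delete marker are assumptions for illustration, not YugabyteDB's implementation.

```python
from itertools import groupby

TOMBSTONE = None  # stands in for a delete marker in this sketch


def compact(entries, history_cutoff):
    """Drop versions that no MVCC read can need anymore.

    entries: (key, write_time, value) tuples, sorted by key and then
    by write_time descending (newest first), as a merge of runs would be.
    """
    out = []
    for _, versions in groupby(entries, key=lambda e: e[0]):
        kept_old = False
        for key, write_time, value in versions:
            if write_time > history_cutoff:
                # A reader at a time after the cutoff may still need it.
                out.append((key, write_time, value))
            elif not kept_old:
                # Newest version as of the cutoff: keep it, unless it only
                # records that the row no longer exists.
                kept_old = True
                if value is not TOMBSTONE:
                    out.append((key, write_time, value))
            # Anything older is invisible to every allowed read: garbage.
    return out
```

Because this cleanup happens while runs are being rewritten anyway, no separate pass over the live data is needed in this model.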
RocksDB is a key-value data store. The key holds the primary key of a table or the indexed columns of a secondary index. This key-value storage is not a NoSQL database: unlike NoSQL databases, it is transactional and operates beneath the SQL layer. In another article, I debunked the myth that distributed SQL databases run on top of NoSQL engines.
YugabyteDB stores tuples in RocksDB, unlike traditional databases that store them in fixed-size disk blocks. The key-value structure makes sharding and re-sharding easier by splitting ranges of keys. It also makes Multi-Version Concurrency Control (MVCC) more efficient, avoiding the need for rollback segments used by other databases to rebuild a previous image of an entire block.
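The reason sorted keys make re-sharding simple can be shown with a toy router. This is only a sketch under stated assumptions: `RangeShards`, its methods, and the choice of the median key as the split point are hypothetical, and keys are assumed to be unique.

```python
import bisect


class RangeShards:
    """Toy range sharding: each shard owns a contiguous range of sorted keys."""

    def __init__(self):
        self.split_points = []  # split_points[i] is the lowest key of shard i + 1
        self.shards = [[]]      # each shard is a sorted list of keys

    def shard_index(self, key):
        # Binary search over the split points routes a key to its shard.
        return bisect.bisect_right(self.split_points, key)

    def insert(self, key):
        bisect.insort(self.shards[self.shard_index(key)], key)

    def split(self, i):
        # Splitting is a cut in the sorted key range; no key changes position
        # relative to its neighbors, so no per-row rehashing is needed.
        shard = self.shards[i]
        mid = len(shard) // 2
        split_key = shard[mid]
        self.shards[i:i + 1] = [shard[:mid], shard[mid:]]
        self.split_points.insert(i, split_key)
        return split_key
```

With fixed-size blocks that mix unrelated rows, the same operation would require reading and redistributing individual rows; here it is a single boundary added to the routing table.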
Inserts in YugabyteDB group all column values into a single key-value pair, known as a 'packed row'. When a row is updated, only the new column values are appended, each under a different sub-key, which prevents the write amplification often seen in other databases, where the entire tuple must be copied. The background compaction process repacks them once the MVCC retention period allows intermediate versions to be eliminated. RocksDB has also been improved to eliminate redundant key prefixes while maintaining efficient forward scans and, to a lesser extent, backward scans, using restart blocks.
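A toy model may help picture how a packed row, per-column updates, and repacking interact. It is a sketch only: the tuple layout `(pk, column, time, value)`, the function names, and the integer timestamps are assumptions for illustration and do not reflect YugabyteDB's actual key encoding. Timestamps are assumed to be unique per row.

```python
def insert_row(store, pk, row, time):
    # One entry holds every column value: the "packed row" (column is None).
    store.append((pk, None, time, dict(row)))


def update_column(store, pk, column, value, time):
    # Only the changed column is appended, under a (pk, column) sub-key.
    store.append((pk, column, time, value))


def read_row(store, pk, read_time):
    # Rebuild the row as of read_time by applying entries in time order.
    row = {}
    for key, column, time, value in sorted(store, key=lambda e: e[2]):
        if key != pk or time > read_time:
            continue
        if column is None:
            row = dict(value)    # a packed row replaces everything before it
        else:
            row[column] = value  # a column update overlays the packed row
    return row


def repack(store, pk, cutoff):
    # What compaction might do once versions at or before cutoff are no
    # longer needed: fold them into a single packed row.
    merged = read_row(store, pk, cutoff)
    kept = [e for e in store if e[0] != pk or e[2] > cutoff]
    if merged:
        kept.append((pk, None, cutoff, merged))
    return kept
```

For example, after inserting `{"a": 1, "b": 2}` at time 10 and updating `b` at times 20 and 30, a read at time 25 sees `b = 3`, and repacking with a cutoff of 25 leaves two entries: one packed row and the update made at time 30.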
RocksDB is a high-performance, log-structured database engine written in C++. It uses arbitrarily sized byte streams for keys and values and is optimized for fast storage, such as flash drives and high-speed disk drives. RocksDB is adaptable to different workloads and has been significantly improved to provide a more flexible storage structure in YugabyteDB compared to traditional block storage in monolithic databases.