
Fault Tolerance with Raft and no Single Point of Failure

Published at: 9/5/2024
Categories: yugabytedb, distributed, postgres, database
Author: franckpachot

The two layers, SQL processing and distributed storage, are deployed in one binary: yb-tserver. The yb-tservers in YugabyteDB play a pivotal role, handling both query processing and data storage in a linearly scalable manner. The shared components, such as cluster metadata and the PostgreSQL catalog, are stored separately in the yb-master component. The yb-master, functioning like a single tablet, is replicated for high availability but does not need to scale. The yb-tservers cache the catalog and metadata they need to run SQL statements efficiently.

When deploying on Kubernetes, yb-master and yb-tserver are two StatefulSets:
(Figure: the yb-master and yb-tserver StatefulSets in a Kubernetes deployment)
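
For a quick sanity check that both StatefulSets are up, here is a minimal sketch using the Kubernetes Python client; the yb-demo namespace is an assumption and depends on how the cluster was installed.

```python
# Minimal sketch: list the YugabyteDB StatefulSets and their ready replicas.
# Assumes the Kubernetes Python client and a namespace named "yb-demo".
from kubernetes import client, config

config.load_kube_config()          # use the current kubeconfig context
apps = client.AppsV1Api()

for sts in apps.list_namespaced_stateful_set(namespace="yb-demo").items:
    # Expect two StatefulSets, e.g. yb-master and yb-tserver
    print(f"{sts.metadata.name}: {sts.status.ready_replicas}/{sts.spec.replicas} ready")
```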

The yb-tserver tablets and the single yb-master tablet are replicated to ensure high availability. The fault tolerance of the database, a critical aspect of its design, is determined by the placement information (cloud, region, and zone) and the replication factor. For instance, with a replication factor of 3 over three availability zones, the database can withstand the failure of one availability zone. There is one yb-master per zone and multiple yb-tservers. Similarly, with a replication factor of 5 across three regions, the database remains accessible if one entire region becomes inaccessible, and it can withstand an additional failure in the remaining regions. This is typical resilience in a cloud environment: transient failures happen frequently, and you may want to remain resilient even during a less frequent regional outage.
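
As a rough illustration (not YugabyteDB code), the quorum arithmetic behind these numbers fits in a few lines of Python: a Raft group of RF replicas stays available as long as a majority, floor(RF/2)+1, can still communicate.

```python
# Illustrative sketch of replication-factor fault tolerance (not YugabyteDB code).

def quorum(replication_factor: int) -> int:
    """Minimum number of replicas that must be reachable for Raft to make progress."""
    return replication_factor // 2 + 1

def fault_tolerance(replication_factor: int) -> int:
    """Maximum number of replica failures the Raft group can survive."""
    return replication_factor - quorum(replication_factor)

for rf in (3, 5):
    print(f"RF={rf}: quorum of {quorum(rf)}, tolerates {fault_tolerance(rf)} failed replica(s)")
# RF=3: quorum of 2, tolerates 1 failed replica(s)  -> survives the loss of one zone out of three
# RF=5: quorum of 3, tolerates 2 failed replica(s)
```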

YugabyteDB employs the Raft algorithm to elect a leader for each tablet. Leader-based reads offer superior performance for SQL applications, where read activity is critical: in SQL, even write operations must read first to ensure ACID properties, so read performance is essential.
Reads and writes are directed to the tablet leader, which synchronizes replication and eliminates the need to contact additional nodes for consistent reads. Writes are synchronized to the Raft followers and are acknowledged only after a majority of the tablet peers confirm that the write operation reached their local storage. This approach delivers strong consistency along with high performance. If the leader becomes inaccessible due to a failure, a quorum of nodes is required only to elect a new leader, without any data loss, and reads can then be served from the new leader.
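
To make this write path concrete, here is a toy sketch (class and variable names are hypothetical, not YugabyteDB internals): the leader appends to its own log, replicates to the followers, and acknowledges the write only once a majority of the Raft group has persisted it.

```python
# Toy sketch of leader-based replication: a write is acknowledged only after
# a majority of the Raft group (leader included) has persisted it.

class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def append(self, entry):
        """Persist the entry to the local Raft log; fail if the replica is down."""
        if not self.healthy:
            return False
        self.log.append(entry)
        return True

def leader_write(leader, followers, entry):
    """Replicate the entry; return True only if a majority acknowledged it."""
    group_size = 1 + len(followers)
    acks = 1 if leader.append(entry) else 0          # the leader's own log counts
    acks += sum(f.append(entry) for f in followers)  # synchronous replication
    return acks >= group_size // 2 + 1               # Raft majority

leader = Replica("tablet-1-leader")
followers = [Replica("tablet-1-follower-a"), Replica("tablet-1-follower-b", healthy=False)]
print(leader_write(leader, followers, "UPDATE k=1"))  # True: 2 of 3 replicas persisted it
```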

Each tablet forms an independent Raft group, generating its own Write-Ahead Log (WAL): the Raft log distributed to the tablet peers. As the database grows, tablets can split, allowing for linear scalability, unlike traditional databases with a single WAL stream for the entire database. It's essential to note that this horizontal scalability adheres to the CP side of the CAP theorem: in the event of a network partition, only the part with a quorum remains available to ensure consistency, offering high availability but not full availability.
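
The "one Raft group and one WAL per tablet" idea can be pictured with a toy sketch (hypothetical names, not the actual tablet-splitting implementation): splitting a tablet produces two smaller tablets, each advancing its own independent Raft log.

```python
# Toy sketch: each tablet carries its own Raft log (WAL), and a split yields
# two independent Raft groups. Illustrative only; real splits also move data.
from dataclasses import dataclass, field

@dataclass
class Tablet:
    key_range: tuple                               # (start, end) of the key space covered
    raft_log: list = field(default_factory=list)   # this tablet's own WAL

    def write(self, key, value):
        assert self.key_range[0] <= key < self.key_range[1]
        self.raft_log.append(f"SET {key}={value}")

    def split(self):
        """Split the key range in half; each half becomes an independent Raft group."""
        start, end = self.key_range
        mid = (start + end) // 2
        return Tablet((start, mid)), Tablet((mid, end))

t = Tablet((0, 1024))
t.write(100, "a")
left, right = t.split()         # the database grew: now two Raft groups, two WALs
left.write(100, "b")
right.write(900, "c")
print(len(left.raft_log), len(right.raft_log))  # each log advances independently
```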


There have been claims that Raft may encounter lost updates, but this is incorrect. Raft is a consensus protocol that ensures all servers agree on a value; if they cannot reach a consensus, the write is not acknowledged. YugabyteDB employs additional techniques on top of Raft to guarantee this. Leaders hold a lease to prevent split-brain scenarios: a newly elected leader must wait for a two-second lease before accepting reads and writes.
Additionally, clock synchronization with a Hybrid Logical Clock (combining Lamport clocks for the logical part and the maximum clock skew for the physical part) helps YugabyteDB remain linearly scalable while guaranteeing consistency. These techniques ensure consistency at the cost of briefly reducing availability in the event of a failure.
A simultaneous failure of all tablet peers is improbable, but you can choose to fsync each WAL write for additional protection, at the cost of extra latency.
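
For readers unfamiliar with Hybrid Logical Clocks, here is a compact sketch of the standard HLC update rules (physical time combined with a Lamport-style logical counter); it illustrates the general idea, not YugabyteDB's implementation.

```python
# Compact sketch of a Hybrid Logical Clock: physical time combined with a
# Lamport-style logical counter, so timestamps never move backwards even if
# local clocks drift. Illustrative only, not YugabyteDB code.
import time

class HybridLogicalClock:
    def __init__(self):
        self.l = 0  # highest physical time observed so far (microseconds)
        self.c = 0  # logical counter ordering events that share the same l

    def _physical_now(self):
        return int(time.time() * 1_000_000)

    def now(self):
        """Timestamp for a local event or an outgoing message."""
        pt = self._physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, msg_l, msg_c):
        """Merge the timestamp carried by an incoming message."""
        pt = self._physical_now()
        new_l = max(self.l, msg_l, pt)
        if new_l == self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

clock = HybridLogicalClock()
print(clock.now())                           # local event
print(clock.update(clock.l + 1_000_000, 0))  # message from a node whose clock runs 1 s ahead
```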

YugabyteDB's resilience relies on the Raft consensus. It can withstand failures as long as a majority of the replicas can communicate. Nodes that are not part of this quorum will not serve consistent reads or writes, which guarantees no data loss within the defined fault tolerance.


Further reading: Raft Protocol: What is the Raft Consensus Algorithm? (yugabyte.com)

The Raft consensus algorithm allows a distributed system to agree on values in the presence of failures while ensuring consistent performance.