Logo

dev-resources.site

for different kinds of informations.

Indexing for a Scalable Serialization Isolation Level

Published at
10/19/2024
Categories
sql
yugabytedb
postgres
database
Author
franckpachot
Categories
4 categories in total
sql
open
yugabytedb
open
postgres
open
database
open
Author
12 person written this
franckpachot
open
Indexing for a Scalable Serialization Isolation Level

The previous post used the primary key to record read intents in YugabyteDB. Serializable isolation can be scalable, but it requires optimal schema and indexes with a good knowledge of how the database works.
Indexes define the predicate locks for PostgreSQL Serializable Snapshot Isolation, and the primary key defines the range locks used by YugabyteDB's Two-Phase Commit.
Here an example taken from nathanl/postgresql_serializable_isolation.sql which simply inserts two rows from two transactions:


create table users(
  id serial not null primary key,
  username varchar not null
);

-- create index on users(username asc);

-- session 1

begin transaction isolation level serializable;
select * from users where username = 'alice';
insert into users ("username") values ('alice');

-- session 2
\! psql -c '\timing on' -c "begin transaction isolation level serializable; select * from users where username = 'bob'; insert into users ("username") values ('bob'); commit;" &

-- session 1

\! sleep 1
commit;
\! sleep 1
select * from users;

Enter fullscreen mode Exit fullscreen mode

I've run this at a serializable isolation level. Let's examine the performance in PostgreSQL and YugabyteDB, with and without an index on "username".

PostgreSQL Seq Scan

Without an index on "username", the SELECT does a Full Table Scan.
Session 2 can commit, but session 1 detects a conflict on commit:


COMMIT
Time: 2.844 ms

postgres=*# commit;
ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during commit attempt.
HINT:  The transaction might succeed if retried.

Enter fullscreen mode Exit fullscreen mode

This serializable error is a false positive. The two sessions read different values, and they are serializable. PostgreSQL employs predicate locks but only records the predicates used for the scan operation. The lock covers the entire table since it was read with a Seq Scan.

PostgreSQL Index Scan

I run the same with the index (uncomment the CREATE INDEX in the script), and there's no error:


COMMIT
Time: 2.581 ms

postgres=*# commit;
COMMIT
postgres=# \! sleep 1
postgres=# select * from users;
 id | username
----+----------
  1 | alice
  2 | bob

Enter fullscreen mode Exit fullscreen mode

In this case, PostgreSQL uses the index condition as a predicate lock, and they do not conflict between the two transactions.

Note that in this scenario, the "username" column should be declared unique instead of creating an additional index. This creates an implicit index and has the same effect:

create table users(
  id serial not null primary ley,
  username varchar not null unique
);

Enter fullscreen mode Exit fullscreen mode

YugabyteDB

YugabyteDB uses Wait-on-Conflict to prevent the serializable error, allowing both transactions to commit even without the index.

yugabyte=*# commit;
COMMIT

COMMIT
Time: 926.484 ms

Enter fullscreen mode Exit fullscreen mode

I waited one second before committing to the first session, and this delay is reflected in the response time of the second session, which waited for the first one. The reason for this is similar to what we have observed with PostgreSQL: the read lock intent is at the table level.

This is visible as STRONG_READ at the level of the "relation," which is blocked by the insert operation that acquires a WEAK_WRITE lock on the relation:

yugabyte=*# select
 substr(ybdetails->>'transactionid',1,8) transactionid , substr(ybdetails->>'blocked_by',1,8) blocked_by,
 relkind, locktype, mode, relname , ybdetails->'keyrangedetails'->'cols'
from pg_locks natural join (select oid relation, relname, relkind from pg_class) pg_class
order by transactionid, relkind desc, relname, case locktype when 'relation' then 1 when 'keyrange' then 2 when 'row' then 3 end, mode desc, locktype desc;
 transactionid | blocked_by | relkind | locktype |         mode         | relname | ?column?
---------------+------------+---------+----------+----------------------+---------+----------
 74e1329c      |            | r       | relation | WEAK_READ,WEAK_WRITE | users   | null
 74e1329c      |            | r       | relation | STRONG_READ          | users   | null
 74e1329c      |            | r       | row      | STRONG_WRITE         | users   | ["1"]
 74e1329c      |            | r       | row      | STRONG_READ          | users   | ["1"]
 c5611a22      | ["74e132   | r       | relation | STRONG_READ          | users   | null

Enter fullscreen mode Exit fullscreen mode

The index will add the range lock intents in the index, but there's still a conflict with the table lock intents at "relation" level:

yugabyte=*# select
 substr(ybdetails->>'transactionid',1,8) transactionid , substr(ybdetails->>'blocked_by',1,8) blocked_by,
 relkind, locktype, mode, relname , ybdetails->'keyrangedetails'->'cols'
from pg_locks natural join (select oid relation, relname, relkind from pg_class) pg_class
order by transactionid, relkind desc, relname, case locktype when 'relation' then 1 when 'keyrange' then 2 when 'row' then 3 end, mode desc, locktype desc;
 transactionid | blocked_by | relkind | locktype |         mode         |      relname       |                  ?column?
---------------+------------+---------+----------+----------------------+--------------------+---------------------------------------------
 02a108e3      |            | r       | relation | WEAK_READ,WEAK_WRITE | users              | null
 02a108e3      |            | r       | relation | STRONG_READ          | users              | null
 02a108e3      |            | r       | row      | STRONG_WRITE         | users              | ["1"]
 02a108e3      |            | r       | row      | STRONG_READ          | users              | ["1"]
 02a108e3      |            | i       | relation | WEAK_WRITE           | users_username_idx | null
 02a108e3      |            | i       | keyrange | WEAK_WRITE           | users_username_idx | ["\"alice\""]
 02a108e3      |            | i       | row      | STRONG_WRITE         | users_username_idx | ["\"alice\"", "\"H\\x80\\x00\\x00\\x01!\""]
 5b82a9e3      | ["02a108   | r       | relation | STRONG_READ          | users              | null

Enter fullscreen mode Exit fullscreen mode

As a consequence of it, simply adding the index in YugabyteDB doesn't reduce the wait. If we want to minimize the latency for this use case, it is preferable to define the username as a primary key rather than the generated identifier:

create table users(
  id serial not null unique,
  username varchar not null primary key
);

Enter fullscreen mode Exit fullscreen mode

In PostgreSQL, table rows are stored in heap tables, and all indexes, including the primary key, are considered secondary indexes. However, in YugabyteDB, the rows are stored in the primary key LSM Tree, so defining the primary key for the main access pattern is essential. There are no limitations because all secondary indexes are global and shared on their key.
Another reason the primary key matters in YugabyteDB is to improve serializable locks. In PostgreSQL, even if the scan predicates define the read intents, they are stored in memory. On the other hand, YugabyteDB is distributed and doesn't use a single-node shared memory, so it stores the intents with the table row in the primary key index.

With such a definition, both transactions can be committed without waiting for the other:


COMMIT
Time: 147.917 ms

yugabyte=*# commit;
COMMIT

Enter fullscreen mode Exit fullscreen mode

Here are the locks just before the commits:

 transactionid | blocked_by | relkind | locktype |         mode         |   relname    |    ?column?
---------------+------------+---------+----------+----------------------+--------------+-----------------
 350a076c      |            | r       | relation | WEAK_READ,WEAK_WRITE | users        | null
 350a076c      |            | r       | relation | WEAK_READ            | users        | null
 350a076c      |            | r       | row      | STRONG_WRITE         | users        | ["\"alice\""]
 350a076c      |            | r       | row      | STRONG_READ          | users        | ["\"alice\""]
 350a076c      |            | r       | row      | STRONG_READ          | users        | ["\"alice\""]
 350a076c      |            | i       | relation | WEAK_READ,WEAK_WRITE | users_id_key | null
 350a076c      |            | i       | keyrange | WEAK_READ,WEAK_WRITE | users_id_key | ["1"]
 350a076c      |            | i       | row      | STRONG_WRITE         | users_id_key | ["1", "null"]
 350a076c      |            | i       | row      | STRONG_READ          | users_id_key | ["1", "null"]
 97f5bd76      |            | r       | relation | WEAK_READ,WEAK_WRITE | users        | null
 97f5bd76      |            | r       | relation | WEAK_READ            | users        | null
 97f5bd76      |            | r       | row      | STRONG_WRITE         | users        | ["\"bob\""]
 97f5bd76      |            | r       | row      | STRONG_READ          | users        | ["\"bob\""]
 97f5bd76      |            | r       | row      | STRONG_READ          | users        | ["\"bob\""]
 97f5bd76      |            | i       | relation | WEAK_READ,WEAK_WRITE | users_id_key | null
 97f5bd76      |            | i       | keyrange | WEAK_READ,WEAK_WRITE | users_id_key | ["101"]
 97f5bd76      |            | i       | row      | STRONG_WRITE         | users_id_key | ["101", "null"]
 97f5bd76      |            | i       | row      | STRONG_READ          | users_id_key | ["101", "null"]

Enter fullscreen mode Exit fullscreen mode

This situation does not involve any blocking. The STRONG_READ intents only affect individual rows because each transaction accesses different rows.

The serializable isolation level can be scalable. However, like many other aspects of SQL database performance, its performance depends on the definition of primary keys and secondary indexes. In SQL, "serializable" doesn't mean "serialized". The example above is serializable (with the same effect as if serialized), but the two transactions can run concurrently.

yugabytedb Article's
30 articles in total
Favicon
PostgreSQL plan_cache_mode
Favicon
Index Filtering in PostgreSQL and YugabyteDB (Index Scan instead of Index Only Scan)
Favicon
Unique Index on NULL Values in SQL & NoSQL
Favicon
More details in pg_locks for YugabyteDB
Favicon
Large IntentsDB MemTable with Many Small SST Files
Favicon
Aurora DSQL: How it Compares to YugabyteDB
Favicon
Document data modeling to avoid write skew anomalies
Favicon
When to replace IN() with EXISTS() - correlated and uncorrelated subqueries
Favicon
2024.2: Faster with Shared Memory Between PostgreSQL and TServer Layers
Favicon
Aurora DSQL - Simple Inserts Workload from an AWS CloudShell
Favicon
Amazon Aurora DSQL: Which PostgreSQL Service Should I Use on AWS ?
Favicon
YugabyteDB MVCC and Updates: columns vs. JSON
Favicon
No Gap Ordered Numbering in SQL: A Unique Index to Serialize In Read Committed
Favicon
Starting a YugabyteDB lab cluster with AWS CLI
Favicon
Speeding Up Foreign Key Constraints During Migrations
Favicon
Indexing for a Scalable Serialization Isolation Level
Favicon
The Doctor's On-Call Shift example and a Normalized Relational Schema to Avoid Write Skew
Favicon
You Probably Don't Need Serializable Isolation
Favicon
A brief example of an SQL serializable transaction
Favicon
YugabyteDB as a Graph database with PuppyGraph
Favicon
Native GLIBC instead of Linuxbrew since 2.21
Favicon
pgSphere and Q3C on Distributed SQL
Favicon
IN() Index Scan in PostgreSQL 17 and YugabyteDB LSM Tree
Favicon
Frequent Re-Connections improved by Connection Manager
Favicon
Maintaining Throughput With Less Physical Connections
Favicon
YugabyteDB Connection Manager: a Database Resident Connection Pool with Shared Processes
Favicon
ERROR: index row size 3056 exceeds btree version 4 maximum 2704 for index
Favicon
Write Buffering to Reduce Raft Consensus Latency in YugabyteDB
Favicon
Asynch replication for Disaster Recovery, Read Replicas, and Change Data Capture
Favicon
Fast PITR and MVCC reads with Key-Value LSM Tree

Featured ones: