dev-resources.site
for different kinds of informations.
IN() Index Scan in PostgreSQL 17 and YugabyteDB LSM Tree
TL;DR: This optimization in PG17 was made for B-tree indexes. YugabyteDB uses LSM Tree indexes and has already implemented this to optimize distributed joins.
When querying with an IN()
list or Scalar Array Operation Expression (SAOP), PostgreSQL, until PG16, performed one Index Scan for each value.
Demonstration
For each example, I created a table with an index on 'value' that includes 'id'. I then inserted rows, performed a vacuum, analyzed the table, and checked the number of scans from 'pg_stat_user_tables'.
postgres=# create table demo ( id bigserial primary key, value int );
CREATE TABLE
postgres=# create index on demo ( value asc , id asc);
CREATE INDEX
postgres=# insert into demo ( value ) select generate_series(1,1000000);
INSERT 0 1000000
postgres=# vacuum analyze demo;
VACUUM
postgres=# \! sleep 5
postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
from pg_stat_user_tables where relid='demo'::regclass
;
seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
2 | 0 | 0 | 0
(1 row)
I checked the table statistics and found no index scans. The sequential scans are a result of the table and index creation.
PostgreSQL 16
With this table created, I query for a list of 42 values.
postgres=# explain (analyze, buffers) select id
from demo
where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67, 151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using demo_value_id_idx on demo (cost=0.42..186.58 rows=42 width=8) (actual time=0.015..0.063 rows=42 loops=1)
Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,61,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=127
Planning:
Buffers: shared hit=22
Planning Time: 0.218 ms
Execution Time: 0.075 ms
This query has read 127 buffers to retrieve 42 rows, averaging three buffers per row. It's clear that it has traversed through the B-Tree levels (one root, one branch, one leaf) for each value. Even if this operation is carried out from shared memory, it could pose scalability issues because multiple statements running this, even for different values, will have to pin the same root and branches.
Looking at the table statistics, it is evident that this Index Scan was executed 42 times, even if that's not visible in the execution plan (it shows only one loop - there's no visible INLIST ITERATOR like Oracle).
postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
from pg_stat_user_tables where relid='demo'::regclass
;
seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
2 | 0 | 42 | 0
This has been improved in PostgreSQL 17.
PostgreSQL 17
I've run the same example with PostgreSQL 17
postgres=# explain (analyze, buffers) select id
from demo
where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67, 151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using demo_value_id_idx on demo (cost=0.42..186.58 rows=42 width=8) (actual time=0.025..0.029 rows=42 loops=1)
Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,61,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=4
Planning:
Buffers: shared hit=20 read=1
Planning Time: 0.155 ms
Execution Time: 0.042 ms
postgres=# select seq_scan, seq_tup_read, idx_scan, idx_tup_fetch from pg_stat_user_tables where relid='demo'::regclass;
seq_scan | seq_tup_read | idx_scan | idx_tup_fetch
----------+--------------+----------+---------------
2 | 0 | 1 | 0
The table statistics clearly show that only one Index Scan was performed to retrieve the 42 values. The number of buffers read is small, just four blocks. This probably includes one root, one branch, and two leaf blocks.
My query resulted in only a few leaf blocks because I was searching for values smaller than 200 within a range of one million. If I were to query with values scattered across the index range, the number of buffers would be higher:
postgres=# explain (analyze, buffers) select id
from demo
where value in (100083, 100031, 100011, 100019, 100137, 100149, 100079, 100167, 1003, 100047, 100059, 1005, 100113, 100071, 100053, 100163, 100109, 100101, 1002, 100107, 100017, 100179, 100131, 10043, 10023, 10067, 100151, 10097, 100139, 10061, 10073, 100157, 10029, 10089, 10013, 100127, 10041, 1007, 100103, 100173, 100181, 10037)
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using demo_value_id_idx on demo (cost=0.42..186.58 rows=42 width=8) (actual time=0.026..0.052 rows=42 loops=1)
Index Cond: (value = ANY ('{100083,100031,100011,100019,100137,100149,100079,100167,1003,100047,100059,1005,100113,100071,100053,100163,100109,100101,1002,100107,100017,100179,100131,10043,10023,10067,100151,10097,100139,10061,10073,100157,10029,10089,10013,100127,10041,1007,100103,100173,100181,10037}'::integer[]))
Heap Fetches: 0
Buffers: shared hit=10
Planning Time: 0.081 ms
Execution Time: 0.066 ms
In any case, the value will always be less than or equal to the number of values in the IN() list and does not have to be read from the B-Tree root again.
YugabyteDB
I am using YugabyteDB (version 2024.1.2) and created the same table. There's no need to VACUUM, as the MVCC garbage collection occurs transparently in the distributed storage. YugabyteDB shows no buffers
for distributed tables because it bypasses the monolithic shared buffers to scale horizontally. However, the dist
option indicates the number of distributed read requests, and the debug
option shows additional internal metrics when reading from the LSM Tree on each node.
yugabyte=# create table demo ( id bigserial primary key, value int );
CREATE TABLE
yugabyte=# create index on demo ( value asc , id asc);
CREATE INDEX
yugabyte=# insert into demo ( value ) select generate_series(1,1000000);
INSERT 0 1000000
yugabyte=# analyze demo;
ANALYZE
yugabyte=# explain (analyze, buffers, dist, debug, summary off) select id
from demo
where value in (83, 31, 11, 19, 137, 149, 79, 167, 3, 47, 59, 5, 113, 71, 53, 163, 109, 101, 2, 107, 17, 179, 131, 43, 23, 67,
151, 97, 139, 61, 73, 157, 29, 89, 13, 127, 41, 7, 103, 173, 181, 37)
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------
Index Only Scan using demo_value_id_idx on demo (cost=180.00..983.22 rows=42 width=8) (actual time=0.735..0.749 rows=42 loops=
1)
Index Cond: (value = ANY ('{83,31,11,19,137,149,79,167,3,47,59,5,113,71,53,163,109,101,2,107,17,179,131,43,23,67,151,97,139,6
1,73,157,29,89,13,127,41,7,103,173,181,37}'::integer[]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.605 ms
Storage Index Rows Scanned: 42
Metric rocksdb_block_cache_hit: 3.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 32751.000
Metric rocksdb_number_db_seek: 28.000
Metric rocksdb_number_db_next: 109.000
Metric rocksdb_number_db_seek_found: 28.000
Metric rocksdb_number_db_next_found: 109.000
Metric rocksdb_iter_bytes_read: 9641.000
Metric rocksdb_block_cache_single_touch_hit: 1.000
Metric rocksdb_block_cache_single_touch_bytes_read: 32699.000
Metric rocksdb_block_cache_multi_touch_hit: 2.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 52.000
Metric docdb_keys_found: 42.000
Metric ql_read_latency: sum: 148.000, count: 1.000
Estimated Seeks: 43
Estimated Nexts: 86
Estimated Docdb Result Width: 9
One "Index Read Request," indicates that there is only one Index Scan to read the 42 values. In the LSM Tree, which is implemented based on RocksDB, seeking to a key is accounted for as "rocksdb_number_db_seek" and reading the following key is accounted for as "rocksdb_number_db_next."
In YugabyteDB, "seek" is equivalent to "buffers" in PostgreSQL. They are the most important metrics for evaluating the cost of a query. They represent a random read to access a set of rows stored together.
If the 42 values were scattered throughout the LSM Tree, we would have expected to see 42 seek operations. However, some of the values are close enough to be read with faster next operations. In this case, we observed 28 instances of rocksdb_number_db_seek
and 109 instances of rocksdb_number_db_next
.
With more scattered values, it shows a little more seek
and less next
:
yugabyte=# explain (analyze, buffers, dist, debug, summary off) select id
from demo
where value in (100083, 100031, 100011, 100019, 100137, 100149, 100079, 100167, 1003, 100047, 100059, 1005, 100113, 100071, 100
053, 100163, 100109, 100101, 1002, 100107, 100017, 100179, 100131, 10043, 10023, 10067, 100151, 10097, 100139, 10061, 10073, 100157, 10029, 10089, 10013, 100127, 10041, 1007, 100103, 100173, 100181, 10037)
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using demo_value_id_idx on demo (cost=180.00..983.22 rows=42 width=8) (actual time=0.748..0.761 rows=42 loops=1)
Index Cond: (value = ANY ('{100083,100031,100011,100019,100137,100149,100079,100167,1003,100047,100059,1005,100113,100071,100053,100163,100109,100101,1002,100107,100017,100179,100131,10043,10023,10067,100151,10097,100139,10061,10073,100157,10029,10089,10013,100127,10041,1007,100103,100173,100181,10037}'::integer[]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.593 ms
Storage Index Rows Scanned: 42
Metric rocksdb_block_cache_hit: 5.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 3.000
Metric rocksdb_block_cache_bytes_read: 98236.000
Metric rocksdb_number_db_seek: 32.000
Metric rocksdb_number_db_next: 113.000
Metric rocksdb_number_db_seek_found: 32.000
Metric rocksdb_number_db_next_found: 113.000
Metric rocksdb_iter_bytes_read: 10292.000
Metric rocksdb_block_cache_multi_touch_hit: 5.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 98236.000
Metric docdb_keys_found: 42.000
Metric ql_read_latency: sum: 167.000, count: 1.000
Estimated Seeks: 43
Estimated Nexts: 86
Estimated Docdb Result Width: 9
(22 rows)
YugabyteDB implemented this optimization early because the overhead of performing multiple index scans is high in distributed databases, where each scan can result in a network call.
YugabyteDB uses this optimization to minimize the number of iterations in a Nested Loop join.
Batched Nested Loop
In PostgreSQL 17, the scalar array optimization is used with an IN() list but not with a join. For example, I have defined my list of values in a WITH clause:
postgres=# explain (analyze, buffers, summary off)
with my_values(value) as (
values (100083), (100031), (100011), (100019), (100137), (100149), (100079), (100167), (1003), (100047), (100059), (1005), (100113), (100071), (100053), (100163), (100109), (100101), (1002), (100107), (100017), (100179), (100131), (10043), (10023), (10067), (100151), (10097), (100139), (10061), (10073), (100157), (10029), (10089), (10013), (100127), (10041), (1007), (100103), (100173), (100181), (10037)
)
select id from demo where value in ( select value from my_values );
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=1.05..188.06 rows=42 width=8) (actual time=0.038..0.114 rows=42 loops=1)
Buffers: shared hit=127
-> HashAggregate (cost=0.63..1.05 rows=42 width=4) (actual time=0.019..0.024 rows=42 loops=1)
Group Key: "*VALUES*".column1
Batches: 1 Memory Usage: 24kB
-> Values Scan on "*VALUES*" (cost=0.00..0.53 rows=42 width=4) (actual time=0.001..0.008 rows=42 loops=1)
-> Index Only Scan using demo_value_id_idx on demo (cost=0.42..4.44 rows=1 width=12) (actual time=0.002..0.002 rows=1 loops=42)
Index Cond: (value = "*VALUES*".column1)
Heap Fetches: 0
Buffers: shared hit=127
PostgreSQL executes one nested loop per value, and the 42 Index Scan is identifiable with 'loops=42'. The number of buffers accessed is equivalent to that of PostgreSQL 16.
YugabyteDB optimizes this by batching the values from the outer table and pushing down an array for the inner table nested loop join.
yugabyte=# explain (analyze, buffers, dist, debug, summary off)
with my_values(value) as (
values (100083), (100031), (100011), (100019), (100137), (100149), (100079), (100167), (1003), (100047), (100059), (1005), (100113), (100071), (100053), (100163), (100109), (100101), (1002), (100107), (100017), (100179), (100131), (10043), (10023), (10067), (100151), (10097), (100139), (10061), (10073), (100157), (10029), (10089), (10013), (100127), (10041), (1007), (100103), (100173), (100181), (10037)
)
select id from demo where value in ( select value from my_values );
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (cost=181.47..999.06 rows=42 width=8) (actual time=0.820..0.860 rows=42 loops=1)
Join Filter: (demo.value = my_values.value)
CTE my_values
-> Values Scan on "*VALUES*" (cost=0.00..0.53 rows=42 width=4) (actual time=0.001..0.008 rows=42 loops=1)
-> HashAggregate (cost=0.94..1.36 rows=42 width=4) (actual time=0.039..0.043 rows=42 loops=1)
Group Key: my_values.value
-> CTE Scan on my_values (cost=0.00..0.84 rows=42 width=4) (actual time=0.004..0.023 rows=42 loops=1)
-> Index Only Scan using demo_value_id_idx on demo (cost=180.00..997.14 rows=42 width=12) (actual time=0.701..0.715 rows=42 loops=1)
Index Cond: (value = ANY (ARRAY[my_values.value, $2, $3, ..., $1024]))
Heap Fetches: 0
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.568 ms
Storage Index Rows Scanned: 42
Metric rocksdb_block_cache_hit: 5.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 3.000
Metric rocksdb_block_cache_bytes_read: 98236.000
Metric rocksdb_number_db_seek: 32.000
Metric rocksdb_number_db_next: 113.000
Metric rocksdb_number_db_seek_found: 32.000
Metric rocksdb_number_db_next_found: 113.000
Metric rocksdb_iter_bytes_read: 10292.000
Metric rocksdb_block_cache_multi_touch_hit: 5.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 98236.000
Metric docdb_keys_found: 42.000
Metric ql_read_latency: sum: 160.000, count: 1.000
Estimated Seeks: 43
Estimated Nexts: 86
Estimated Docdb Result Width: 14
When using YugabyteDB, even if the list of values comes from a table instead of an array, it only performs one read request to the inner table to retrieve all values. This can be seen in the execution plan with the 'YB Batched Nested Loop Join', which batches the request. I used a Common Table Expression for simplicity, but it works with any table to optimize IN(select) or joins.
Featured ones: