Postgres partitioning performance: Hash vs. List

Published: 5/1/2024
Categories: postgres, performance, hash, index
Author: msdousti

In our design, we came up with a PostgreSQL table that is expected to grow large: it stores only the IDs of incoming (Kafka) events, for the purpose of de-duplication. The IDs are of type UUID.

create table t(id uuid primary key);
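For context, the de-duplication check itself can be a plain insert that ignores conflicts. This is a minimal sketch of the intended usage (the literal UUID is just an example), not code from our actual consumer:

insert into t (id)
values ('11111111-2222-3333-4444-555555555555')
on conflict do nothing;
-- "INSERT 0 0" means the event ID was already seen; "INSERT 0 1" means it was new.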

After some consideration, we decided to partition this table into 16 partitions.

Before we continue, let's make it clear that this article is about a very specific case and workload. Still, it should give you enough insight to adapt the approach to your own needs.

Hash partitioning

The initial idea was to use hash partitioning.

drop table if exists t;

create table t(id uuid primary key)
    partition by hash(id);

create table t_00
    partition of t 
    for values with (modulus 16, remainder 0);

create table t_01
    partition of t 
    for values with (modulus 16, remainder 1);

create table t_02
    partition of t 
    for values with (modulus 16, remainder 2);

create table t_03
    partition of t 
    for values with (modulus 16, remainder 3);

create table t_04
    partition of t 
    for values with (modulus 16, remainder 4);

create table t_05
    partition of t 
    for values with (modulus 16, remainder 5);

create table t_06
    partition of t 
    for values with (modulus 16, remainder 6);

create table t_07
    partition of t 
    for values with (modulus 16, remainder 7);

create table t_08
    partition of t 
    for values with (modulus 16, remainder 8);

create table t_09
    partition of t 
    for values with (modulus 16, remainder 9);

create table t_10
    partition of t 
    for values with (modulus 16, remainder 10);

create table t_11
    partition of t 
    for values with (modulus 16, remainder 11);

create table t_12
    partition of t 
    for values with (modulus 16, remainder 12);

create table t_13
    partition of t 
    for values with (modulus 16, remainder 13);

create table t_14
    partition of t 
    for values with (modulus 16, remainder 14);

create table t_15
    partition of t 
    for values with (modulus 16, remainder 15);
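As an aside, the 16 nearly identical statements above don't have to be written by hand. Here is a minimal sketch that generates them with psql's \gexec (shown only as a convenience; it's not how we actually created them):

select format(
    'create table t_%s partition of t for values with (modulus 16, remainder %s)',
    lpad(i::text, 2, '0'), i)
from generate_series(0, 15) as i
\gexec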

A competing idea was to ditch hashing and use the first character of the id's text representation as the partitioning key. I'll discuss that idea in the next section, but let's first benchmark the hash partitioning approach:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

I chose to run 10 client connections, each sending 900 transactions to the DB, and requested 30 worker threads; as the output below shows, pgbench caps the thread count at the number of clients, so 10 threads were actually used. Each transaction inserts 1,000 UUIDs into the table.

Why these numbers? Just for fun. To be indicative of anything, they would have to conform to the real traffic. But let's just see how this turns out on my personal laptop (old, 2017 model!):

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 13.776 ms
initial connection time = 6.491 ms
tps = 725.901931 (without initial connection time)

real    0m12.438s
user    0m0.162s
sys     0m0.396s

It took 12.4 seconds to insert 9,000,000 rows. The average rate was 725.9 TPS (transactions per second).
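As a quick sanity check (not part of the original benchmark output), the row count should equal 10 clients Γ— 900 transactions Γ— 1,000 rows per transaction:

select count(*) from t;
-- expected result: 9000000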

Using psql metacommands, we can see the table/index sizes:

  • Using \dt+ to see table sizes (some columns are removed for brevity):
β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name β”‚       Type        β”‚  Size   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ t    β”‚ partitioned table β”‚ 0 bytes β”‚
β”‚ t_00 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_01 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_02 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_03 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_04 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_05 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_06 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_07 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_08 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_09 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_10 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_11 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_12 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_13 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_14 β”‚ table             β”‚ 24 MB   β”‚
β”‚ t_15 β”‚ table             β”‚ 24 MB   β”‚
β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • Using \di+ to see index sizes (some columns are removed for brevity):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Name    β”‚       Type        β”‚  Size   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ t_pkey    β”‚ partitioned index β”‚ 0 bytes β”‚
β”‚ t_00_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_01_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_02_pkey β”‚ index             β”‚ 22 MB   β”‚
β”‚ t_03_pkey β”‚ index             β”‚ 20 MB   β”‚
β”‚ t_04_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_05_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_06_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_07_pkey β”‚ index             β”‚ 20 MB   β”‚
β”‚ t_08_pkey β”‚ index             β”‚ 20 MB   β”‚
β”‚ t_09_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_10_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_11_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_12_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_13_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_14_pkey β”‚ index             β”‚ 21 MB   β”‚
β”‚ t_15_pkey β”‚ index             β”‚ 21 MB   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Notice that the indexes are almost as large as the tables themselves. Also, while the data is distributed evenly among the partitions (24 MB each), the index sizes range from 20 to 22 MB. The total index size is 334 MB.
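Instead of adding up the \di+ output by hand, the total can also be computed; a hypothetical helper query (not from the original post) using pg_partition_tree:

select pg_size_pretty(sum(pg_indexes_size(relid))) as total_index_size
from pg_partition_tree('t');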

List partitioning

If we want to use the first character of the id's text representation as the partitioning key, the primary key cannot be declared on the parent table, because the partition key is an expression:

create table t(id uuid primary key)
partition by list (lower(left(id::text, 1)));

This results in an error:

ERROR:  unsupported PRIMARY KEY constraint with partition key definition
DETAIL:  PRIMARY KEY constraints cannot be used when partition keys include expressions.

So, we decided to add the primary key to each individual partition instead. Since identical IDs always map to the same partition, this still effectively enforces uniqueness across all the data:

drop table if exists t;

create table t(id uuid not null)
partition by list (lower(left(id::text, 1)));

create table t_00
partition of t
(id primary key)
for values in ('0');

create table t_01
partition of t
(id primary key) 
for values in ('1');

create table t_02
partition of t
(id primary key) 
for values in ('2');

create table t_03
partition of t
(id primary key) 
for values in ('3');

create table t_04
partition of t
(id primary key) 
for values in ('4');

create table t_05
partition of t
(id primary key) 
for values in ('5');

create table t_06
partition of t
(id primary key) 
for values in ('6');

create table t_07
partition of t
(id primary key) 
for values in ('7');

create table t_08
partition of t
(id primary key) 
for values in ('8');

create table t_09
partition of t
(id primary key) 
for values in ('9');

create table t_10
partition of t
(id primary key) 
for values in ('a');

create table t_11
partition of t
(id primary key) 
for values in ('b');

create table t_12
partition of t
(id primary key) 
for values in ('c');

create table t_13
partition of t
(id primary key) 
for values in ('d');

create table t_14
partition of t
(id primary key) 
for values in ('e');

create table t_15
partition of t
(id primary key) 
for values in ('f');
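As with the hash partitions, these 16 statements can be generated rather than typed out; here is a sketch using a DO block this time (again just a convenience alternative, not how we actually created them):

do $$
begin
    for i in 0..15 loop
        execute format(
            'create table t_%s partition of t (id primary key) for values in (%L)',
            lpad(i::text, 2, '0'),
            substr('0123456789abcdef', i + 1, 1));
    end loop;
end $$;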

Now let's benchmark again:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

result:

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 15.123 ms
initial connection time = 8.810 ms
tps = 661.264382 (without initial connection time)

real    0m13.654s
user    0m0.150s
sys     0m0.409s

This is slower than hash partitioning:

  • Duration: 13.654s instead of 12.438s
  • TPS: 661.264382 instead of 725.901931

So, we lose the parent-level primary key, and it's even slower! Hash partitioning is a clear winner here.

Running \dt+ and \di+ gives almost the same results as before, so I won't repeat them here.

Using hash indexes instead of b-tree indexes

Another suggested approach was to enforce uniqueness using hash indexes instead of b-tree indexes. The benefit is that hash indexes are often smaller, and can be faster, when equality checking is the only operation that matters.

Postgres primary keys do not yet support hash indexes, but we can use a hack: an exclusion constraint backed by a hash index, with equality as the operator:

drop table if exists t;

create table t(id uuid not null)
partition by list (left(id::text, 1));

create table t_00
partition of t
(exclude using hash(id with =))
for values in ('0');

create table t_01
partition of t
(exclude using hash(id with =)) 
for values in ('1');

create table t_02
partition of t
(exclude using hash(id with =)) 
for values in ('2');

create table t_03
partition of t
(exclude using hash(id with =)) 
for values in ('3');

create table t_04
partition of t
(exclude using hash(id with =)) 
for values in ('4');

create table t_05
partition of t
(exclude using hash(id with =)) 
for values in ('5');

create table t_06
partition of t
(exclude using hash(id with =)) 
for values in ('6');

create table t_07
partition of t
(exclude using hash(id with =)) 
for values in ('7');

create table t_08
partition of t
(exclude using hash(id with =)) 
for values in ('8');

create table t_09
partition of t
(exclude using hash(id with =)) 
for values in ('9');

create table t_10
partition of t
(exclude using hash(id with =)) 
for values in ('a');

create table t_11
partition of t
(exclude using hash(id with =)) 
for values in ('b');

create table t_12
partition of t
(exclude using hash(id with =)) 
for values in ('c');

create table t_13
partition of t
(exclude using hash(id with =)) 
for values in ('d');

create table t_14
partition of t
(exclude using hash(id with =)) 
for values in ('e');

create table t_15
partition of t
(exclude using hash(id with =)) 
for values in ('f');
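Even without a primary key, duplicates are still rejected; a quick hypothetical check (not part of the original benchmark):

insert into t values ('00000000-0000-0000-0000-000000000000');
insert into t values ('00000000-0000-0000-0000-000000000000');
-- ERROR:  conflicting key value violates exclusion constraint "t_00_id_excl"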

Let's benchmark this as well:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

result:

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 16.686 ms
initial connection time = 7.089 ms
tps = 599.314265 (without initial connection time)

real    0m15.067s
user    0m0.127s
sys     0m0.468s

Well, I didn't expect that. It's even slower now. Looking at the table sizes (\dt+), they are the same as before (24 MB each).

However, index sizes (\di+) are a tiny bit smaller:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Name     β”‚ Size  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€€
β”‚ t_00_id_excl β”‚ 20 MB β”‚
β”‚ t_01_id_excl β”‚ 20 MB β”‚
β”‚ t_02_id_excl β”‚ 20 MB β”‚
β”‚ t_03_id_excl β”‚ 20 MB β”‚
β”‚ t_04_id_excl β”‚ 20 MB β”‚
β”‚ t_05_id_excl β”‚ 20 MB β”‚
β”‚ t_06_id_excl β”‚ 20 MB β”‚
β”‚ t_07_id_excl β”‚ 20 MB β”‚
β”‚ t_08_id_excl β”‚ 20 MB β”‚
β”‚ t_09_id_excl β”‚ 20 MB β”‚
β”‚ t_10_id_excl β”‚ 20 MB β”‚
β”‚ t_11_id_excl β”‚ 20 MB β”‚
β”‚ t_12_id_excl β”‚ 20 MB β”‚
β”‚ t_13_id_excl β”‚ 20 MB β”‚
β”‚ t_14_id_excl β”‚ 20 MB β”‚
β”‚ t_15_id_excl β”‚ 20 MB β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

So, in total, the index size was reduced from 334 MB to 320 MB.

Summary

  • Hash partitioning outperforms list partitioning in the above example (pay attention: not always)
  • Hash partitioning has the added benefit that all tables have primary keys (again, specific to the above example). This is important when using logical replication. For instance, to use AWS RDS blue/green deployment:

    Make sure that all tables in the DB instance have a primary key. PostgreSQL logical replication doesn't allow UPDATE or DELETE operations on tables that don't have a primary key.

  • Using hash indexes instead of b-tree was not a performance boost, but reduced index size by less than 5%.

Edits

Edit (2024-05-03)

A colleague of mine, who asked to remain anonymous, explained why list partitioning was slower:

In your example, to compute the partition key of the list-based approach, you use a cast (cast UUID to text), then two functions are applied (LEFT and LOWER). The functions should be pretty quick, but the cast is slow. That’s why the combined effect is slower than the hash function, which is implemented in C and is quite fast.
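One way to sanity-check this explanation is to time the two computations in isolation. A rough sketch, with the caveat that both queries also pay for gen_random_uuid() itself, and that uuid_hash is only the 32-bit sibling of the support function hash partitioning actually uses:

\timing on

-- the list-partitioning key: a cast plus two function calls
select count(*) from generate_series(1, 1000000)
where lower(left(gen_random_uuid()::text, 1)) is not null;

-- the built-in hash function for UUIDs, implemented in C
select count(*) from generate_series(1, 1000000)
where uuid_hash(gen_random_uuid()) is not null;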

Another colleague, Tim, gave a nice summary:

So, if I got it right, in essence it says β€œDon’t try to be fancy, just do it in a boring way and PostgreSQL will deal with it in an optimized way.”

Which reminds me of a story: we once implemented a variant of the strlen() function and observed it was slower than the glibc library function by a factor of 300! I should write a post on that too πŸ™‚

Further reading

For more info on hash indexes, take a look at this Twitter thread and the links therein.
