dev-resources.site
for different kinds of informations.
Speeding Up Foreign Key Constraints During Migrations
When migrating data from one database to another, such as from Oracle to PostgreSQL, Aurora, or YugabyteDB, foreign key (FK) constraints can often create challenges, especially when dealing with large datasets.
In theory, rechecking data that has already been validated in the source database is unnecessary when transferring it to the target database. However, disabling constraints could lead to potential data corruption during data manipulation language (DML) operations other than those run by the replication. Additionally, checking referential integrity is a safeguard in case the replication tool misses some changes.
In this post, we will explore ways to minimize the constraint-checking overhead. We will keep the constraints enabled but avoid checking them in the critical path of the migration. Instead, we will check them asynchronously to raise an alert in case of a violation.
I create two tables, with ten million rows in "demo1" referencing "demo0".
postgres=# create table demo0 ( id bigint primary key );
CREATE TABLE
postgres=# insert into demo0 select generate_series(0,999);
INSERT 0 1000
postgres=# create table demo1
( id bigserial, demo0 bigint references demo0 )
;
CREATE TABLE
postgres-# explain (analyze)
insert into demo1(demo0)
select generate_series(1,10000000)%1000
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Insert on demo1 (cost=0.00..250000.02 rows=0 width=0) (actual time=51194.098..51194.099 rows=0 loops=1)
-> Subquery Scan on "*SELECT*" (cost=0.00..250000.02 rows=10000000 width=16) (actual time=2.712..8154.539 rows=10000000 loops=1)
-> Result (cost=0.00..200000.02 rows=10000000 width=4) (actual time=2.654..1663.596 rows=10000000 loops=1)
-> ProjectSet (cost=0.00..50000.02 rows=10000000 width=4) (actual time=2.651..750.523 rows=10000000 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=2.636..2.637 rows=1 loops=1)
Planning Time: 0.041 ms
Trigger for constraint demo1_demo0_fkey: time=54717.115 calls=10000000
JIT:
Functions: 5
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.248 ms (Deform 0.000 ms), Inlining 0.000 ms, Optimization 0.097 ms, Emission 2.355 ms, Total 2.699 ms
Execution Time: 106432.418 ms
(12 rows)
The insertion into "demo1" took 51 seconds, while the ten million foreign key checks took 54 seconds. This row-by-row execution is slow.
To simulate a migration, I create a new table called "demo2," which is similar to "demo1," and then copy the data into it.
Since I am confident that the referential integrity has been verified in the original table, I do not need to recheck it when inserting data into the replica. I use session_replication_role
to define that: the default is original
, and I can set it to replica
.
postgres=# create table demo2
( id bigserial, demo0 bigint references demo0 )
;
CREATE TABLE
postgres=# begin;
BEGIN
postgres=*# set local session_replication_role=replica;
SET
postgres=*# explain (analyze)
insert into demo2 select * from demo1
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Insert on demo2 (cost=0.00..154056.75 rows=0 width=0) (actual time=36974.258..36974.259 rows=0 loops=1)
-> Seq Scan on demo1 (cost=0.00..154056.75 rows=10000175 width=16) (actual time=0.040..850.316 rows=10000000 loops=1)
Planning Time: 0.279 ms
Execution Time: 36974.275 ms
(4 rows)
postgres=*# set local session_replication_role=origin;
SET
postgres=*# end;
COMMIT
The referential integrity trigger is not executed for rows inserted with the replication role set for the session. I placed it in a transaction to ensure it returns to its default state.
Validating referential integrity is still important, as it can help identify any issues with replication. While I could drop and recreate the constraint, I prefer to run a simple SELECT query to avoid the exclusive lock that PostgreSQL must acquire for DDL.
I run the following Anti-Join to list the violations, and all is good if there's no row returned:
postgres=# select * from demo2 left join demo0
on(demo2.demo0=demo0.id)
where demo0.id is null
;
id | demo0 | id
----+-------+----
(0 rows)
You can use either the LEFT JOIN, NOT IN, or NOT EXISTS clause, but it is crucial to examine the execution plan. Typically, you would want to utilize a hash anti-join, with the referenced table serving as the hash table, which can be executed in a parallel query.
postgres=# explain
select * from demo2 left join demo0
on(demo2.demo0=demo0.id)
where demo0.id is null
;
QUERY PLAN
-------------------------------------------------------------------------------------
Gather (cost=1027.50..112373.46 rows=1 width=24)
Workers Planned: 2
-> Hash Anti Join (cost=27.50..111373.36 rows=1 width=24)
Hash Cond: (demo2.demo0 = demo0.id)
-> Parallel Seq Scan on demo2 (cost=0.00..95721.08 rows=4166608 width=16)
-> Hash (cost=15.00..15.00 rows=1000 width=8)
-> Seq Scan on demo0 (cost=0.00..15.00 rows=1000 width=8)
JIT:
Functions: 10
Options: Inlining false, Optimization false, Expressions true, Deforming true
(10 rows)
Key Takeaways
The default foreign key checking may not be optimal for bulk inserts or row-by-row replication. It is usually better to rely on pre-validated data or use a dedicated query for bulk checking. The method outlined here combines both approaches, allowing you to avoid running Data Definition Language (DDL) commands to disable or validate foreign keys. The parameter session_replication_role serves this purpose by preventing triggers, including internal triggers for referential integrity, from firing on a database when replicating data from another database where those triggers have already been executed.
Featured ones: