Speeding Up Foreign Key Constraints During Migrations

Published: 11/3/2024
Author: franckpachot
Categories: postgres, database, sql, yugabytedb

When migrating data from one database to another, such as from Oracle to PostgreSQL, Aurora, or YugabyteDB, foreign key (FK) constraints can often create challenges, especially when dealing with large datasets.

In theory, rechecking data that has already been validated in the source database is unnecessary when transferring it to the target database. However, disabling the constraints entirely could allow data corruption from data manipulation language (DML) operations other than those run by the replication. Checking referential integrity is also a safeguard in case the replication tool misses some changes.

In this post, we will explore ways to minimize the constraint-checking overhead. We will keep the constraints enabled but avoid checking them in the critical path of the migration. Instead, we will check them asynchronously to raise an alert in case of a violation.

I create two tables, with ten million rows in "demo1" referencing "demo0".

postgres=# create table demo0 ( id bigint primary key );
CREATE TABLE

postgres=# insert into  demo0 select generate_series(0,999);
INSERT 0 1000

postgres=# create table demo1
           ( id bigserial, demo0 bigint references demo0 )
;
CREATE TABLE

postgres=# explain (analyze)
           insert into demo1(demo0)
           select generate_series(1,10000000)%1000
;

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Insert on demo1  (cost=0.00..250000.02 rows=0 width=0) (actual time=51194.098..51194.099 rows=0 loops=1)
   ->  Subquery Scan on "*SELECT*"  (cost=0.00..250000.02 rows=10000000 width=16) (actual time=2.712..8154.539 rows=10000000 loops=1)
         ->  Result  (cost=0.00..200000.02 rows=10000000 width=4) (actual time=2.654..1663.596 rows=10000000 loops=1)
               ->  ProjectSet  (cost=0.00..50000.02 rows=10000000 width=4) (actual time=2.651..750.523 rows=10000000 loops=1)
                     ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=2.636..2.637 rows=1 loops=1)
 Planning Time: 0.041 ms
 Trigger for constraint demo1_demo0_fkey: time=54717.115 calls=10000000
 JIT:
   Functions: 5
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.248 ms (Deform 0.000 ms), Inlining 0.000 ms, Optimization 0.097 ms, Emission 2.355 ms, Total 2.699 ms
 Execution Time: 106432.418 ms
(12 rows)

The insertion into "demo1" took 51 seconds, and the trigger performing the ten million foreign key checks took another 54 seconds. This row-by-row checking is slow.

To simulate a migration, I create a new table called "demo2", similar to "demo1", and copy the data into it.
Since I am confident that referential integrity has already been verified in the original table, I do not need to recheck it when inserting data into the replica. The session_replication_role parameter controls this: the default is origin, and I can set it to replica.

postgres=# create table demo2
           ( id bigserial, demo0 bigint references demo0 )
;
CREATE TABLE

postgres=# begin;
BEGIN

postgres=*# set local session_replication_role=replica;
SET

postgres=*# explain (analyze)
            insert into demo2 select * from demo1
;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Insert on demo2  (cost=0.00..154056.75 rows=0 width=0) (actual time=36974.258..36974.259 rows=0 loops=1)
   ->  Seq Scan on demo1  (cost=0.00..154056.75 rows=10000175 width=16) (actual time=0.040..850.316 rows=10000000 loops=1)
 Planning Time: 0.279 ms
 Execution Time: 36974.275 ms
(4 rows)

postgres=*# set local session_replication_role=origin;
SET
postgres=*# end;
COMMIT

The referential integrity trigger is not executed for rows inserted while the session's replication role is set to replica. I placed the SET LOCAL inside a transaction to ensure the setting returns to its default state at commit.
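
To double-check the scoping (a minimal sketch, independent of the migration itself): SET LOCAL only lasts until the end of the transaction, so the dangerous setting cannot leak into later statements even if the explicit reset is forgotten.

postgres=# begin;
BEGIN

postgres=*# set local session_replication_role=replica;
SET

postgres=*# show session_replication_role;
 session_replication_role
---------------------------
 replica
(1 row)

postgres=*# commit;
COMMIT

postgres=# show session_replication_role;
 session_replication_role
---------------------------
 origin
(1 row)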

Validating referential integrity is still important, as it can help identify any issues with replication. While I could drop and recreate the constraint, I prefer to run a simple SELECT query to avoid the exclusive lock that PostgreSQL must acquire for DDL.
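
For reference, the DDL alternative would look like the following sketch. The constraint name demo2_demo0_fkey is an assumption based on PostgreSQL's default naming; NOT VALID skips checking the existing rows at ADD CONSTRAINT time, and VALIDATE CONSTRAINT checks them later while allowing concurrent DML, but DROP CONSTRAINT still takes an ACCESS EXCLUSIVE lock, which is what the SELECT approach avoids.

-- assumes the default constraint name generated for demo2
alter table demo2 drop constraint demo2_demo0_fkey;

-- re-create the constraint without checking the existing rows
alter table demo2 add constraint demo2_demo0_fkey
  foreign key (demo0) references demo0 not valid;

-- check the existing rows later; concurrent DML remains possible
alter table demo2 validate constraint demo2_demo0_fkey;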

I run the following anti-join to list the violations; all is good if no row is returned:

postgres=# select * from demo2 left join demo0 
            on(demo2.demo0=demo0.id)
            where demo0.id is null
;

 id | demo0 | id
----+-------+----
(0 rows)

You can use a LEFT JOIN, NOT IN, or NOT EXISTS clause, but it is crucial to examine the execution plan. Typically, you want a hash anti-join with the referenced table serving as the hash table, which can be executed as a parallel query.

postgres=# explain
            select * from demo2 left join demo0 
            on(demo2.demo0=demo0.id)
            where demo0.id is null
;

                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Gather  (cost=1027.50..112373.46 rows=1 width=24)
   Workers Planned: 2
   ->  Hash Anti Join  (cost=27.50..111373.36 rows=1 width=24)
         Hash Cond: (demo2.demo0 = demo0.id)
         ->  Parallel Seq Scan on demo2  (cost=0.00..95721.08 rows=4166608 width=16)
         ->  Hash  (cost=15.00..15.00 rows=1000 width=8)
               ->  Seq Scan on demo0  (cost=0.00..15.00 rows=1000 width=8)
 JIT:
   Functions: 10
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(10 rows)

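For comparison, the NOT EXISTS form of the same check is sketched below. Note that NOT IN behaves surprisingly when the subquery can return NULLs, and that NULL foreign key values are valid by SQL semantics, so they are excluded explicitly here.

postgres=# select * from demo2
            where demo2.demo0 is not null
              and not exists
               ( select 1 from demo0 where demo0.id = demo2.demo0 )
;

 id | demo0
----+-------
(0 rows)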

Key Takeaways

The default foreign key checking may not be optimal for bulk inserts or row-by-row replication. It is usually better to rely on pre-validated data or use a dedicated query for bulk checking. The method outlined here combines both approaches, allowing you to avoid running Data Definition Language (DDL) commands to disable or validate foreign keys. The parameter session_replication_role serves this purpose by preventing triggers, including internal triggers for referential integrity, from firing on a database when replicating data from another database where those triggers have already been executed.
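
As a recap, the whole pattern looks like this minimal sketch, with the table names from the examples above; wiring the final count into an actual alert is left to your monitoring stack.

begin;
set local session_replication_role = replica; -- FK triggers are skipped
insert into demo2 select * from demo1;        -- bulk load or replicated DML
commit;                                       -- the setting reverts to origin

-- later, outside the critical path: alert if this returns a non-zero count
select count(*) from demo2
 where demo2.demo0 is not null
   and not exists ( select 1 from demo0 where demo0.id = demo2.demo0 );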
