dev-resources.site
for different kinds of informations.
Story of an SQL query ~10x speed-up with a simple tweak
Let’s re-visit basics of debugging performance issues and importance of monitoring tools(focusing on web apps), with a story of simple tweak improving a bottleneck query performance up-to 10x
TL;DR
A walk-through of simple debugging process for performance issues, and how simple tweaks can make your code run so much faster with minimal efforts to spare on performance. (80/20 rule)
Where to start fixing bad performance
So you have bad performing app or pages, now what?
There are two obvious paths , as you being the creator or maintainer of the app, you already have basic idea of where the bottle necks are, which pages load slow etc, so you can just go and try fixing things, This may work for smaller apps, but of-course, this is not a systematic way.
A good approach for most production apps, is to have things setup in a systematic manner , have performance and server monitoring setup done right(don’t over kill), collect data, analyze data and quickly fix the issues with the richness of insights you get from the right tooling.
I suggest almost always going with the systematic manner, no matter the size of the app. If you have long term plans with you app that is.
Performance monitoring tooling
I am assuming you have an app already in production of test phase, which is facing memory bloat issues or slow API response times, where do you start with? Of-course, you need tooling that can help you determine not only the slowness causing areas in your system, but also the breakdown of insights like how much time DB is taking, how much time each function is taking and so on.
We have many tools for this, my personal favourites, new relic APM, and Skylight(it focuses on Ruby on Rails apps, so has more niche offerings)(not affiliated to any in any way). I use both regularly. Of-course you can use any APM , and even 1 single tool is enough, what you need at the end is:
-
Ability to see slow end points or processes based on a few parameters:
- _ Slowest average response time _ (transactions which perform the worst)
- _ Most time consuming _ (transactions which collectively takes most of the server time)
- _ Throughput _ (most popular transactions)
-
Ability to provide insights into each of the server transaction(we define a transaction as being a unit of work, be it an API call or a background job)
- _ Time being taken _ by each section of the code
- Total time spent on _ DB vs in app _ processing
- Break down of all the SQL queries, and their run time in the transaction
- Aggregation of these insights, like overall slowest queries in the system etc
- In addition, some tools provide even granular insights by trying to identify N+1 queries, repeated queries patterns and showing exactly how much time is spent on each query (i find skylight very good at this)
- Also, another really helpful thing, especially when debugging memory issues, is tooling which can tell us how much memory allocations are done in each section of the code , being run in a single transaction, and of-course, total average allocations per transaction. (again, i find skylight better in this regard)
If we have our tooling setup right, and we are getting as much insights as we need, tool name at the end doesn’t matter as such. It’s very much critical to have your setup right! Everything else, is on your taste and maybe somewhat dependent on your framework.
Just one thing is to ensure working in the right direction, i.e focusing on high impact or frequent transactions instead of just trying to fix the slowest one , that might not even be called in days.
After Identifying Slowest Transactions
Now after all the basic setup, you have your slowest transactions at hand, you pick single transaction you want to focus on, based the impact of that transaction(like an endpoint directly impacting core function of the app.
With right tooling, you should have the all the data you need (even code section level insights) to help you find bottle-necks in slow transactions.
We don’t intend going to a lot of details of each type of fixes you can make with these insights, any engineer can use this data and build upon this information their hypothesis, verify them, and than fix the issues they have identified.
Only one thing i would like to highlight is the importance of benchmarking.
Benchmarking your fixes
Benchmarking in this context is having a base performance stats logged, and than monitoring the performance of same transactions after you fix them, to see how they actually perform and stand against the previously logged transaction performance
Simple SQL query tweak improved it ~10x
(We will mask some details as to ensure code privacy)
Also, we will see why i mentioned 80/20 rules on top, and how it helps us move fast and focus on what matters most and create most impact with minimal efforts.
We had an endpoint in our app, which was fetching Posts i am associated with in some manners, say we have post table and associated tables like post_statuses, post_reports, post_events, post_collaborators and so on. This endpoints fetches all posts where :
- Post is in specific statuses (published, draft etc)
- Post has any pending reports assigned to current user for review (abuse reports)
- Current user is collaborator in the post
Plus, all of these posts should be ordered in a very specific order, with some data on top for each report, i.e:
- Order by latest activity (activity is defined across multiple tables’ operations)
- Include unread comments count (to show kind of like whatsapp shows unread message count)
Now this defines the baseline of the structure and requirements, of-course there are more details to the matter, and the long single SQL query, taking like 25-30 seconds per request.
Original Query
Looked something like:
SELECT posts.*, COUNT(prs.id) AS pending_reports, posts.*,
GREATEST(MAX(post_comments.created_at), MAX(post_events.created_at), GREATEST(posts.created_at, posts.updated_at)) AS latest_activity,
count(DISTINCT(user_comments.id)) FILTER (where user_comments.id IS NOT NULL) AS unread_count,
GREATEST(MAX(comments.created_at), posts.created_at) AS last_activity
FROM "posts"
INNER JOIN "post_statuses" ON "post_statuses"."post_id" = "posts"."id"
LEFT JOIN post_reports as pr ON pr.reportable_id = posts.id AND pr.reportable_type='Post'
LEFT JOIN post_report_statuses as prs ON prs.post_report_id = pr.id AND pr.active = true AND pr.user_id = 1
LEFT JOIN comments AS post_comments ON post_comments.commentable_id = posts.id AND post_comments.commentable_type = 'post'
LEFT JOIN post_events ON post_events.post_id = posts.id
LEFT JOIN comments ON comments.commentable_id = posts.id AND comments.commentable_type = 'post'
LEFT JOIN user_comments ON user_comments.comment_id = comments.id AND user_comments.user_id = 1 AND user_comments.read_at IS NULL
LEFT JOIN post_collaborators as pco ON pco.post_id = posts.id AND pco.user_id = 1 AND pco.post_signed_off = 'f'
WHERE "posts"."id" IN (
SELECT DISTINCT "posts"."id"
FROM "posts"
LEFT JOIN post_collaborators AS current_post_collaborators ON current_post_collaborators.post_id = posts.id AND current_post_collaborators.deleted_at IS NULL
LEFT JOIN post_reports AS user_reports ON posts.id = user_reports.reportable_id AND user_reports.reportable_type = 'post'
LEFT JOIN post_report_statuses ON user_reports.id = post_report_statuses.post_report_id AND post_report_statuses.status = ANY(ARRAY [0, 1, 2])
WHERE (posts.moderator_id = 1 OR current_post_collaborators.user_id = 1 OR user_reports.user_id = 1 )
) AND (
"posts"."creator_id" IN (SELECT "users"."id" FROM "users" WHERE ((users.first_name || ' ' || COALESCE(users.last_name, '')) ILIKE '%%'))
OR "posts"."moderator_id" IN (SELECT "users"."id" FROM "users" WHERE ((users.first_name || ' ' || COALESCE(users.last_name, '')) ILIKE '%%')))
GROUP BY "posts"."id"
ORDER BY latest_activity DESC, last_activity DESC
LIMIT 50 OFFSET 0
At first glance, there are some obvious problems in this SQL , remember this SQL is produced by an ORM, not directly written by a developer, and ORMs while make things easy and fast to build, sometimes can introduce un-intended complexity.
Original Query Plan
Here is how its Query Plan with Explain Analyze looked like this(don’t try to read it in full)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=237736.02..237736.15 rows=50 width=694) (actual time=27772.897..27772.922 rows=50 loops=1)
-> Sort (cost=237736.02..237785.96 rows=19974 width=694) (actual time=27772.895..27772.916 rows=50 loops=1)
Sort Key: (GREATEST(max(post_comments.created_at), max(post_events.created_at), GREATEST(posts.created_at, posts.updated_at))) DESC, (GREATEST(max(comments.created_at), posts.created_at)) DESC
Sort Method: top-N heapsort Memory: 121kB
-> GroupAggregate (cost=33452.05..237072.50 rows=19974 width=694) (actual time=373.867..27771.558 rows=685 loops=1)
Group Key: posts.id
-> Merge Left Join (cost=33452.05..226439.26 rows=685580 width=371) (actual time=369.564..27562.155 rows=1371572 loops=1)
Merge Cond: (posts.id = pco.post_id)
-> Merge Join (cost=32483.89..223659.06 rows=685580 width=371) (actual time=368.498..27302.152 rows=1371572 loops=1)
Merge Cond: (post_statuses.post_id = posts_1.id)
-> Merge Left Join (cost=27619.84..193204.04 rows=7477059 width=379) (actual time=104.873..22591.305 rows=71784103 loops=1)
Merge Cond: (posts.id = post_events.post_id)
-> Merge Left Join (cost=27619.55..70066.69 rows=1138329 width=371) (actual time=104.821..1694.553 rows=4350298 loops=1)
Merge Cond: (posts.id = post_comments.commentable_id)
-> Merge Left Join (cost=27619.25..37371.24 rows=262806 width=363) (actual time=104.790..336.002 rows=379658 loops=1)
Merge Cond: (posts.id = comments.commentable_id)
-> Merge Join (cost=2181.16..7665.76 rows=60674 width=347) (actual time=4.289..89.766 rows=79948 loops=1)
Merge Cond: (posts.id = post_statuses.post_id)
-> Merge Left Join (cost=2180.74..4579.45 rows=19974 width=339) (actual time=4.271..44.332 rows=26882 loops=1)
Merge Cond: (posts.id = pr.reportable_id)
-> Index Scan using posts_pkey on posts (cost=2096.29..4435.40 rows=19974 width=331) (actual time=3.439..35.952 rows=26630 loops=1)
Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
SubPlan 1
-> Seq Scan on users (cost=0.00..1043.38 rows=1850 width=8) (actual time=0.020..3.018 rows=1850 loops=1)
Filter: ((((first_name)::text || ' '::text) || (COALESCE(last_name, ''::character varying))::text) ~~* '%%'::text)
SubPlan 2
-> Seq Scan on users users_1 (cost=0.00..1043.38 rows=1850 width=8) (never executed)
Filter: ((((first_name)::text || ' '::text) || (COALESCE(last_name, ''::character varying))::text) ~~* '%%'::text)
-> Sort (cost=84.45..86.39 rows=773 width=12) (actual time=0.826..1.038 rows=777 loops=1)
Sort Key: pr.reportable_id
Sort Method: quicksort Memory: 61kB
-> Hash Right Join (cost=29.62..47.37 rows=773 width=12) (actual time=0.463..0.673 rows=777 loops=1)
Hash Cond: (prs.post_report_id = pr.id)
Join Filter: (pr.active AND (pr.user_id = 1))
Rows Removed by Join Filter: 775
-> Seq Scan on post_report_statuses tas (cost=0.00..15.72 rows=772 width=16) (actual time=0.005..0.078 rows=779 loops=1)
-> Hash (cost=19.96..19.96 rows=773 width=21) (actual time=0.321..0.322 rows=777 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 57kB
-> Seq Scan on post_reports ta (cost=0.00..19.96 rows=773 width=21) (actual time=0.014..0.204 rows=777 loops=1)
Filter: ((reportable_type)::text = 'post'::text)
Rows Removed by Filter: 24
-> Index Only Scan using index_post_statuses_on_post_id on post_statuses (cost=0.29..2265.14 rows=80899 width=8) (actual time=0.015..27.638 rows=79948 loops=1)
Heap Fetches: 28565
-> Materialize (cost=25438.10..26014.87 rows=115355 width=24) (actual time=100.494..154.413 rows=370263 loops=1)
-> Sort (cost=25438.10..25726.48 rows=115355 width=24) (actual time=100.490..120.023 rows=115303 loops=1)
Sort Key: comments.commentable_id
Sort Method: external merge Disk: 2952kB
-> Hash Left Join (cost=4747.85..13373.21 rows=115355 width=24) (actual time=4.067..61.603 rows=115304 loops=1)
Hash Cond: (comments.id = user_comments.comment_id)
-> Seq Scan on comments (cost=0.00..8322.54 rows=115355 width=24) (actual time=0.006..39.808 rows=115304 loops=1)
Filter: ((commentable_type)::text = 'post'::text)
Rows Removed by Filter: 27899
-> Hash (cost=4734.53..4734.53 rows=1066 width=16) (actual time=4.054..4.055 rows=807 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 54kB
-> Bitmap Heap Scan on user_comments (cost=39.41..4734.53 rows=1066 width=16) (actual time=0.567..3.899 rows=807 loops=1)
Recheck Cond: (user_id = 1)
Filter: (read_at IS NULL)
Rows Removed by Filter: 2899
Heap Blocks: exact=1935
-> Bitmap Index Scan on index_user_comments_on_user_id (cost=0.00..39.15 rows=3030 width=0) (actual time=0.270..0.270 rows=3889 loops=1)
Index Cond: (user_id = 1)
-> Materialize (cost=0.29..17628.74 rows=115355 width=16) (actual time=0.027..391.368 rows=4340903 loops=1)
-> Index Scan using index_comments_on_commentable_id on comments post_comments (cost=0.29..17340.35 rows=115355 width=16) (actual time=0.024..117.440 rows=115303 loops=1)
Filter: ((commentable_type)::text = 'post'::text)
Rows Removed by Filter: 9612
-> Materialize (cost=0.29..8572.97 rows=174931 width=16) (actual time=0.049..4195.281 rows=71784103 loops=1)
-> Index Scan using index_post_events_on_post_id on post_events (cost=0.29..8135.64 rows=174931 width=16) (actual time=0.047..97.077 rows=173202 loops=1)
-> Unique (cost=4864.04..4876.19 rows=2430 width=8) (actual time=53.422..53.979 rows=685 loops=1)
-> Sort (cost=4864.04..4870.12 rows=2430 width=8) (actual time=53.419..53.623 rows=904 loops=1)
Sort Key: posts_1.id
Sort Method: quicksort Memory: 25kB
-> Hash Left Join (cost=1873.16..4727.40 rows=2430 width=8) (actual time=11.725..53.306 rows=904 loops=1)
Hash Cond: (posts_1.id = user_reports.reportable_id)
Filter: ((posts_1.moderator_id = 1) OR (current_post_collaborators.user_id = 1) OR (user_reports.user_id = 1))
Rows Removed by Filter: 70531
-> Hash Right Join (cost=1813.22..3964.41 rows=66906 width=24) (actual time=10.900..43.152 rows=70403 loops=1)
Hash Cond: (current_post_collaborators.post_id = posts_1.id)
-> Seq Scan on post_collaborators current_post_collaborators (cost=0.00..1975.53 rows=66906 width=16) (actual time=0.016..12.351 rows=66718 loops=1)
Filter: (deleted_at IS NULL)
Rows Removed by Filter: 23609
-> Hash (cost=1480.32..1480.32 rows=26632 width=16) (actual time=10.868..10.869 rows=26632 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1505kB
-> Seq Scan on posts posts_1 (cost=0.00..1480.32 rows=26632 width=16) (actual time=0.009..6.641 rows=26632 loops=1)
-> Hash (cost=50.28..50.28 rows=773 width=12) (actual time=0.733..0.735 rows=779 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 45kB
-> Hash Right Join (cost=29.62..50.28 rows=773 width=12) (actual time=0.306..0.625 rows=779 loops=1)
Hash Cond: (post_report_statuses.post_report_id = user_reports.id)
-> Seq Scan on post_report_statuses (cost=0.00..18.62 rows=772 width=8) (actual time=0.009..0.136 rows=779 loops=1)
Filter: (status = ANY ('{0,1,2}'::integer[]))
-> Hash (cost=19.96..19.96 rows=773 width=20) (actual time=0.289..0.290 rows=777 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 51kB
-> Seq Scan on post_reports user_reports (cost=0.00..19.96 rows=773 width=20) (actual time=0.009..0.181 rows=777 loops=1)
Filter: ((reportable_type)::text = 'post'::text)
Rows Removed by Filter: 24
-> Sort (cost=968.16..968.80 rows=254 width=8) (actual time=1.063..15.024 rows=245069 loops=1)
Sort Key: pco.post_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on post_collaborators tus (cost=358.00..958.02 rows=254 width=8) (actual time=0.871..1.038 rows=138 loops=1)
Recheck Cond: (user_id = 1)
Filter: (NOT post_signed_off)
Heap Blocks: exact=113
-> BitmapAnd (cost=358.00..358.00 rows=254 width=0) (actual time=0.854..0.855 rows=0 loops=1)
-> Bitmap Index Scan on index_post_collaborators_on_user_id (cost=0.00..10.06 rows=769 width=0) (actual time=0.055..0.055 rows=741 loops=1)
Index Cond: (user_id = 1)
-> Bitmap Index Scan on index_post_collaborators_on_post_signed_off (cost=0.00..347.56 rows=29769 width=0) (actual time=0.779..0.779 rows=30347 loops=1)
Index Cond: (post_signed_off = false)
Planning Time: 5.432 ms
Execution Time: 27773.891 ms
(108 rows)
It is obviously too long to read, but there are some key areas like :
- Actual cost and planned cost to look closely into, especially where it is highest.
- Lookout for the sequential scans, and fix them with appropriate indexes.
Start with small and obvious things
But remember we are here following 80/20 rule, we went systematically to find bottlenecks, but we will not go in and fix everything we see not making sense, we will start small, and see impact while we improve. First lets remove the useless blocks at the end which look something like:
(
"posts"."creator_id" IN (SELECT "users"."id" FROM "users" WHERE ((users.first_name || ' ' || COALESCE(users.last_name, '')) ILIKE '%%'))
OR "posts"."moderator_id" IN (SELECT "users"."id" FROM "users" WHERE ((users.first_name || ' ' || COALESCE(users.last_name, '')) ILIKE '%%')))
These likes are literally doing nothing, they will validate to true for all ALL users, so we don’t really need them, they are intended to be there when there is a free form search on username. But we had a missing check which was causing this SQL to be written even when not needed.
Removing this made a huge impact, like a 500ms impact on a 30s query, lol. But that was a start. Don’t worry.
Remove duplicate join
At this stage, we noticed a duplicate join:
LEFT JOIN comments AS post_comments ON post_comments.commentable_id = posts.id AND post_comments.commentable_type = 'post'
LEFT JOIN comments ON comments.commentable_id = posts.id AND comments.commentable_type = 'post'
These are essentially joining on same conditions , but one is aliased and one is not. Was not something looked like could make a huge difference, but hey, continue with small fixes.
So i removed one join and re-tried the final query… and we got:
Planning Time: 9.905 ms
Execution Time: 2151.695 ms
(93 rows)
This is 2.1 seconds, vs the original 25-30 seconds variant. Voila, 10-15x improvements with 1 simple tweak. This is where we decided that it is good enough to push in first iteration of the enhancements, we opened up our rails models, and fixed a couple of scopes, dried out some SQL in different scopes, and we had this all sorted.
(As we pushed this fix, we continued to monitor the performance benchmarks for respective transactions, and we saw up-to 10x improvements in overall stats of slowest response time, and average response time etc.)
Further optimisations, learning where to stop
At this stage, we have achieved our 80% from the 80/20 rule, we have acceptable performance for this endpoint, i know 2 seconds is still too slow , but this was for worst case, average case is gone under 400ms, which is still not great, but acceptable in our use case(We have other slow endpoints to take care of).
But, what i would like to touch base on, is why 80/20 rule says that 80% of the desired outcomes can be achieved with 20% efforts(in the right direction), and rest 20% of the desired outcomes, take 80% of your time, with showing you how much efforts more optimizations would require.
If you look at the query again, there is a complex select:
GREATEST(MAX(post_comments.created_at), MAX(post_events.created_at), GREATEST(posts.created_at, posts.updated_at)) AS latest_activity
And what’s more is, we are even sorting on this! Which means, for all eligible rows, our DB has to calculate this giant value and than sort on the result of select. Which is obviously going to be very slow.
One solution which comes to mind is, that we can store a calculated latest_activity_timestamp value for posts , and maintain that value in a reasonable manner(via model callbacks or DB triggers), and have it indexed so we can run our queries in the most optimal manner.
BUT, as you can tell, introducing a new calculated value column in DB, and making sure all triggers are set right against this, in going to be a significant effort, compared to what we did before by just removing 2-3 lines of SQL and achieving 10x improving.
I know doing this complex activity, may give us more 10x improvement(bring 2s query time to 200ms), but that is not worth it right now for us, we will do this when it bites us more than other high impact things. We have got 80% of the desired results and we saved plenty of time to spend on other problems.
So lesson here is to learn where to start, where to stop and finding the sweat spot for your use case.
Featured ones: