Parallel merge join with sorting on large temporary tables in PostgreSQL
inna last edited by
I have two tables with text column which I want to join: larger has billion rows and smaller has 100M rows. Tables don't fit into memory, so PostgreSQL reasonably uses merge join for joining.
The problem happens on the sorting stage: PostgreSQL does single threaded sorting, which takes forever.
Is there any way to solve this? I imagine parallel multi-worker sort would scale it. Or maybe there are any other possible solutions? I think this should be very common scenario.
Update: I found that issue is reproducible only for temporary tables, which is a known feature per: https://stackoverflow.com/questions/69533864/why-are-scans-of-ctes-and-temporary-tablest-parallel-restricted Parallel scans are not allowed on temporary tables.
Having parallel sort in PostgreSQL would mean to exchange lots of rows between parallel worker processes, so it is questionable whether that would be a win.
If the speed of that query is very important, one possible solution for the problem would be partitioning. You'd have to partition both tables on the column that is used in the join condition (I expect that the join will be on
=) and use the same partition boundaries for both tables. Then set the parameter
on, and PostgreSQL will perform the join for each partition. Not only can that be parallelized, but since the tables are smaller, you may also end up with a faster hash join.