Optimizing a PostgreSQL query with many small lookups



  • I have a database about books and book sales. The tables look like this:

    CREATE TABLE purchase (
        person_id integer,
        book_id integer,
        CONSTRAINT purchase_pk PRIMARY KEY (person_id, book_id)
    );
    

    CREATE TABLE person (
    person_id integer PRIMARY KEY
    );

    CREATE TABLE book (
    book_id integer PRIMARY KEY
    );

    CREATE INDEX purchase_idx ON purchase (person_id, book_id);

    I would need to know the number of people who have brought both of the books for each pair of books, so basically:

    book_1_id book_2_id people_count_who_bought_both_books
    1 2 45
    1 3 789

    I have defined the following query for this:

    CREATE MATERIALIZED VIEW book_sales_stat AS
    SELECT b1.book_id AS book_1_id,
           b2.book_id AS book_2_id,
           (SELECT count(*)
            FROM person AS per
            WHERE EXISTS (SELECT FROM purchase pur
                          WHERE pur.person_id = per.person_id
                          AND pur.book_id = b1.book_id)
            AND EXISTS (SELECT FROM purchase pur
                          WHERE pur.person_id = per.person_id
                          AND pur.book_id = b2.book_id)
           ) AS person_count
    FROM book AS b1, book AS b2
    WHERE b1.book_id < b2.book_id;
    

    But unfortunately it is incredibly slow.

    The execution plan looks like this:

     Nested Loop  (cost=0.29..1564886083010.59 rows=90420300 width=16)
       ->  Seq Scan on book b1  (cost=0.00..237.70 rows=16470 width=4)
       ->  Index Only Scan using book_pk on book b2  (cost=0.29..96.38 rows=5490 width=4)
             Index Cond: (book_id > b1.book_id)
       SubPlan 1
         ->  Aggregate  (cost=17306.76..17306.77 rows=1 width=8)
               ->  Nested Loop  (cost=1.14..17306.76 rows=1 width=0)
                     ->  Nested Loop  (cost=0.72..17238.11 rows=123 width=8)
                           ->  Index Only Scan using purchase_idx on purchase pur_1  (cost=0.42..16791.97 rows=123 width=4)
                                 Index Cond: (book_id = b2.book_id)
                           ->  Index Only Scan using person_pk on person per  (cost=0.29..3.63 rows=1 width=4)
                                 Index Cond: (person_id = pur_1.person_id)
                     ->  Index Only Scan using purchase_idx on purchase pur  (cost=0.42..0.56 rows=1 width=4)
                           Index Cond: ((person_id = per.person_id) AND (book_id = b1.book_id))
     JIT:
       Functions: 13
       Options: Inlining true, Optimization true, Expressions true, Deforming true
    

    Seems pretty good in so far that the purchase_idx index is used for pretty much everything inside the loop.

    But why does it not use a parallel execution plan? I tried to go over all the requirements for parallel execution in the Postgres docs, but did not find anything that I would not be satisfied.

    What can I do to speed the query up and get it to execute in parallel?



  • Your query falls under https://www.postgresql.org/docs/current/parallel-safety.html :

    The following operations are always parallel restricted:

    • Plan nodes that reference a correlated SubPlan.

    The top nested loop could be parallelized, but it would have to be gathered before dispatching to the SubPlan. There is just no point in doing it that way.



Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2