Optimize join without denormalization



  • I have two tables:

    orders(order_id, scan_timestamp) and
    order_products(order_product_id, order_id, pick_number, quantity)
    

    and I want to know how many units of a particular product (identified by pick_number) were scanned on a particular day:

    SELECT sum(quantity)
    FROM order_products
    JOIN orders USING (order_id)
    WHERE scan_timestamp >= '2022-04-01T00:00:00.000000+02:00'
    AND scan_timestamp < '2022-04-02T00:00:00.000000+02:00'
    AND pick_number = '9553'
    

    This query is not particularly fast, about 19 ms. This adds up when it's in a subselect and runs for several products.

    Aggregate  (cost=7577.42..7577.43 rows=1 width=8) (actual time=17.978..17.990 rows=1 loops=1)
      Buffers: shared hit=2027
      ->  Hash Join  (cost=2538.14..7577.20 rows=87 width=4) (actual time=17.341..17.920 rows=83 loops=1)
            Hash Cond: (order_products.order_id = orders.order_id)
            Buffers: shared hit=2027
            ->  Bitmap Heap Scan on order_products  (cost=126.93..5155.66 rows=3936 width=8) (actual time=1.254..7.544 rows=4132 loops=1)
                  Recheck Cond: (pick_number = '9553'::text)
                  Heap Blocks: exact=1648
                  Buffers: shared hit=1661
                  ->  Bitmap Index Scan on order_products_pick_number_idx  (cost=0.00..125.94 rows=3936 width=0) (actual time=0.938..0.939 rows=4132 loops=1)
                        Index Cond: (pick_number = '9553'::text)
                        Buffers: shared hit=13
            ->  Hash  (cost=2374.37..2374.37 rows=2947 width=4) (actual time=7.222..7.226 rows=2773 loops=1)
                  Buckets: 4096  Batches: 1  Memory Usage: 130kB
                  Buffers: shared hit=366
                  ->  Bitmap Heap Scan on orders  (cost=66.63..2374.37 rows=2947 width=4) (actual time=0.418..4.169 rows=2773 loops=1)
                        Recheck Cond: ((scan_timestamp >= '2022-03-31 22:00:00+00'::timestamp with time zone) AND (scan_timestamp < '2022-04-01 22:00:00+00'::timestamp with time zone))
                        Heap Blocks: exact=356
                        Buffers: shared hit=366
                        ->  Bitmap Index Scan on orders_scan_timestamp_idx  (cost=0.00..65.89 rows=2947 width=0) (actual time=0.360..0.360 rows=2773 loops=1)
                              Index Cond: ((scan_timestamp >= '2022-03-31 22:00:00+00'::timestamp with time zone) AND (scan_timestamp < '2022-04-01 22:00:00+00'::timestamp with time zone))
                              Buffers: shared hit=10
    

    It kind of makes sense: Postgres either has to go through all the orders of that day to find the matching product, or through all the orders of that product across all time to find the matching day.

    I found that if I denormalize my data by copying the scan_timestamp into order_products (and indexing it), I get a massive performance boost (now 1.3 ms):

    SELECT sum(quantity)
    FROM order_products
    WHERE order_scan_timestamp >= '2022-04-01T00:00:00.000000+02:00'
    AND order_scan_timestamp < '2022-04-02T00:00:00.000000+02:00'
    AND pick_number = '9553'
    
    Aggregate  (cost=190.68..190.69 rows=1 width=8) (actual time=0.952..0.956 rows=1 loops=1)
      Buffers: shared hit=90
      ->  Bitmap Heap Scan on order_products  (cost=110.19..190.63 rows=21 width=4) (actual time=0.737..0.885 rows=83 loops=1)
            Recheck Cond: ((order_scan_timestamp >= '2022-03-31 22:00:00+00'::timestamp with time zone) AND (order_scan_timestamp < '2022-04-01 22:00:00+00'::timestamp with time zone) AND (pick_number = '9553'::text))
            Heap Blocks: exact=39
            Buffers: shared hit=90
            ->  Bitmap Index Scan on order_products_order_scan_timestamp_pick_number_index  (cost=0.00..110.19 rows=21 width=0) (actual time=0.717..0.718 rows=83 loops=1)
                  Index Cond: ((order_scan_timestamp >= '2022-03-31 22:00:00+00'::timestamp with time zone) AND (order_scan_timestamp < '2022-04-01 22:00:00+00'::timestamp with time zone) AND (pick_number = '9553'::text))
                  Buffers: shared hit=51
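
    For reference, the denormalization behind this plan is essentially the following (the index matches the one in the plan above; the backfill is shown as a one-off UPDATE):

    ALTER TABLE order_products ADD COLUMN order_scan_timestamp timestamptz;

    -- copy the timestamp over from orders
    UPDATE order_products op
    SET order_scan_timestamp = o.scan_timestamp
    FROM orders o
    WHERE o.order_id = op.order_id;

    CREATE INDEX order_products_order_scan_timestamp_pick_number_index
        ON order_products (order_scan_timestamp, pick_number);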
    

    Is there anything else I can do instead of denormalizing?



  • Optimizing I/O is often a good starting point, but in your case all of your buffers are "hit" (nothing had to be read from disk), so there is nothing there to optimize.

    You can denormalize your data "automatically" by using a functional index. This does involve declaring a function as "immutable" when it might not truly be immutable, so there is a risk of mild corruption (wrong answers from queries using the index) if orders.scan_timestamp ever changes after it has already been used in order_products. You could try to mitigate this with a trigger on orders (a sketch appears after the example query below), but then you would probably have race conditions if both tables are being changed at the same time.

    create function get_scan(int) returns timestamptz language sql immutable parallel safe as $$
        select scan_timestamp from orders where order_id=$1
    $$;
    

    create index on order_products (pick_number, get_scan(order_id));

    Then the corresponding query would be:

    SELECT sum(quantity)
    FROM order_products
    WHERE get_scan(order_id) >= '2022-04-01T00:00:00.000000+02:00'
    AND get_scan(order_id) < '2022-04-02T00:00:00.000000+02:00'
    AND pick_number = '9553';
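
    If you did want to attempt the trigger-based mitigation mentioned above, a minimal sketch could look like this (hypothetical names; it blocks changes to scan_timestamp once the order is referenced in order_products, but it does not remove the race condition when both tables change concurrently):

    create function forbid_scan_timestamp_change() returns trigger
    language plpgsql as $$
    begin
        -- refuse to change a scan_timestamp that the functional index has already "baked in"
        if new.scan_timestamp is distinct from old.scan_timestamp
           and exists (select 1 from order_products where order_id = old.order_id) then
            raise exception 'scan_timestamp of order % is already used in order_products', old.order_id;
        end if;
        return new;
    end;
    $$;

    create trigger orders_scan_timestamp_guard
        before update of scan_timestamp on orders
        for each row execute function forbid_scan_timestamp_change();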
    

    You could instead add an actual column to order_products, using the same function defined above (with the same caveat about immutability) in a GENERATED expression, and then use a regular (not function-based) index including that column. A downside is that you are then storing the denormalized data an extra time, both in the table and in the index, and you have a column in the table that would need to be documented as not really belonging there. The benefit is that you could get an index-only scan in situations where you might not be able to get one with the functional index.
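
    A sketch of that alternative, reusing the get_scan function from above (the column name and the INCLUDE of quantity are just illustrative choices):

    alter table order_products
        add column order_scan_timestamp timestamptz
        generated always as (get_scan(order_id)) stored;

    -- a regular index on the generated column; adding quantity as an INCLUDE column
    -- makes an index-only scan possible for the sum(quantity) query
    create index on order_products (pick_number, order_scan_timestamp) include (quantity);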



