Optimize a query with GROUP BY, ORDER BY, and many SUM and AVG operations



  • I have this query from the TPC-H benchmark:

    explain analyze select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1 - l_discount)*(1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem
    where
        l_shipdate <= date '1998-12-01' - interval '90' day
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus;

    returning this:

    "Finalize GroupAggregate  (cost=2300777.06..2300779.00 rows=6 width=212) (actual time=38289.923..38290.426 rows=4 loops=1)"
    "  Group Key: l_returnflag, l_linestatus"
    "  ->  Gather Merge  (cost=2300777.06..2300778.46 rows=12 width=212) (actual time=38289.907..38290.390 rows=12 loops=1)"
    "        Workers Planned: 2"
    "        Workers Launched: 2"
    "        ->  Sort  (cost=2299777.04..2299777.05 rows=6 width=212) (actual time=38284.169..38284.169 rows=4 loops=3)"
    "              Sort Key: l_returnflag, l_linestatus"
    "              Sort Method: quicksort  Memory: 27kB"
    "              Worker 0:  Sort Method: quicksort  Memory: 27kB"
    "              Worker 1:  Sort Method: quicksort  Memory: 27kB"
    "              ->  Partial HashAggregate  (cost=2299776.84..2299776.96 rows=6 width=212) (actual time=38284.129..38284.133 rows=4 loops=3)"
    "                    Group Key: l_returnflag, l_linestatus"
    "                    Batches: 1  Memory Usage: 24kB"
    "                    Worker 0:  Batches: 1  Memory Usage: 24kB"
    "                    Worker 1:  Batches: 1  Memory Usage: 24kB"
    "                    ->  Parallel Seq Scan on lineitem  (cost=0.00..1493832.54 rows=21491848 width=24) (actual time=0.281..29321.949 rows=17236798 loops=3)"
    "                          Filter: (l_shipdate 

    and it involves this relation:

    CREATE TABLE LINEITEM (
        L_ORDERKEY      INTEGER REFERENCES ORDERS(O_ORDERKEY),
        L_PARTKEY       INTEGER REFERENCES PART(P_PARTKEY),
        L_SUPPKEY       INTEGER REFERENCES SUPPLIER(S_SUPPKEY),
        L_LINENUMBER    INTEGER,
        L_QUANTITY      INTEGER,
        L_EXTENDEDPRICE NUMERIC(12,2),
        L_DISCOUNT      NUMERIC(12,2),
        L_TAX           NUMERIC(12,2),
        L_RETURNFLAG    CHAR(1),
        L_LINESTATUS    CHAR(1),
        L_SHIPDATE      DATE,
        L_COMMITDATE    DATE,
        L_RECEIPTDATE   DATE,
        L_SHIPINSTRUCT  CHAR(25),
        L_SHIPMODE      CHAR(10),
        L_COMMENT       CHAR(44),
        L_PARTSUPPKEY   CHAR(20) REFERENCES PARTSUPP(PS_PARTSUPPKEY)
    );
    

    As you can see, it takes around 40 seconds, and I would like to optimize this. I added a B-tree index on the L_SHIPDATE column (sort order ASC, NULLs last).
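
    For reference, the index was created with a statement along these lines (the index name here is illustrative, not the actual one):

    CREATE INDEX lineitem_l_shipdate_idx   -- hypothetical name
        ON lineitem (l_shipdate ASC NULLS LAST);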

    • How can I do better?

    As you can see at https://tatiyants.com/pev/#/plans/plan_1649953736870, the optimizer is not using the index on l_shipdate; it prefers a sequential scan of the lineitem table instead.
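
    One quick way to confirm what the planner would do with the index (a diagnostic sketch using a simplified form of the query; run it in a throwaway transaction so the setting does not leak into the session):

    begin;
    -- enable_seqscan does not forbid sequential scans, it just makes them
    -- look very expensive, so the planner picks the index if it can
    set local enable_seqscan = off;
    explain (analyze, buffers)
    select l_returnflag, l_linestatus, count(*)
    from lineitem
    where l_shipdate <= date '1998-12-01' - interval '90' day
    group by l_returnflag, l_linestatus;
    rollback;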



  • There is nothing magical you can do to make this faster. The planner is right to skip the index: the filter on l_shipdate passes almost the whole table (rows=17236798 per worker over loops=3, roughly 51 million rows in total), so an index scan would only add random I/O on top of reading nearly every page anyway. Your options are:

    • faster disks

    • more RAM, and make sure the table is cached in RAM

    • throw more workers at it by increasing max_parallel_workers_per_gather (see the sketch after this list)
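
    A minimal sketch of the last two options, assuming the pg_prewarm extension is available and the server has cores and RAM to spare (the worker count is a placeholder to tune for your machine):

    -- pull lineitem into shared_buffers (only helps if shared_buffers
    -- is large enough to hold a useful fraction of the table)
    create extension if not exists pg_prewarm;
    select pg_prewarm('lineitem');

    -- let this session use more workers; the effective count is still
    -- capped by max_parallel_workers and max_worker_processes
    set max_parallel_workers_per_gather = 8;

    A scan this large stays I/O- and CPU-bound, so extra workers buy at best a roughly linear speedup.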



