How to speed up a subquery



  • Let's suppose we have this query:

    select
        o_orderpriority,
        count(*) as order_count
    from
        orders
    where
        o_orderdate >= date '1993-07-01'
        and o_orderdate < date '1993-10-01'
        and exists (
            select *
            from lineitem, orders
            where l_orderkey = o_orderkey
              and l_commitdate < l_receiptdate
        )
    group by
        o_orderpriority
    order by
        o_orderpriority

    from the TPC-H benchmark. These are the tables:

    CREATE TABLE LINEITEM
     ( 
    L_ORDERKEY INTEGER REFERENCES ORDERS(O_ORDERKEY),
    L_PARTKEY INTEGER REFERENCES PART(P_PARTKEY),
    L_SUPPKEY INTEGER REFERENCES SUPPLIER(S_SUPPKEY),
    L_LINENUMBER INTEGER,
    L_QUANTITY INTEGER,
    L_EXTENDEDPRICE NUMERIC (12,2),
    L_DISCOUNT NUMERIC (12,2),
    L_TAX NUMERIC (12,2),
    L_RETURNFLAG CHAR (1),
    L_LINESTATUS CHAR (1),
    L_SHIPDATE DATE ,
    L_COMMITDATE DATE ,
    L_RECEIPTDATE DATE ,
    L_SHIPINSTRUCT CHAR (25),
    L_SHIPMODE CHAR (10),
    L_COMMENT CHAR (44),
    L_PARTSUPPKEY CHAR (20) REFERENCES PARTSUPP(PS_PARTSUPPKEY)
    ) 
    

    and

    CREATE TABLE ORDERS
     ( 
    O_ORDERKEY INTEGER PRIMARY KEY ,
    O_CUSTKEY INTEGER REFERENCES CUSTOMER(C_CUSTKEY),
    O_ORDERSTATUS CHAR (1),
    O_TOTALPRICE NUMERIC (12,2),
    O_ORDERDATE DATE ,
    O_ORDERPRIORITY CHAR (15),
    O_CLERK CHAR (15),
    O_SHIPPRIORITY INTEGER,
    O_COMMENT CHAR (79)
    ) 
    

    EXPLAIN ANALYZE returns:

    "Finalize GroupAggregate  (cost=734234.44..734235.71 rows=5 width=24) (actual time=14682.540..14683.017 rows=5 loops=1)"
    "  Group Key: orders.o_orderpriority"
    "  InitPlan 1 (returns $1)"
    "    ->  Nested Loop  (cost=0.45..91327112.22 rows=19995359 width=0) (actual time=0.803..0.804 rows=1 loops=1)"
    "          ->  Seq Scan on lineitem  (cost=0.00..2144850.95 rows=19995359 width=4) (actual time=0.240..0.240 rows=1 loops=1)"
    "                Filter: (l_commitdate < l_receiptdate)"
    "          ->  Memoize  (cost=0.45..4.99 rows=1 width=4) (actual time=0.560..0.561 rows=1 loops=1)"
    "                Cache Key: lineitem.l_orderkey"
    "                Cache Mode: logical"
    "                Hits: 0  Misses: 1  Evictions: 0  Overflows: 0  Memory Usage: 1kB"
    "                ->  Index Only Scan using orders_pkey on orders orders_1  (cost=0.43..4.98 rows=1 width=4) (actual time=0.555..0.555 rows=1 loops=1)"
    "                      Index Cond: (o_orderkey = lineitem.l_orderkey)"
    "                      Heap Fetches: 1"
    "  ->  Gather Merge  (cost=734229.43..734230.60 rows=10 width=24) (actual time=14682.524..14682.999 rows=15 loops=1)"
    "        Workers Planned: 2"
    "        Params Evaluated: $1"
    "        Workers Launched: 2"
    "        ->  Sort  (cost=733229.41..733229.42 rows=5 width=24) (actual time=14676.302..14676.303 rows=5 loops=3)"
    "              Sort Key: orders.o_orderpriority"
    "              Sort Method: quicksort  Memory: 25kB"
    "              Worker 0:  Sort Method: quicksort  Memory: 25kB"
    "              Worker 1:  Sort Method: quicksort  Memory: 25kB"
    "              ->  Partial HashAggregate  (cost=733229.30..733229.35 rows=5 width=24) (actual time=14676.265..14676.266 rows=5 loops=3)"
    "                    Group Key: orders.o_orderpriority"
    "                    Batches: 1  Memory Usage: 24kB"
    "                    Worker 0:  Batches: 1  Memory Usage: 24kB"
    "                    Worker 1:  Batches: 1  Memory Usage: 24kB"
    "                    ->  Result  (cost=0.00..732048.00 rows=236260 width=16) (actual time=7272.919..14638.858 rows=191224 loops=3)"
    "                          One-Time Filter: $1"
    "                          ->  Parallel Seq Scan on orders  (cost=0.00..732048.00 rows=236260 width=16) (actual time=7272.917..14628.994 rows=191224 loops=3)"
    "                                Filter: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate < '1993-10-01'::date))"
    "                                Rows Removed by Filter: 4808776"
    "Planning Time: 6.722 ms"
    "Execution Time: 14683.864 ms"
    
    • How can I speed up this query (and its subquery) using indexes, materialized views, and/or other techniques?
    • If I change the char(n) data type to text, would I get better performance?

    EDIT

    CREATE INDEX o_orderdate_c_idx ON orders
        USING btree (o_orderdate ASC NULLS LAST);
    ALTER TABLE orders CLUSTER ON o_orderdate_c_idx;
    

    "Finalize GroupAggregate  (cost=734234.44..734235.71 rows=5 width=24) (actual time=9026.877..9027.505 rows=5 loops=1)"
    "  Group Key: orders.o_orderpriority"
    "  InitPlan 1 (returns $1)"
    "    ->  Nested Loop  (cost=0.45..91327077.11 rows=19995351 width=0) (actual time=0.025..0.025 rows=1 loops=1)"
    "          ->  Seq Scan on lineitem  (cost=0.00..2144850.65 rows=19995351 width=4) (actual time=0.010..0.010 rows=1 loops=1)"
    "                Filter: (l_commitdate < l_receiptdate)"
    "          ->  Memoize  (cost=0.45..4.99 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)"
    "                Cache Key: lineitem.l_orderkey"
    "                Cache Mode: logical"
    "                Hits: 0  Misses: 1  Evictions: 0  Overflows: 0  Memory Usage: 1kB"
    "                ->  Index Only Scan using orders_pkey on orders orders_1  (cost=0.43..4.98 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)"
    "                      Index Cond: (o_orderkey = lineitem.l_orderkey)"
    "                      Heap Fetches: 1"
    "  ->  Gather Merge  (cost=734229.43..734230.60 rows=10 width=24) (actual time=9026.871..9027.498 rows=15 loops=1)"
    "        Workers Planned: 2"
    "        Params Evaluated: $1"
    "        Workers Launched: 2"
    "        ->  Sort  (cost=733229.41..733229.42 rows=5 width=24) (actual time=9023.327..9023.328 rows=5 loops=3)"
    "              Sort Key: orders.o_orderpriority"
    "              Sort Method: quicksort  Memory: 25kB"
    "              Worker 0:  Sort Method: quicksort  Memory: 25kB"
    "              Worker 1:  Sort Method: quicksort  Memory: 25kB"
    "              ->  Partial HashAggregate  (cost=733229.30..733229.35 rows=5 width=24) (actual time=9023.294..9023.295 rows=5 loops=3)"
    "                    Group Key: orders.o_orderpriority"
    "                    Batches: 1  Memory Usage: 24kB"
    "                    Worker 0:  Batches: 1  Memory Usage: 24kB"
    "                    Worker 1:  Batches: 1  Memory Usage: 24kB"
    "                    ->  Result  (cost=0.00..732048.00 rows=236260 width=16) (actual time=4222.574..8991.249 rows=191224 loops=3)"
    "                          One-Time Filter: $1"
    "                          ->  Parallel Seq Scan on orders  (cost=0.00..732048.00 rows=236260 width=16) (actual time=4222.572..8981.901 rows=191224 loops=3)"
    "                                Filter: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate < '1993-10-01'::date))"
    "                                Rows Removed by Filter: 4808776"
    "Planning Time: 3.580 ms"
    "Execution Time: 9027.807 ms"

    EDIT 2

                          Table "public.orders"
         Column      |     Type      | Collation | Nullable | Default 
    -----------------+---------------+-----------+----------+---------
     o_orderkey      | integer       |           | not null | 
     o_custkey       | integer       |           |          | 
     o_orderstatus   | character(1)  |           |          | 
     o_totalprice    | numeric(12,2) |           |          | 
     o_orderdate     | date          |           |          | 
     o_orderpriority | character(15) |           |          | 
     o_clerk         | character(15) |           |          | 
     o_shippriority  | integer       |           |          | 
     o_comment       | character(79) |           |          | 
    Indexes:
        "orders_pkey" PRIMARY KEY, btree (o_orderkey)
        "o_orderdate_c_idx" btree (o_orderdate) CLUSTER
    Foreign-key constraints:
        "orders_o_custkey_fkey" FOREIGN KEY (o_custkey) REFERENCES customer(c_custkey)
    Referenced by:
        TABLE "lineitem" CONSTRAINT "lineitem_l_orderkey_fkey" FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey)
    

    and EXPLAIN (ANALYZE, BUFFERS, SETTINGS) returns:

    "Finalize GroupAggregate  (cost=734234.44..734235.71 rows=5 width=24) (actual time=13542.359..13542.849 rows=5 loops=1)"
    "  Group Key: orders.o_orderpriority"
    "  Buffers: shared hit=19 read=638299"
    "  InitPlan 1 (returns $1)"
    "    ->  Nested Loop  (cost=0.45..91327077.11 rows=19995351 width=0) (actual time=0.681..0.682 rows=1 loops=1)"
    "          Buffers: shared hit=2 read=4"
    "          ->  Seq Scan on lineitem  (cost=0.00..2144850.65 rows=19995351 width=4) (actual time=0.187..0.187 rows=1 loops=1)"
    "                Filter: (l_commitdate < l_receiptdate)"
    "                Buffers: shared read=1"
    "          ->  Memoize  (cost=0.45..4.99 rows=1 width=4) (actual time=0.492..0.492 rows=1 loops=1)"
    "                Cache Key: lineitem.l_orderkey"
    "                Cache Mode: logical"
    "                Hits: 0  Misses: 1  Evictions: 0  Overflows: 0  Memory Usage: 1kB"
    "                Buffers: shared hit=2 read=3"
    "                ->  Index Only Scan using orders_pkey on orders orders_1  (cost=0.43..4.98 rows=1 width=4) (actual time=0.487..0.487 rows=1 loops=1)"
    "                      Index Cond: (o_orderkey = lineitem.l_orderkey)"
    "                      Heap Fetches: 1"
    "                      Buffers: shared hit=2 read=3"
    "  ->  Gather Merge  (cost=734229.43..734230.60 rows=10 width=24) (actual time=13542.294..13542.783 rows=15 loops=1)"
    "        Workers Planned: 2"
    "        Params Evaluated: $1"
    "        Workers Launched: 2"
    "        Buffers: shared hit=19 read=638299"
    "        ->  Sort  (cost=733229.41..733229.42 rows=5 width=24) (actual time=13537.578..13537.579 rows=5 loops=3)"
    "              Sort Key: orders.o_orderpriority"
    "              Sort Method: quicksort  Memory: 25kB"
    "              Buffers: shared hit=17 read=638295"
    "              Worker 0:  Sort Method: quicksort  Memory: 25kB"
    "              Worker 1:  Sort Method: quicksort  Memory: 25kB"
    "              ->  Partial HashAggregate  (cost=733229.30..733229.35 rows=5 width=24) (actual time=13537.542..13537.543 rows=5 loops=3)"
    "                    Group Key: orders.o_orderpriority"
    "                    Batches: 1  Memory Usage: 24kB"
    "                    Buffers: shared hit=3 read=638295"
    "                    Worker 0:  Batches: 1  Memory Usage: 24kB"
    "                    Worker 1:  Batches: 1  Memory Usage: 24kB"
    "                    ->  Result  (cost=0.00..732048.00 rows=236260 width=16) (actual time=6079.412..13496.465 rows=191224 loops=3)"
    "                          One-Time Filter: $1"
    "                          Buffers: shared hit=3 read=638295"
    "                          ->  Parallel Seq Scan on orders  (cost=0.00..732048.00 rows=236260 width=16) (actual time=6079.411..13485.183 rows=191224 loops=3)"
    "                                Filter: ((o_orderdate >= '1993-07-01'::date) AND (o_orderdate < '1993-10-01'::date))"
    "                                Rows Removed by Filter: 4808776"
    "                                Buffers: shared hit=3 read=638295"
    "Planning:"
    "  Buffers: shared hit=223 read=20"
    "Planning Time: 5.517 ms"
    "Execution Time: 13543.511 ms"
    


  • You need an index on orders (o_orderdate), which should avoid the expensive sequential scan.

    If it doesn't, and the sequential scan is actually slower than the index scan, PostgreSQL is probably not configured properly: either random_page_cost is too high for your hardware, or effective_cache_size is too low. Disabling parallel query (max_parallel_workers_per_gather = 0) will also favor an index scan.
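    As a sketch, the index and the configuration experiments above could look like this; the setting values are only illustrative assumptions and should be adjusted to the actual hardware:

    -- The recommended index (the o_orderdate_c_idx from the question is equivalent):
    CREATE INDEX ON orders (o_orderdate);

    -- Illustrative values only; tune for your machine.
    SET random_page_cost = 1.1;              -- the default of 4 assumes spinning disks; SSDs make random I/O much cheaper
    SET effective_cache_size = '8GB';        -- roughly the RAM available to the OS for caching data
    SET max_parallel_workers_per_gather = 0; -- temporarily disable parallel query to test whether an index scan wins

    After each change, re-run EXPLAIN (ANALYZE, BUFFERS) and check whether the planner replaces the parallel sequential scan on orders with an index scan on o_orderdate.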



