Why is the optimizer not using the clustering index on my table?



  • I have this table

                         Table "public.lineitem"
         Column      |     Type      | Collation | Nullable | Default 
    -----------------+---------------+-----------+----------+---------
     l_orderkey      | integer       |           |          | 
     l_partkey       | integer       |           |          | 
     l_suppkey       | integer       |           |          | 
     l_linenumber    | integer       |           |          | 
     l_quantity      | integer       |           |          | 
     l_extendedprice | numeric(12,2) |           |          | 
     l_discount      | numeric(12,2) |           |          | 
     l_tax           | numeric(12,2) |           |          | 
     l_returnflag    | character(1)  |           |          | 
     l_linestatus    | character(1)  |           |          | 
     l_shipdate      | date          |           |          | 
     l_commitdate    | date          |           |          | 
     l_receiptdate   | date          |           |          | 
     l_shipinstruct  | character(25) |           |          | 
     l_shipmode      | character(10) |           |          | 
     l_comment       | character(44) |           |          | 
     l_partsuppkey   | character(20) |           |          | 
    Indexes:
        "l_shipdate_c_idx" btree (l_shipdate) CLUSTER
        "l_shipmode_h_idx" hash (l_shipdate)
    Foreign-key constraints:
        "lineitem_l_orderkey_fkey" FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey)
        "lineitem_l_partkey_fkey" FOREIGN KEY (l_partkey) REFERENCES part(p_partkey)
        "lineitem_l_partsuppkey_fkey" FOREIGN KEY (l_partsuppkey) REFERENCES partsupp(ps_partsuppkey)
        "lineitem_l_suppkey_fkey" FOREIGN KEY (l_suppkey) REFERENCES supplier(s_suppkey)
    

    and this query:

    explain analyze select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1 - l_discount)*(1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem
    where
        l_shipdate

    returning this query plan:

    "Finalize GroupAggregate  (cost=2631562.25..2631564.19 rows=6 width=212) (actual time=28624.012..28624.466 rows=4 loops=1)"
    "  Group Key: l_returnflag, l_linestatus"
    "  ->  Gather Merge  (cost=2631562.25..2631563.65 rows=12 width=212) (actual time=28623.998..28624.442 rows=12 loops=1)"
    "        Workers Planned: 2"
    "        Workers Launched: 2"
    "        ->  Sort  (cost=2630562.23..2630562.24 rows=6 width=212) (actual time=28620.633..28620.633 rows=4 loops=3)"
    "              Sort Key: l_returnflag, l_linestatus"
    "              Sort Method: quicksort  Memory: 27kB"
    "              Worker 0:  Sort Method: quicksort  Memory: 27kB"
    "              Worker 1:  Sort Method: quicksort  Memory: 27kB"
    "              ->  Partial HashAggregate  (cost=2630562.03..2630562.15 rows=6 width=212) (actual time=28620.607..28620.611 rows=4 loops=3)"
    "                    Group Key: l_returnflag, l_linestatus"
    "                    Batches: 1  Memory Usage: 24kB"
    "                    Worker 0:  Batches: 1  Memory Usage: 24kB"
    "                    Worker 1:  Batches: 1  Memory Usage: 24kB"
    "                    ->  Parallel Seq Scan on lineitem  (cost=0.00..1707452.35 rows=24616258 width=24) (actual time=0.549..19028.353 rows=19701655 loops=3)"
    "                          Filter: (l_shipdate 
    • Why is the optimizer preferring the sequential scan over lineitem table instead of using l_shipdate_c_idx? Should I drop it?

    Postgres version: PostgreSQL 14.2 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit



  • Your filter

    l_shipdate

    is not very selective. We can see from the plan that it removed only 293,696 rows, while 19,701,655 rows per worker still had to be kept (roughly 59 million in total across the three parallel workers, since `loops=3`). If the planner were to fetch those rows one by one through the index, it would almost certainly be much slower than a sequential scan of the table.
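    If you want to see the plan the planner rejected, you can discourage sequential scans for a single session and compare costs and timings. A diagnostic sketch (the `WHERE` cutoff here is a placeholder, not your original filter, which was cut off in the question):

    ```sql
    -- Diagnostic only: SET LOCAL confines the change to this transaction.
    -- Compare the index plan's cost/time against the seq-scan plan above.
    BEGIN;
    SET LOCAL enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT l_returnflag, l_linestatus, count(*)
    FROM lineitem
    WHERE l_shipdate <= date '1998-09-01';  -- placeholder cutoff
    ROLLBACK;
    ```

    With a filter this unselective you should see a much higher cost on the index plan, which is exactly why the planner avoids it.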

    Should I drop it?

    If this is the only query you run and the only filter you use, then probably yes. Otherwise, there is not enough information to go on. The index would probably be useful if you wanted to see the rows for one particular day, and it might be more useful still with some additional columns in it. It is impossible to say without knowing your application.
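    As an illustration of the "additional columns" point, a covering index can serve narrow single-day lookups with an index-only scan. This is a sketch that assumes a hypothetical workload; the included columns are a guess, not something derived from your question:

    ```sql
    -- Hypothetical: assumes frequent single-day lookups that only need
    -- these columns. INCLUDE (PostgreSQL 11+) stores them in the index
    -- leaf pages so the heap need not be visited (index-only scan),
    -- provided the visibility map is reasonably up to date.
    CREATE INDEX l_shipdate_covering_idx
        ON lineitem (l_shipdate)
        INCLUDE (l_returnflag, l_linestatus, l_quantity);
    ```

    Whether the extra index is worth its write and storage overhead again depends entirely on the application.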



