Postgres : Indexing switches to Parallel Seq Scan from index scan when queried for date range exceeding 150 days



  • I have two tables - exam_data and series_data, with 300M and 900M records each. I have a query which joins the two tables. If I pull the records for the last 90 days, then the query returns within a second. But the moment it exceeds certain date range such as 180 days, then query takes huge amount of time, in excess of 200 seconds, sometimes even 10 minutes.

    When I observe the explain plan, one thing that stands out is that, the query switches from index scan to parallel seq scan. Number of workers also increases from 3 to 10. From the infrastructure point of view, the DB is running on AWS RDS server with db.m6g.large instance type, which has 2 vCPUs, 8GB memory.

    Below are the query plans - first with the date range set to 180 days and the second one set to 90 days.

    With 180 days

    Limit  (cost=57228.52..700532.83 rows=1000 width=198)
      ->  Nested Loop  (cost=57228.52..30312473.74 rows=47031 width=198)
            Join Filter: (exam.mdp_source_connector_ref_id = reff.mdp_source_connector_ref_id)
            ->  Gather  (cost=57228.52..30171353.12 rows=18445 width=174)
                  Workers Planned: 10
                  ->  Parallel Hash Join  (cost=56228.52..30168508.62 rows=1844 width=174)
                        Hash Cond: ((series.mdp_source_connector_ref_id = exam.mdp_source_connector_ref_id) AND (series.exam_id = exam.exam_id))
                        ->  Parallel Seq Scan on series_data series  (cost=0.00..29035458.82 rows=107681532 width=75)
                              Filter: (filtered_flag = 'N'::text)
                        ->  Parallel Hash  (cost=56216.33..56216.33 rows=813 width=130)
                              ->  Parallel Bitmap Heap Scan on exam_data exam  (cost=165.66..56216.33 rows=813 width=130)
                                    Recheck Cond: (system_id = '60743ABCDC'::text)
                                    Filter: ((filtered_flag = 'N'::text) AND (exam_enddatetime_ts >= (CURRENT_DATE - 180)))
                                    ->  Bitmap Index Scan on exam_system_id  (cost=0.00..165.03 rows=14462 width=0)
                                          Index Cond: (system_id = '60743ABCDC'::text)
            ->  Materialize  (cost=0.00..17.65 rows=510 width=36)
                  ->  Seq Scan on mdp_source_connector_ref reff  (cost=0.00..15.10 rows=510 width=36)
    

    With 90 days

    Limit  (cost=1165.97..686763.18 rows=1000 width=198)
      ->  Nested Loop  (cost=1165.97..11702939.19 rows=17068 width=198)
            Join Filter: (exam.mdp_source_connector_ref_id = reff.mdp_source_connector_ref_id)
            ->  Gather  (cost=1165.97..11651721.36 rows=6693 width=174)
                  Workers Planned: 3
                  ->  Nested Loop  (cost=165.97..11650052.06 rows=2159 width=174)
                        ->  Parallel Bitmap Heap Scan on exam_data exam  (cost=165.26..56215.92 rows=295 width=130)
                              Recheck Cond: (system_id = '60743ABCDC'::text)
                              Filter: ((filtered_flag = 'N'::text) AND (exam_enddatetime_ts >= (CURRENT_DATE - 90)))
                              ->  Bitmap Index Scan on exam_system_id  (cost=0.00..165.03 rows=14462 width=0)
                                    Index Cond: (system_id = '60743ABCDC'::text)
                        ->  Index Scan using series_data_exam_id on series_data series  (cost=0.70..39289.00 rows=1214 width=75)
                              Index Cond: (exam_id = exam.exam_id)
                              Filter: ((filtered_flag = 'N'::text) AND (exam.mdp_source_connector_ref_id = mdp_source_connector_ref_id))
            ->  Materialize  (cost=0.00..17.65 rows=510 width=36)
                  ->  Seq Scan on mdp_source_connector_ref reff  (cost=0.00..15.10 rows=510 width=36)
    

    Below is the query used

    > SELECT exam.system_id, reff.modality, exam.exam_number,
    > exam.exam_duration, exam.exam_startdate,     exam.operator,
    > exam.study_type, exam.load_time AS lastupdatedtime, exam.exam_id,
    >     exam.exam_startdatetime_ts, exam.exam_enddatetime_ts, exam.referring_physician,
    >     exam.contrast_used, series.series_startdatetime_ts, series.series_enddatetime_ts,
    >     series.series_number, series.series_modality, series.series_description, series.series_duration    FROM
    > device_utilization.exam_data exam
    >      JOIN device_utilization.mdp_source_connector_ref reff ON exam.mdp_source_connector_ref_id = reff.mdp_source_connector_ref_id
    >      JOIN device_utilization.series_data series ON exam.exam_id = series.exam_id AND exam.mdp_source_connector_ref_id =
    > series.mdp_source_connector_ref_id
    >      WHERE exam.system_id = '63046ABCDA'  and exam_enddatetime_ts >=current_date -360     and exam.filtered_flag = 'N'::text AND series.filtered_flag = 'N'::text
    

    Couple of optimizations planned out are

    1. We are including partitioning on the exam_enddatetime_ts column. Not sure if this will have any impact on the type of indexing used.
    2. We are looking at increase the instance size from 2 cores to 4 cores.

    Looking for the input on why does the index switches from index scan to parallel seq scan. Also any other input how can I optimize the query response time.

    Thanks, Prasanna

    UPDATES on further questions

    The output of EXPLAIN (ANALYZE, BUFFERS) is included below for the query which pulled 180 days of data.

    Limit  (cost=1358.99..784417.12 rows=1000 width=198) (actual time=50.950..194.047 rows=1000 loops=1)
      Buffers: shared hit=5532 read=1576
      I/O Timings: read=677.975
      ->  Nested Loop  (cost=1358.99..14066649.07 rows=17962 width=198) (actual time=50.948..193.912 rows=1000 loops=1)
            Join Filter: (exam.mdp_source_connector_ref_id = reff.mdp_source_connector_ref_id)
            Rows Removed by Join Filter: 7995
            Buffers: shared hit=5532 read=1576
            I/O Timings: read=677.975
            ->  Gather  (cost=1358.99..14012746.10 rows=7044 width=174) (actual time=50.923..191.700 rows=1000 loops=1)
                  Workers Planned: 3
                  Workers Launched: 3
                  Buffers: shared hit=5531 read=1576
                  I/O Timings: read=677.975
                  ->  Nested Loop  (cost=358.99..14011041.70 rows=2272 width=174) (actual time=43.362..182.427 rows=251 loops=4)
                        Buffers: shared hit=5531 read=1576
                        I/O Timings: read=677.975
                        ->  Parallel Bitmap Heap Scan on exam_data exam  (cost=165.26..56215.92 rows=286 width=130) (actual time=41.051..149.211 rows=237 loops=4)
                              Recheck Cond: (system_id = '607431D656B'::text)
                              Filter: ((filtered_flag = 'N'::text) AND (exam_enddatetime_ts >= (CURRENT_DATE - 90)))
                              Rows Removed by Filter: 3069
                              Heap Blocks: exact=371
                              Buffers: shared hit=102 read=1310
                              I/O Timings: read=571.800
                              ->  Bitmap Index Scan on exam_system_id  (cost=0.00..165.03 rows=14462 width=0) (actual time=4.123..4.124 rows=13223 loops=1)
                                    Index Cond: (system_id = '607431D656B'::text)
                                    Buffers: shared read=15
                                    I/O Timings: read=3.183
                        ->  Bitmap Heap Scan on series_data series  (cost=193.74..48780.83 rows=1227 width=75) (actual time=0.133..0.138 rows=1 loops=947)
                              Recheck Cond: (exam_id = exam.exam_id)
                              Filter: ((filtered_flag = 'N'::text) AND (exam.mdp_source_connector_ref_id = mdp_source_connector_ref_id))
                              Rows Removed by Filter: 0
                              Heap Blocks: exact=251
                              Buffers: shared hit=5429 read=266
                              I/O Timings: read=106.175
                              ->  Bitmap Index Scan on series_data_exam_id  (cost=0.00..193.43 rows=12426 width=0) (actual time=0.037..0.037 rows=1 loops=947)
                                    Index Cond: (exam_id = exam.exam_id)
                                    Buffers: shared hit=4684 read=54
                                    I/O Timings: read=19.526
            ->  Materialize  (cost=0.00..17.65 rows=510 width=36) (actual time=0.000..0.001 rows=9 loops=1000)
                  Buffers: shared hit=1
                  ->  Seq Scan on mdp_source_connector_ref reff  (cost=0.00..15.10 rows=510 width=36) (actual time=0.012..0.014 rows=9 loops=1)
                        Buffers: shared hit=1
    Planning:
      Buffers: shared hit=344 read=99 dirtied=1
      I/O Timings: read=49.324
    Planning Time: 53.288 ms
    Execution Time: 194.222 ms
    


  • PostgreSQL switches to a sequential scan because it thinks that the execution time will be shorter that way. If it isn't, it may be a configuration problem:

    • Perhaps you told PostgreSQL that index scans are expensive by leaving random_page_cost at its default value 4, which may be appropriate for spinning disks, but not for anything else.

    • Perhaps index scans are fast because the index is cached. If you increase effective_cache_size, PostgreSQL will know that the index is probably cached and will lower the cost of index scans.

    • Perhaps you specified a degree of parallelism that exceeds the capacities of your hardware, so the parallel table scan is slower than PostgreSQL thinks. Consider lowering max_parallel_workers_per_gather.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2