How to optimized self-join view of 50 millions rows table



  • I've the following simplified table with ~50M rows.

    table sample (
        id uuid not null primary key,
        measured_date timestamp with time zone not null,
        segment_id uuid not null,
        activity_id uuid not null,
        value integer not null
    );
    Indexes:
        "sample_pkey" PRIMARY KEY, btree (id)
        "sample_idx" btree (segment_id, measured_date)
        "sample_uniq" UNIQUE CONSTRAINT, btree (segment_id, activity_id, measured_date)
        "sample_activity_idx" btree (activity_id)
    

    I would like to get for each measured gap (between two different dates) my calculated value.

    My query is as follow:

    SELECT ROW_NUMBER () OVER () AS id,
            t1.segment_id AS segment_id,
            t1.activity_id AS activity_id,
            t1.measured_date AS from_date,
            t2.measured_date AS to_date,
            t2.value AS cumulative_progress,
            (t2.value - t1.value) AS marginal_progress,
        FROM sample AS t1 JOIN sample AS t2
        ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id and t1.measured_date < t2.measured_date
        WHERE t1.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND t1.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;
    

    And the results are quite fast

    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |QUERY PLAN                                                                                                                                                              |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |Sort  (cost=17.23..17.23 rows=1 width=154) (actual time=1.011..1.028 rows=171 loops=1)                                                                                  |
    |  Sort Key: t1.measured_date, t2.measured_date                                                                                                                          |
    |  Sort Method: quicksort  Memory: 70kB                                                                                                                                  |
    |  ->  WindowAgg  (cost=1.13..17.22 rows=1 width=154) (actual time=0.063..0.914 rows=171 loops=1)                                                                        |
    |        ->  Nested Loop  (cost=1.13..17.18 rows=1 width=124) (actual time=0.056..0.698 rows=171 loops=1)                                                                |
    |              Join Filter: (t1.measured_date < t2.measured_date)                                                                                                        |
    |              Rows Removed by Join Filter: 190                                                                                                                          |
    |              ->  Index Scan using sample_uniq on sample t1  (cost=0.56..8.58 rows=1 width=70) (actual time=0.021..0.043 rows=19 loops=1)                               |
    |                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
    |              ->  Index Scan using sample_uniq on sample t2  (cost=0.56..8.58 rows=1 width=86) (actual time=0.005..0.030 rows=19 loops=19)                              |
    |                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
    |Planning Time: 0.321 ms                                                                                                                                                 |
    |Execution Time: 1.097 ms                                                                                                                                                |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    However, when I create a view to use this query often the results are poor.

    Creating the view:

    CREATE OR REPLACE VIEW sample_marginal AS
            SELECT ROW_NUMBER () OVER () AS id,
            t1.segment_id AS segment_id,
            t1.activity_id AS activity_id,
            t1.measured_date AS from_date,
            t2.measured_date AS to_date,
            t2.value AS cumulative_progress,
            (t2.value - t1.value) AS marginal_progress,
        FROM sample AS t1 JOIN sample AS t2
        ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id and t1.measured_date < t2.measured_date;
    

    Querying the view:

    SELECT * FROM sample_marginal WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;
    

    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |QUERY PLAN |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |Subquery Scan on sample_marginal (cost=14106816.44..23333603.42 rows=2 width=154) |
    | Filter: ((sample_marginal.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (sample_marginal.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid)) |
    | -> WindowAgg (cost=14106816.44..22564147.32 rows=51297073 width=154) |
    | -> Gather (cost=14106816.44..20768749.77 rows=51297073 width=124) |
    | Workers Planned: 2 |
    | -> Merge Join (cost=14105816.44..15638042.47 rows=21373780 width=124) |
    | Merge Cond: ((t2.activity_id = t1.activity_id) AND (t2.segment_id = t1.segment_id)) |
    | Join Filter: (t1.updated_by_date < t2.updated_by_date) |
    | -> Sort (cost=4751689.83..4797948.93 rows=18503642 width=86) |
    | Sort Key: t2.activity_id, t2.segment_id |
    | -> Parallel Seq Scan on sample t2 (cost=0.00..1632749.42 rows=18503642 width=86) |
    | -> Materialize (cost=9354126.62..9576170.32 rows=44408740 width=70) |
    | -> Sort (cost=9354126.62..9465148.47 rows=44408740 width=70) |
    | Sort Key: t1.activity_id, t1.segment_id |
    | -> Seq Scan on sample t1 (cost=0.00..1891800.40 rows=44408740 width=70) |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

    I actually never waited for this query to finish since it took extremely long time, and I thought of fixing it before I can use it.

    Running count(*) however, resulted in 1.8B rows (execution was slow of course).

    Any idea on how to improve my view?



  • The parts of an SQL statement are processed in a certain order (at least logically). In the case at hand, the WHERE condition is evaluated before a window function like row_number(). Your view definition and the query on the view reverse that order by effectively introducing a subquery:

    SELECT *
    FROM (SELECT ROW_NUMBER () OVER () AS id,
                 t1.segment_id AS segment_id,
                 t1.activity_id AS activity_id,
                 t1.measured_date AS from_date,
                 t2.measured_date AS to_date,
                 t2.value AS cumulative_progress,
                 (t2.value - t1.value) AS marginal_progress,
          FROM sample AS t1
             JOIN sample AS t2
                ON t1.activity_id = t2.activity_id
                   AND t1.segment_id = t2.segment_id
                   and t1.measured_date < t2.measured_date
         ) AS sample_marginal
    WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
      AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
    ORDER BY from_date asc, to_date asc;
    

    There is nothing that the optimizer can do about that.

    If you remove the window function call (that seems pretty pointless anyway) from the view definition, performance should improve.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2