How to incrementally optimize a `COUNT(*) ... GROUP BY` query in PostgreSQL?



  • I have a PostgreSQL table with a few million rows (timestamp, event, distinct_id, properties) where I want to

    • filter based on event (low cardinality) and timestamp,
    • and group by date_trunc('week', timestamp) and distinct_id (high cardinality)

    I have already squeezed good performance out of my COUNT(*) ... GROUP BY week queries, specifically this one:

    SELECT
      date_trunc('week', timestamp) AS "timestamp"
      , count(*) AS "count"
    FROM telemetry_events
    WHERE (
      (event = 'view load' OR event = 'view:loaded')
      AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
    )
    GROUP BY
      date_trunc('week', timestamp)
    ORDER BY
      date_trunc('week', timestamp) ASC
    ;
    

    by creating two indexes:

    CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");
    CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");
    

    The current execution time is under 7 seconds, which I find excellent (at least for now). Here is the plan (also at https://explain.depesz.com/s/dTOL):

    GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
      Group Key: (date_trunc('week'::text, "timestamp"))
      ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
            Sort Key: (date_trunc('week'::text, "timestamp"))
            Sort Method: quicksort  Memory: 8189kB
            ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
                  Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                  Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                  Rows Removed by Filter: 639794
                  Heap Fetches: 0
    Planning Time: 1.873 ms
    Execution Time: 6989.778 ms
    

    However, if I add my high cardinality distinct_id to the GROUP BY, the queries are much slower (45 to 70 seconds):

    SELECT
      date_trunc('week', timestamp) AS "timestamp"
      , distinct_id
      , count(*) AS "count"
    FROM telemetry_events
    WHERE (
      (event = 'view load' OR event = 'view:loaded')
      AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
    )
    GROUP BY (
      date_trunc('week', timestamp)
      , distinct_id
    )
    ORDER BY
      date_trunc('week', timestamp) ASC
    ;
    

    This is the EXPLAIN ANALYZE:

    Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
      Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
      ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
                  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
                  ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                        Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                        Sort Method: quicksort  Memory: 4520kB
                        Worker 0:  Sort Method: quicksort  Memory: 4507kB
                        Worker 1:  Sort Method: quicksort  Memory: 4679kB
                        ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                              Rows Removed by Filter: 213265
    Planning Time: 1.942 ms
    Execution Time: 72029.136 ms
    

    What would be the best way to try to optimize these queries?

    And some sub-questions: is my expectation of keeping this query under ~10 seconds reasonable? How fast can I make PostgreSQL for this kind of OLAP workload?



  • Why do you have (event = 'view load' OR event = 'view:loaded')? Do those actually mean two different things, or do you just have dirty data with two different spellings for the same meaning?

    Your first query is just using the index as a skinny version of the table. Note that the Index Cond doesn't include the leading column of the index, which for a btree index means it is scanning the whole index and applying the other column as an "in-index filter". This can still be faster than scanning the table, because the index may be much smaller than the table.
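
    As a quick sanity check on that last point, you can compare the relation sizes yourself. This is only an illustrative query using the object names from the question; adjust it to your own database:

    -- Compare the table's on-disk size to that of the (event, "timestamp") index;
    -- scanning the whole index only pays off when it is much smaller than the heap.
    SELECT
      pg_size_pretty(pg_relation_size('public.telemetry_events')) AS table_size
      , pg_size_pretty(pg_relation_size('public.telemetry_events_event_timestamp_idx')) AS index_size
    ;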

    In your second query this doesn't work well, because one necessary column (distinct_id) is not in either index, so it needs to visit the table anyway. A good index for the 2nd query would be ("timestamp", event, distinct_id). This index should also serve the first query, even better than the current index.
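
    A sketch of that index, using the table and column names from the question (the index name is just a placeholder, and whether you keep or drop the existing indexes depends on your other queries):

    -- "timestamp" leads, so the range condition can be used as the index condition;
    -- event and distinct_id are also stored in the index, so both the filter and the
    -- grouping columns are available without visiting the heap, allowing an
    -- index-only scan (assuming the visibility map is reasonably up to date).
    CREATE INDEX telemetry_events_timestamp_event_distinct_id_idx
      ON public.telemetry_events USING btree ("timestamp", event, distinct_id);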

    But better yet would probably be cleaning your data so you don't need the OR, or creating a partial (filtered) index whose predicate matches the OR condition.
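
    Two sketches of those options, with placeholder index names and the assumption that the two event spellings really do mean the same thing:

    -- Option A: normalize the data so the OR disappears from the query entirely
    -- (only safe if 'view load' and 'view:loaded' are truly the same event).
    UPDATE public.telemetry_events
    SET event = 'view:loaded'
    WHERE event = 'view load';

    -- Option B: a partial index covering only the rows these queries care about.
    -- The query's WHERE clause must be recognized as implying the index predicate;
    -- keeping the two textually identical is the safest way to ensure that.
    CREATE INDEX telemetry_events_view_loaded_partial_idx
      ON public.telemetry_events USING btree ("timestamp", distinct_id)
      WHERE event = 'view load' OR event = 'view:loaded';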


