How to incrementally optimize a `COUNT(*) ... GROUP BY` query in PostgreSQL?
-
I have a PostgreSQL table with ~millions of rows (timestamp, event, distinct_id, properties) where I want to:

- filter based on `event` (low cardinality) and `timestamp`,
- and group by `date_trunc('week', timestamp)` and `distinct_id` (high cardinality).
I have already squeezed the performance of `COUNT(*) ... GROUP BY week` queries, specifically this one:

```sql
SELECT
    date_trunc('week', timestamp) AS "timestamp"
  , count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY date_trunc('week', timestamp)
ORDER BY date_trunc('week', timestamp) ASC;
```
by creating two indexes:

```sql
CREATE INDEX idx_timestamp ON public.telemetry_events USING btree ("timestamp");

CREATE INDEX telemetry_events_event_timestamp_idx ON public.telemetry_events USING btree (event, "timestamp");
```
The current execution time is under 7 seconds, which I find excellent (at least for now). Here is the plan (https://explain.depesz.com/s/dTOL):

```
GroupAggregate  (cost=83849.24..87478.16 rows=181418 width=16) (actual time=6971.824..6989.712 rows=21 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp"))
  ->  Sort  (cost=83849.24..84302.97 rows=181493 width=8) (actual time=6971.735..6978.683 rows=116425 loops=1)
        Sort Key: (date_trunc('week'::text, "timestamp"))
        Sort Method: quicksort  Memory: 8189kB
        ->  Index Only Scan using telemetry_events_event_timestamp_idx on telemetry_events  (cost=0.43..67996.23 rows=181493 width=8) (actual time=6507.580..6906.571 rows=116425 loops=1)
              Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
              Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
              Rows Removed by Filter: 639794
              Heap Fetches: 0
Planning Time: 1.873 ms
Execution Time: 6989.778 ms
```
However, if I add my high-cardinality `distinct_id` to the `GROUP BY`, the queries are much slower (45 to 70 seconds):

```sql
SELECT
    date_trunc('week', timestamp) AS "timestamp"
  , distinct_id
  , count(*) AS "count"
FROM telemetry_events
WHERE (
    (event = 'view load' OR event = 'view:loaded')
    AND timestamp >= timestamp with time zone '2022-01-02 00:00:00.000Z'
)
GROUP BY (date_trunc('week', timestamp), distinct_id)
ORDER BY date_trunc('week', timestamp) ASC;
```
This is the `EXPLAIN ANALYZE`:

```
Finalize GroupAggregate  (cost=88115.68..110676.54 rows=181418 width=52) (actual time=71956.298..72028.293 rows=1326 loops=1)
  Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
  ->  Gather Merge  (cost=88115.68..107274.48 rows=151244 width=52) (actual time=71956.254..72027.304 rows=2526 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=87115.65..88817.15 rows=75622 width=52) (actual time=71740.225..71758.172 rows=842 loops=3)
              Group Key: (date_trunc('week'::text, "timestamp")), distinct_id
              ->  Sort  (cost=87115.65..87304.71 rows=75622 width=44) (actual time=71736.479..71743.974 rows=38808 loops=3)
                    Sort Key: (date_trunc('week'::text, "timestamp")), distinct_id
                    Sort Method: quicksort  Memory: 4520kB
                    Worker 0:  Sort Method: quicksort  Memory: 4507kB
                    Worker 1:  Sort Method: quicksort  Memory: 4679kB
                    ->  Parallel Index Scan using idx_timestamp on telemetry_events  (cost=0.43..80987.81 rows=75622 width=44) (actual time=402.899..71576.001 rows=38808 loops=3)
                          Index Cond: ("timestamp" >= '2022-01-02 00:00:00+00'::timestamp with time zone)
                          Filter: (((event)::text = 'view load'::text) OR ((event)::text = 'view:loaded'::text))
                          Rows Removed by Filter: 213265
Planning Time: 1.942 ms
Execution Time: 72029.136 ms
```
What would be the best way to try to optimize these queries?
And some sub-questions: is my expectation of keeping this query under ~10 seconds reasonable? How fast can I make PostgreSQL for this kind of OLAP workload?
---
Why do you have `(event = 'view load' OR event = 'view:loaded')`? Do those actually mean two different things, or do you just have dirty data with two different spellings for the same meaning?

Your first query is just using the index as a skinny version of the table. Note that the `Index Cond` doesn't include the leading column of the index, which for a btree index means it is scanning the whole index and just applying the other column as an "in-index filter". This can be faster than scanning the table because the index might be much smaller than the table.

In your second query, this doesn't work well because one necessary column (`distinct_id`) is not in the index, so it needs to visit the table anyway. A good index for the second query would be `("timestamp", event, distinct_id)`.
This index should also work for the first query, even better than the current index.

But better yet would probably be cleaning your data so you don't need the OR, or making a filtered index which is filtered on the OR condition.
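A minimal sketch of such a filtered (partial) index, assuming the OR condition in your queries stays exactly as written (again, the index name is a placeholder):

```sql
-- Only rows matching the OR condition are stored in the index, so
-- event no longer needs to be an index column at all; the planner
-- can use this index for queries whose WHERE clause contains the
-- same OR condition.
CREATE INDEX telemetry_events_view_load_idx
    ON public.telemetry_events USING btree ("timestamp", distinct_id)
    WHERE (event = 'view load' OR event = 'view:loaded');
```

And if the two spellings turn out to be the same event recorded two different ways, a one-time cleanup along these lines (assuming 'view load' is the spelling you want to keep) would remove the need for the OR entirely:

```sql
UPDATE telemetry_events
SET event = 'view load'
WHERE event = 'view:loaded';
```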