Agregation group by date or group, pair index, MySQL, slow response
-
I would like to get an advice from you about manipulating lot of data using SUM and GROUP BY two columns.
I have a table with 60M rows:
mysql> SELECT COUNT(*) -> FROM adgroup_ad_diff; +----------+ | COUNT(*) | +----------+ | 59746727 | +----------+ 1 row in set (1.34 sec)
mysql> DESCRIBE adgroup_ad_diff;
+--------------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| ad_group_name_id | int | YES | MUL | NULL | |
| date_to | datetime | NO | MUL | NULL | |
| cost | decimal(10,2) | NO | | NULL | |
... more columns
+--------------------+---------------+------+-----+---------+----------------+
10 rows in set (0.00 sec)mysql> SHOW INDEX FROM adgroup_ad_diff;
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| adgroup_ad_diff | 0 | PRIMARY | 1 | id | A | 58985640 | NULL | NULL | | BTREE | | | YES | NULL |
| adgroup_ad_diff | 0 | search_idx | 1 | date_to | A | 116898 | NULL | NULL | | BTREE | | | YES | NULL |
| adgroup_ad_diff | 0 | search_idx | 2 | ad_group_name_id | A | 58983692 | NULL | NULL | YES | BTREE | | | YES | NULL |
| adgroup_ad_diff | 1 | IDX_CDAC9A9E3A2C86F0 | 1 | ad_group_name_id | A | 63372 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+That table contains sum of costs for each groups. There are over 20k groups. Each date_to is dividable by 15minutes, so new rows are created every 15 minutes, like:
- 2022-01-01 10:00
- 2022-01-01 10:15
- 2022-01-01 10:30 etc.
And I run simple aggregation queries on that table only:
SELECT SQL_NO_CACHE ad_group_name_id, SUM(cost) AS cost FROM adgroup_ad_diff WHERE date_to >= '2021-05-20 10:01' AND date_to
and
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost FROM adgroup_ad_diff WHERE date_to >= '2021-05-20 10:01' AND date_to
The list of IDS is the list that varies, from 30 ids to 15k ids, depends on selection of the group.
Usually,
date_to
range I'm using in WHERE condition is not big, like 1-10 days, so my idea was to create an index on pairdate_to
andad_group_name_id
, so MySQL engine can filter easily rows by date and then group them bydate_to
orad_group_name_id
.But it does not work, it's very slow, the single query takes about 40 seconds. No matter what are
date_to
set to, if we take a day or a month or a year. Always about 40 seconds.I wonder what can I do to improve that time. So for example query:
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff WHERE date_to >= '2022-05-20 10:01' AND date_to <= '2022-05-30 23:45' GROUP BY date_to;
Could be executed faster than 40 seconds. New index? Partition? I will be really greateful for any tips.
Btw. in all queries I have
SQL_NO_CACHE
as I without it the results were not repeatable when I was testing response time.
-
That's one month's worth of data? Are you having any trouble with 20K inserts every 15 minutes? (I would expect not.)
The table can be shrunk:
ad_group_name_id
probably does not need to be a 4-byte INT; for 20K groups,SMALLINT UNSIGNED
would take only 2 bytes (but overflow at 65K).id
(4 bytes) is unnecessary. If you remove it, change thePRIMARY KEY
as noted below.DECIMAL(8,2)
takes 4 bytes.(6,2)
would take 3 bytes (but top out at 9999.99).
I recommend this pair of indexes to handle all 3 of your queries:
PRIMARY KEY(date_to, ad_group_name_id), INDEX(ad_group_name_id, date_to, cost)
A Summary table would work nicely for queries that cover whole days - midnight up to midnight. It would have 1 row per day per group, plus the
SUM(cost)
. A simple aggregation at the end of each day would add the 20K new rows.Then use the Summary table instead of the main table for all 3 of your queries, thereby running upwards of 96 times as fast. (It probably won't be more than 10 times as fast.)
It would have two indexes. (Note that
day_to
would be of typeDATE
.)PRIMARY KEY(day_to, ad_group_name_id), INDEX(ad_group_name_id, day_to, sum_cost)
If you do add the Summary table, you could consider dropping the secondary index from the main table unless it is needed for other lookups based primarily on
ad_group_name_id
.