Agregation group by date or group, pair index, MySQL, slow response



  • I would like to get an advice from you about manipulating lot of data using SUM and GROUP BY two columns.

    I have a table with 60M rows:

    mysql> SELECT COUNT(*) 
        -> FROM adgroup_ad_diff;
    +----------+
    | COUNT(*) |
    +----------+
    | 59746727 |
    +----------+
    1 row in set (1.34 sec)
    

    mysql> DESCRIBE adgroup_ad_diff;
    +--------------------+---------------+------+-----+---------+----------------+
    | Field | Type | Null | Key | Default | Extra |
    +--------------------+---------------+------+-----+---------+----------------+
    | id | int | NO | PRI | NULL | auto_increment |
    | ad_group_name_id | int | YES | MUL | NULL | |
    | date_to | datetime | NO | MUL | NULL | |
    | cost | decimal(10,2) | NO | | NULL | |
    ... more columns
    +--------------------+---------------+------+-----+---------+----------------+
    10 rows in set (0.00 sec)

    mysql> SHOW INDEX FROM adgroup_ad_diff;
    +-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
    | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
    +-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
    | adgroup_ad_diff | 0 | PRIMARY | 1 | id | A | 58985640 | NULL | NULL | | BTREE | | | YES | NULL |
    | adgroup_ad_diff | 0 | search_idx | 1 | date_to | A | 116898 | NULL | NULL | | BTREE | | | YES | NULL |
    | adgroup_ad_diff | 0 | search_idx | 2 | ad_group_name_id | A | 58983692 | NULL | NULL | YES | BTREE | | | YES | NULL |
    | adgroup_ad_diff | 1 | IDX_CDAC9A9E3A2C86F0 | 1 | ad_group_name_id | A | 63372 | NULL | NULL | YES | BTREE | | | YES | NULL |
    +-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

    That table contains sum of costs for each groups. There are over 20k groups. Each date_to is dividable by 15minutes, so new rows are created every 15 minutes, like:

    • 2022-01-01 10:00
    • 2022-01-01 10:15
    • 2022-01-01 10:30 etc.

    And I run simple aggregation queries on that table only:

    SELECT SQL_NO_CACHE
           ad_group_name_id,
           SUM(cost) AS cost
                FROM adgroup_ad_diff
                WHERE date_to >= '2021-05-20 10:01'
                AND date_to 

    and

    SELECT SQL_NO_CACHE
           date_to,
           SUM(cost) AS cost
                FROM adgroup_ad_diff
                WHERE date_to >= '2021-05-20 10:01'
                AND date_to 

    The list of IDS is the list that varies, from 30 ids to 15k ids, depends on selection of the group.

    Usually, date_to range I'm using in WHERE condition is not big, like 1-10 days, so my idea was to create an index on pair date_to and ad_group_name_id, so MySQL engine can filter easily rows by date and then group them by date_to or ad_group_name_id.

    But it does not work, it's very slow, the single query takes about 40 seconds. No matter what are date_to set to, if we take a day or a month or a year. Always about 40 seconds.

    I wonder what can I do to improve that time. So for example query:

    SELECT SQL_NO_CACHE
                date_to,
                SUM(cost) AS cost
    
            FROM adgroup_ad_diff
            WHERE date_to >= '2022-05-20 10:01'
            AND date_to <= '2022-05-30 23:45' 
            GROUP BY date_to;
    

    Could be executed faster than 40 seconds. New index? Partition? I will be really greateful for any tips.

    Btw. in all queries I have SQL_NO_CACHE as I without it the results were not repeatable when I was testing response time.


  • QA Engineer

    That's one month's worth of data? Are you having any trouble with 20K inserts every 15 minutes? (I would expect not.)

    The table can be shrunk:

    • ad_group_name_id probably does not need to be a 4-byte INT; for 20K groups, SMALLINT UNSIGNED would take only 2 bytes (but overflow at 65K).
    • id (4 bytes) is unnecessary. If you remove it, change the PRIMARY KEY as noted below.
    • DECIMAL(8,2) takes 4 bytes. (6,2) would take 3 bytes (but top out at 9999.99).

    I recommend this pair of indexes to handle all 3 of your queries:

    PRIMARY KEY(date_to, ad_group_name_id),
    INDEX(ad_group_name_id, date_to, cost)
    

    A Summary table would work nicely for queries that cover whole days - midnight up to midnight. It would have 1 row per day per group, plus the SUM(cost). A simple aggregation at the end of each day would add the 20K new rows.

    Then use the Summary table instead of the main table for all 3 of your queries, thereby running upwards of 96 times as fast. (It probably won't be more than 10 times as fast.)

    It would have two indexes. (Note that day_to would be of type DATE.)

    PRIMARY KEY(day_to, ad_group_name_id),
    INDEX(ad_group_name_id, day_to, sum_cost)
    

    If you do add the Summary table, you could consider dropping the secondary index from the main table unless it is needed for other lookups based primarily on ad_group_name_id.



Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2