MySQL Indexes on all column combinations



  • I have a table that is basically a log of records:

    CREATE TABLE records
    (
        id         varchar(36), -- uuid
        text       blob,
        user_id    bigint,
        cluster_id bigint,
        status_id  tinyint,
        session_id varchar(36), -- uuid
        type       tinyint,
        start_time timestamp,
        duration   int,
        PRIMARY KEY (user_id, start_time, id),
        KEY general_index (start_time, cluster_id, user_id, id),
        KEY endpoint_index (cluster_id, start_time, id)
    );
    

    The table is somewhat large:

    • 100 GB data size
    • 10 GB index size
    • 40 M records

    Right now the table is tuned to support filtering by user_id, cluster_id and start_time. But I need to filter (and sort) efficiently by almost every column, in different combinations.

    Adding more and more indexes representing search patterns (e.g. KEY search_pattern_index (start_time, duration, id)) doesn't seem like a good long-term solution.

    Is there a best practice or established approach for this in MySQL? (Unfortunately, using Elasticsearch or any other database is not an option.)

    The first thought I had was to have one PK index:

    PRIMARY KEY (start_time,id)

    and then many indexes that represent some search patterns like:

    KEY cluster_index (cluster_id, id),
    KEY user_index (user_id, id),
    KEY duration_index (duration, id),
    KEY status_index (status_id, id)

    When you need to query the data, just filter using the appropriate index(es):

    SELECT *
    FROM records t
    INNER JOIN (
        SELECT id FROM records WHERE duration > 10000 -- supported by duration_index
    ) t2 ON t.id = t2.id
    INNER JOIN (
        SELECT id FROM records WHERE user_id = 123 -- supported by user_index
    ) t3 ON t.id = t3.id
    WHERE t.start_time >= {x}
    ORDER BY [...]
    LIMIT 100
    

    But that turns out to be pretty slow. Even when each subquery uses the right index, the server still has to do a huge join back to t.
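
    One rewrite that sometimes avoids that huge join is a "deferred join": do the filtering, ordering, and LIMIT over a covering index first, then fetch the wide columns only for the winners. A minimal sketch, assuming an index such as INDEX(user_id, duration) (not present in the schema above) and a date literal standing in for {x}:

    -- InnoDB appends the PK (user_id, start_time, id) to the secondary
    -- index, so the inner query is satisfied entirely from the index;
    -- the wide text blob is fetched for only the 100 surviving rows.
    SELECT r.*
    FROM (
        SELECT user_id, start_time, id
        FROM records
        WHERE user_id = 123
          AND duration > 10000
          AND start_time >= '2024-01-01'  -- stand-in for {x}
        ORDER BY start_time
        LIMIT 100
    ) picked
    JOIN records r
      ON  r.user_id    = picked.user_id
      AND r.start_time = picked.start_time
      AND r.id         = picked.id
    ORDER BY r.start_time;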



  • Miscellaneous tips

    • Do not blindly add indexes. Often they will go unused for mysterious reasons.
    • Do list the important queries, then make indexes for them. Eliminate duplicates. (Give me the list, I'll design the indexes.)
    • It may be useful to rethink the PRIMARY KEY.
    • It is rarely useful to start a composite index with a date/datetime/timestamp. Usually the optimal indexes start with columns tested with =.
    • "filter (and sort) by almost every column with different combination" -- not practical. A first cut is to list many 2-column indexes, then tack on a 3rd column to some of them.
    • In InnoDB, the PK column(s) are implicitly tacked onto the end of each secondary index.
    • Do not normalize flags; the added JOINs are too costly. (I am guessing that status_id refers to a small number of possible values?)
    • Your example of having subqueries deliver ids is not likely to be efficient. It is similar to what the Optimizer does with "Index merge" -- which is rarely used because of how inefficient it is.
    • It looks like the text blob averages over 2KB? And it might actually be text, not a blob? If so, consider compressing/uncompressing in the client (see the sketch just after this list). That shrinks typical text by about 3:1, thereby cutting the overall data size roughly in half, and there would probably be a slight overall CPU improvement as well.
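
    A minimal sketch of the compression idea. The tip above recommends doing the zlib work in the client; MySQL's built-in COMPRESS()/UNCOMPRESS() are used here only to show the round trip without any client code:

    SET @long_text = REPEAT('some log payload text ', 200);

    -- Typical prose compresses ~3:1; this repetitive sample will do better.
    SELECT LENGTH(@long_text)           AS raw_bytes,
           LENGTH(COMPRESS(@long_text)) AS compressed_bytes;

    -- Round trip: UNCOMPRESS() returns binary, so convert back to a string.
    SELECT CONVERT(UNCOMPRESS(COMPRESS(@long_text)) USING utf8mb4) = @long_text AS intact;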

    One query

    You provided hints of one query:

     WHERE start_time >= {x}
       AND duration > 10000
       AND user_id = 123
    

    For that, I recommend two composite indexes, with the user_id first:

    INDEX(user_id, duration),
    INDEX(user_id, start_time)
    

    The Optimizer can [usually] decide which of these works better. Because of the "range" test, a 3rd column would add no benefit.

    ORDER BY

    The ORDER BY is important, too. And so is GROUP BY.

    If, for example, the above were followed by ORDER BY start_time, then the 2nd index would avoid a sort.
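
    A quick way to verify, assuming both indexes above exist; the date literal is a stand-in for {x}:

    -- If the Optimizer picks INDEX(user_id, start_time), rows arrive
    -- already in start_time order and EXPLAIN shows no "Using filesort".
    EXPLAIN
    SELECT *
    FROM records
    WHERE user_id = 123
      AND start_time >= '2024-01-01'  -- stand-in for {x}
      AND duration > 10000            -- checked per-row after the range scan
    ORDER BY start_time
    LIMIT 100;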

    More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

    PRIMARY KEY; UUIDs

    40 million rows? PRIMARY KEY (user_id, start_time, id)? Or is it PRIMARY KEY (start_time, id)? The 3-col version seems better, especially if a lot of queries include WHERE user_id=123 and fetch multiple rows.

    I'm surprised not to see UNIQUE(id) -- to make sure you don't have a dup UUID.
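
    Adding it is a one-liner (the key name uq_records_id is arbitrary), though on a 40M-row table the build will take a while:

    ALTER TABLE records ADD UNIQUE KEY uq_records_id (id);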

    You could shrink the data by about 1.6GB (plus some amount for the indexes) by converting the UUIDs to BINARY(16). For 5.7, see my old blog: http://mysql.rjweb.org/doc.php/uuid . MySQL 8.0 and MariaDB 10.7 have built in most of what that blog covers.
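
    A sketch of the 8.0 built-ins; the UUID literal is just an example value:

    -- UUID_TO_BIN(u, 1) packs 36 chars into 16 bytes and swaps the time
    -- fields of a version-1 UUID so values sort roughly chronologically,
    -- which clusters InnoDB inserts. BIN_TO_UUID(..., 1) reverses it.
    SET @u = '3f06af63-a93c-11e4-9797-00505690773f';
    SELECT UUID_TO_BIN(@u, 1)                 AS packed16,
           BIN_TO_UUID(UUID_TO_BIN(@u, 1), 1) AS round_trip;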

    PARTITIONing

    If you will be purging "old" data, then partitioning is much better than big deletes; see http://mysql.rjweb.org/doc.php/partitionmaint . It will not, however, help with the performance of most SELECTs. But your one example has two ranges: one would be handled (somewhat) by "partition pruning" on start_time; the other would benefit from a suitable index:

    PARTITION BY RANGE (TO_DAYS(start_time))...
    

    PRIMARY KEY(user_id, start_time, id),
    INDEX(user_id, duration)

    Then the query would look at some partitions, using that secondary index to efficiently filter by user_id and duration. That would build a temp table which would then be ORDERed and LIMITed.

    I call that use of partitioning a "2D index".
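
    A fuller sketch of that layout. Table, index, and partition names here are illustrative; note that MySQL only permits TO_DAYS() in a partitioning expression on a DATE or DATETIME column, so start_time is declared DATETIME (a TIMESTAMP column would need UNIX_TIMESTAMP() instead):

    CREATE TABLE records_part (
        id         varchar(36) NOT NULL,
        text       blob,
        user_id    bigint      NOT NULL,
        cluster_id bigint,
        status_id  tinyint,
        session_id varchar(36),
        type       tinyint,
        start_time datetime    NOT NULL,  -- DATETIME so TO_DAYS() is allowed
        duration   int,
        PRIMARY KEY (user_id, start_time, id),  -- must include the partition column
        KEY user_duration (user_id, duration)
    )
    PARTITION BY RANGE (TO_DAYS(start_time)) (
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pfuture  VALUES LESS THAN MAXVALUE
    );

    Purging a month then becomes ALTER TABLE ... DROP PARTITION, which is far cheaper than DELETEing millions of rows.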



