Fastest query to process data in small batches without repetition



  • I have java app that is using MySQL in the backend. I have the following table:

    A = int, B = varchar, C = timestamp

    A | B   | C
    1 | 100 | 2022-03-01 12:00:00
    2 | 200 | 2022-03-01 12:00:01
    3 | 100 | 2022-03-01 12:00:01
    4 | 200 | 2022-03-01 12:00:02
    5 | 600 | 2022-03-01 12:00:03
    1 | 100 | 2022-03-01 12:00:06
    5 | 700 | 2022-03-01 12:00:07
    2 | 200 | 2022-03-01 12:00:08
    9 | 100 | 2022-03-01 12:00:08
    

    On every X seconds, query should be run, and it should process 5 records where column C > LAST_PROCESSED_TIMESTAMP. This LAST_PROCESSED_TIMESTAMP is updated after each run.

    What I need is - I want to select these 5 rows, but not to include the rows if columns A and B are going to repeat in some fetches that are going to happen in the future.

    Example: for table above:

    First run - select 5

    1 | 100 | 2022-03-01 12:00:00

    2 | 200 | 2022-03-01 12:00:01

    3 | 100 | 2022-03-01 12:00:01
    4 | 200 | 2022-03-01 12:00:02
    5 | 600 | 2022-03-01 12:00:03
    

    Seconds run - select 5

    1 | 100 | 2022-03-01 12:00:06
    5 | 700 | 2022-03-01 12:00:07
    2 | 200 | 2022-03-01 12:00:08
    9 | 100 | 2022-03-01 12:00:08
    

    In first run, we do not select 1|100 and 2|200, since it will be selected in second run.

    I already have some solutions for this, but selects are taking way too much time. The database is also huge - so I'm trying to find the fastest way to execute this. Which indices should I have to optimize this? Which kind of query?

    What I tried:

    SELECT  *
        FROM ( SELECT  A,B
                FROM  TABLE
                WHERE  C >= '2022-03-01 12:00:00'
                LIMIT  5
             ) a
        LEFT JOIN (
            SELECT  A,B
                FROM  TABLE
                WHERE  C >= '2022-03-01 12:00:00'
                LIMIT  5, 18446744073709551615
                  ) b  ON (     a.A=b.A
                           AND  a.B=b.B
                          )
        WHERE  b.A IS NULL;
    

    and also (this one is probably NOT OK, since it will select MAX of C even if not in first 5, so for my example, it would include 2 | 200 | 2022-03-01 12:00:08 inside of the first run - not what I need):

    SELECT  A, B, MAX(C)
        FROM  TABLE
        WHERE  C >= '2022-03-01 12:00:00'
        GROUP BY  A, B ASC
        LIMIT  5;
    



  • These would help your first attempt:

    INDEX(C,A,B)
    INDEX(A,B,C)
    

    In order to figure out how to simplify or improve the query, start by tossing b. It adds nothing but complexity to the question. Think through the question with only a and c.

    More

    1. Pull 5 rows off the "input queue"
    2. Identify "dups" (perhaps with LEFT JOIN and IS NOT NULL, or with NOT EXISTS ( SELECT 1 ... )
    3. Store the non-dups (perhaps with INSERT ... in front of the SELECT in step 2)
    4. push the dups back into the input queue (DELETE from the queue?)`

    A "dup" is where A and B both match.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2