To inquire about the query complexity of the following query in SQL



  • I would like to inquire about the complexity of the following example query in SQL:

    SELECT id
    from DB
    where Country = "India"
        AND Size = "Large"
    

    That is, taking the AND of rows matching the filters for any two columns out of m>>2 columns. I want to return the first column which contain the ids.

    Can it be done in time sublinear in the number of rows?

    Thank you.



  • taking the AND of rows matching the filters for two columns

    Can it be done in time sublinear in the number of rows?

    By sublinear I take it you mean faster than O(n) time. The answer is yes but under a certain condition. That condition is you must have a https://blog.toadworld.com/2017/04/06/speed-up-your-queries-using-the-covering-index-in-mysql for your query. Because an index is normally stored in a B-Tree logical data structure, the data is already sorted in such a way that makes seeking against that index for specific values on the order of O(log(n)) search time which is much faster than linear O(n) since not all of the rows will need to be searched.

    For the predicate where Country = "India" AND Size = "Large" that is easy to accomplish. Your index definition would essentially just need to include both columns (Country, Size).

    But things get a little more complicated when your query also uses SELECT * (which is an anti-pattern). If there are other columns besides Country and Size in the table, now the previously mentioned index definition is no longer covering. I'm not an expert on all of the operations MySQL does under the hood to serve a query, but there is a possibility you'll end up with an index scan now which would then be back to O(n) linear search time.

    If your table had a third column (and only 3 columns total for this example) called Column3, and you only needed to return the two columns Country and Size, then you could do two things to fix the aforementioned anti-pattern that will fix the performance to be an index seek again:

    1. Change the SELECT * part of your query to SELECT Country, Size.
    2. Change the index definition to also include Column3 on the end so that it becomes a covering index again: (Country, Size, Column3)

    Either of those options should fix the issue and make your query able to run in O(log(n)) time again.

    Now of course if your table has many more columns in it, then the second option becomes less and less viable (because indexes require additional storage space and affect write performance dependent on the amount of data needed to be written to them), which goes back to the reasoning SELECT * is an anti-pattern and you should only select the columns you need to actually return.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2