LIKE query with leading wildcards in large MySQL production database
According to insights in Amazon RDS the following query in a large MySQL production database is causing high loads ( ~ 50.000.000 entries):
SELECT * FROM entities WHERE status='ready' AND user_id='81663729' AND (primary_name LIKE '%mysearch%' OR additional_names LIKE '%mysearch%') ORDER BY id DESC LIMIT 0, 100000
Columns responsible for it:
common_name: VARCHAR(255) additional_names: VARCHAR(2000) (Note: This is unfortunately a comma-seperated string of zero to hundreds of names per entry, which is horrible database design)
Not a big surprise that this query is performing bad: Leading wildcards make indices more or less unusable (
additional_namesis not). Some users have a whole lot of entries in this table (currently up to ~ 2.000.000 entries per user, numbers of entries growing) - so the user_id in the query will not rescue its performance.
My question is what might be the best way to solve this problem.
Approach 1: MySQL FULLTEXT index
Create a fulltext index:
CREATE FULLTEXT INDEX domain ON entities (primary_name,additional_names);
SELECT * FROM entities WHERE status='ready' AND user_id='81663729' AND MATCH (primary_name,additional_names) AGAINST ('mysearch') IN NATURAL LANGUAGE MODE ORDER BY id DESC LIMIT 0, 100000;
Pro: Easiest to migrate and implement?
Contra: Not sure if this the requirements? Creating that index might take a huge amount of time.
Approach 2: MySQL JSON
Step 1: Add an additional column with JSON
ALTER TABLE entities ADD additional_names_j JSON AFTER additional_names;
Step 2: All new entities are inserted into the new JSON column as json as well as into the old column.
Step 3: A script which might run a few days makes sure that all old entities have their entries in the JSON column.
Step 4: Application is adjusted to use the JSON column.
Step 5: Delete the old additional_names column.
SELECT * FROM entities WHERE status='ready' AND user_id='81663729' AND 'mysearch' member of ORDER BY id DESC LIMIT 0, 100000;
Pro: Searching the JSON would work much better and it would be possible to extract the names without comma exploding and such stuff.
Contra: Not sure if partial matching for JSON columns is even possible? Furthermore the effort is quite high.
Approach 3: Create a relations table and a names table
This would basically mean that we have a table with id and name and a m:n relationship table.
Pro: Would be nice and clean design.
Contra: Would have to exist and be populated in parallel to the running application, and in this case it would cause enormous changes in the application logic.
My question to the experts is: Which of those approaches would solve the situation best - and why?
Note: The table is growing by ~ 5 entries per second, downtime is not allowed. I am open to any other approach to make this query faster, which I am not aware of at the moment!
Approach 1: A fulltext index can't do everything a
LIKEcondition can do. The most common problem I see preventing use of fulltext is that MySQL's implementation only indexes whole words, so if your search is for a partial word it won't help.
It will take time to create the index, but if it's important to avoid downtime, you should definitely get acquainted with using https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html or https://github.com/github/gh-ost . These are both free tools that allow schema changes (including adding an index) without blocking clients from using the table in the meantime. We used pt-osc at my last job to run hundreds of schema changes per week during peak hours.
Approach 2: Adding a JSON column alone would not help, without also creating an index for the expression you want to search for. Have you used JSON in MySQL before? You should do a proof of concept with the version of MySQL you're using to see if it works, before you decide on this. There are a lot of limitations and caveats with using JSON in MySQL. I usually recommend not to use JSON.
Approach 3: This is also called an inverted index. It can work well, but as you said it requires some work to maintain the inverted index, adding values as you add data. A comment above suggests using a trigger, which does work but think that also counts as adding more code.
Which of those approaches would solve the situation best?
I'd prefer the fulltext index if it meets the needs of the searches you need to support. The advantage is that it takes the least code to maintain the index, and that's a pretty big advantage.
But every time I hear the question "which is best?" I know that it's the wrong question. There is no "best" solution for all circumstances. Your project has its own requirements, and you need to choose the solution that fits those requirements. Any solution might be best for one person's application but it won't work for another application.