Autocomplete using Postgres Optimization
I'm trying to build an autocomplete feature using Postgres.
Here are the steps I followed:
- Since the autocomplete spans two fields from different tables, I created a materialized view that merges the two fields I'm searching on.
- Created the pg_trgm extension.
- Created a GIN index on the new merged_fields column (this is the name of the merged field).
- Queried for particular products like "ball" using the similarity operator %.
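In outline, the setup looks roughly like this (the source tables and column names here are illustrative, not my real schema):

```sql
-- Enable trigram support (assumes sufficient privileges)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Merge the two searchable fields into one materialized view
-- (products/categories are placeholder table names)
CREATE MATERIALIZED VIEW my_schema.mvw_autocomplete AS
SELECT p.id,
       p.name || ' ' || c.description AS merged_fields
FROM my_schema.products p
JOIN my_schema.categories c ON c.id = p.category_id;

-- Trigram GIN index on the merged column
CREATE INDEX ON my_schema.mvw_autocomplete
    USING gin (merged_fields gin_trgm_ops);

-- Similarity search with the % operator
SELECT * FROM my_schema.mvw_autocomplete
WHERE merged_fields % 'ball';
```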
The materialized view had a total of 3 million rows.
Here are the issues I encountered:
- The queries are slow; sometimes the planner chooses a seq. scan, because a word like "ball" is so common it's present in almost 80% of the rows.
- I can't decide on a value for pg_trgm.similarity_threshold. I initially left it at 0.3, but noticed that some records around 200 characters long were not treated as probable hits when I searched for "bat" (even though there aren't that many entries for that word).
- As the number of search words increases, the query time also increases. I tested this by giving one genuine term that exists in the db, while the rest of the words weren't there.
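For reference, the raw scores can be inspected directly with the similarity() function, which makes it easier to see why a given row falls under the threshold:

```sql
-- Show actual similarity scores for a term instead of just pass/fail
SELECT merged_fields,
       similarity(merged_fields, 'bat') AS score
FROM my_schema.mvw_autocomplete
ORDER BY score DESC
LIMIT 20;

-- Current threshold used by the % operator
SHOW pg_trgm.similarity_threshold;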
Here's the query I'm using:
create index on my_schema.mvw_autocomplete using gin (merged_fields gin_trgm_ops);
explain analyze select * from my_schema.mvw_autocomplete where merged_fields % 'bat arkham knight';
This query took 9 seconds to run, though it did use the index according to the query plan.
Is there anything I can do to improve the runtime and fix the other issues? Also, in the EXPLAIN ANALYZE output there's no mention of parallel workers for this query, whereas other, faster queries do make use of parallel workers.
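For completeness, these are the settings I understand control whether the planner even considers parallel workers:

```sql
-- Settings that influence the planner's use of parallel workers
SHOW max_parallel_workers_per_gather;
SHOW min_parallel_table_scan_size;
SHOW parallel_setup_cost;
```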
If anything is missing please let me know.
Autocompletion is usually for completing a prefix, not for proposing things kind of similar to an entire phrase. That is a different use case and is usually handled differently (e.g. waiting for the person to submit, rather than searching as they type).
And both of those are usually for words and short phrases, neither makes much sense for 200 character phrases.
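For plain prefix completion, a left-anchored LIKE is usually enough; a sketch, reusing the view from the question (a trigram index can support this, or a btree with text_pattern_ops):

```sql
-- Prefix autocomplete: left-anchored LIKE can use the existing
-- gin_trgm_ops index (longer prefixes filter better)
SELECT merged_fields
FROM my_schema.mvw_autocomplete
WHERE merged_fields LIKE 'bal%'
ORDER BY merged_fields
LIMIT 10;
```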
And you must be doing something quite weird if 'ball' is in 80% of your rows. Maybe you should just declare that a stop word (or sequence of characters) and remove it from the mat view.
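Stripping such a token at view-build time could look like this (the regexp and source table are illustrative; \m and \M are Postgres word-boundary escapes):

```sql
-- Remove the over-common token when (re)building the mat view
CREATE MATERIALIZED VIEW my_schema.mvw_autocomplete AS
SELECT id,
       regexp_replace(merged_fields, '\mball\M', '', 'gi') AS merged_fields
FROM my_schema.source_data;  -- placeholder source
```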
> noticed that some records that have 200 chars length were not treated as probable hit when I looked for "bat"
Sure. Unless you have one word repeated 50 times, how similar can a 200-char phrase really be to a 3-letter word? The problem is not the setting of similarity_threshold; it is that you are just plain using the wrong tool. Maybe you should be dividing your phrases into smaller chunks, or using FTS, or using %> or %>>, or using
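A sketch of the word-similarity variant, which scores the query against the best-matching words inside the phrase rather than the whole 200-character string (the existing gin_trgm_ops index supports the <% operator with the indexed column on the right):

```sql
-- <% matches against the closest extent of words in the phrase,
-- so a short query like 'bat' can still hit a long merged_fields
SELECT merged_fields,
       word_similarity('bat', merged_fields) AS sml
FROM my_schema.mvw_autocomplete
WHERE 'bat' <% merged_fields
ORDER BY sml DESC
LIMIT 10;
```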