Compare occurence of chars in PostgreSQL string
I have some data like this:
metaphone lag FLKSW [null] PPS FLKSW PPS PPS PSP PPS
And I want to compare the string values in both columns on the folowing condition: they're similar (assign some value, like 1) if they share at least 2 chars. Otherwise, they're not similar.
So in the example, PPS and PSP would be similar.
How can this substring comparison be achieved?
I know one approach would be to extract substrings and manually compare them, but it feels hacky and I don't know the maximum number of chars that can occur.
they're similar ... if they share at least 2 chars.
Unfortunately, there is no built-in "intersect" operator or function for strings or https://www.postgresql.org/docs/current/functions-array.html . You can roll your own function to count overlapping characters:
CREATE FUNCTION f_count_overlapping_char(text, text) RETURNS int LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS $func$ SELECT count(*)::int FROM ( SELECT unnest(string_to_array($1, NULL)) INTERSECT ALL SELECT unnest(string_to_array($2, NULL)) ) sub; $func$;
https://www.postgresql.org/docs/current/queries-union.html includes duplicate matching characters. To fold duplicates, use just
Then your query can be:
SELECT *, f_count_overlapping_char(t1.metaphone, t2.metaphone) AS overlap FROM tbl t1 JOIN tbl t2 ON t1.id < t2.id AND f_count_overlapping_char(t1.metaphone, t2.metaphone) >= 2;
But it's expensive and does not scale well with more rows in the table - O(N²). Depending on your actual objective there are various superior alternatives - like trigram similarity provided by the additional module https://www.postgresql.org/docs/current/pgtrgm.html . See: