Compare occurence of chars in PostgreSQL string



  • I have some data like this:

    metaphone lag
    FLKSW [null]
    PPS FLKSW
    PPS PPS
    PSP PPS

    And I want to compare the string values in both columns on the folowing condition: they're similar (assign some value, like 1) if they share at least 2 chars. Otherwise, they're not similar.

    So in the example, PPS and PSP would be similar.

    How can this substring comparison be achieved?

    I know one approach would be to extract substrings and manually compare them, but it feels hacky and I don't know the maximum number of chars that can occur.



  • they're similar ... if they share at least 2 chars.

    Unfortunately, there is no built-in "intersect" operator or function for strings or https://www.postgresql.org/docs/current/functions-array.html . You can roll your own function to count overlapping characters:

    CREATE FUNCTION f_count_overlapping_char(text, text)
      RETURNS int
      LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS
    $func$
    SELECT count(*)::int
    FROM (
       SELECT unnest(string_to_array($1, NULL))
       INTERSECT ALL
       SELECT unnest(string_to_array($2, NULL))
       ) sub;
    $func$;
    

    https://www.postgresql.org/docs/current/queries-union.html includes duplicate matching characters. To fold duplicates, use just INTERSECT instead.

    Then your query can be:

    SELECT *, f_count_overlapping_char(t1.metaphone, t2.metaphone) AS overlap
    FROM   tbl t1
    JOIN   tbl t2 ON t1.id < t2.id
                 AND f_count_overlapping_char(t1.metaphone, t2.metaphone) >= 2;
    

    db<>fiddle https://dbfiddle.uk/?rdbms=postgres_14&fiddle=1d8f9a81c23d32fbeb4d85d6a376fde1

    But it's expensive and does not scale well with more rows in the table - O(N²). Depending on your actual objective there are various superior alternatives - like trigram similarity provided by the additional module https://www.postgresql.org/docs/current/pgtrgm.html . See:

    • https://dba.stackexchange.com/questions/86773/what-does-the-postgres-operator-do/86774#86774



Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2