Select rows where array of letters is contained in given array



  • I want to search for the letters in the "spelling" (text[]) column: a n n o y t

    I need it to find the words: any, annoy, no, an, toy But not find: derivatives of annoy (annoying, annoyance), only find no once.

    I also need the query to NOT find annoy if even a single letter is missing (such as anoyt).

    I am using PostgreSQL 13.5

    ronshome=# SELECT reference, word, spelling FROM word_mash_dictionary
               WHERE word LIKE 'annoy';
     reference | word  |  spelling
    -----------+-------+-------------
           420 | annoy | {a,n,n,o,y}
    (1 row)
    

    This is the table structure:

                                                          Table "public.word_mash_dictionary"
       Column    |  Type  | Collation | Nullable |                         Default                         | Storage  | Stats target | Description
    -------------+--------+-----------+----------+---------------------------------------------------------+----------+--------------+-------------
     reference   | bigint |           | not null | nextval('word_mash_dictionary_reference_seq'::regclass) | plain    |              |
     word        | text   |           |          |                                                         | extended |              |
     spelling    | text[] |           |          |                                                         | extended |              |
     ignore      | bigint |           |          |                                                         | plain    |              |
     list_100    | bigint |           |          |                                                         | plain    |              |
     list_300    | bigint |           |          |                                                         | plain    |              |
     list_500    | bigint |           |          |                                                         | plain    |              |
     list_800    | bigint |           |          |                                                         | plain    |              |
     list_1000   | bigint |           |          |                                                         | plain    |              |
     list_2000   | bigint |           |          |                                                         | plain    |              |
     list_3000   | bigint |           |          |                                                         | plain    |              |
     list_5000   | bigint |           |          |                                                         | plain    |              |
     list_7000   | bigint |           |          |                                                         | plain    |              |
     list_10000  | bigint |           |          |                                                         | plain    |              |
     word_length | bigint |           |          |                                                         | plain    |              |
    


  • The https://www.postgresql.org/docs/current/functions-array.html#ARRAY-OPERATORS-TABLE mostly does it:

    SELECT reference, word, spelling
    FROM   word_mash_dictionary
    WHERE  spelling 

    This can be supported with a GIN index on the array, which makes it fast for big tables. Like:

    CREATE INDEX ON word_mash_dictionary USING gin (spelling);
    

    However, one element in the search array covers any number of matches in spelling. So '{a,n,o,y}' would find '{a,n,n,o,y}' etc. False positives for words with repeated letters.

    A set-operation with https://www.postgresql.org/docs/current/queries-union.html would be exact (consider each copy of the same element separately). Wrapped into a custom function:

    CREATE OR REPLACE FUNCTION f_arr_is_contained(arr1 text[], arr2 text[])
      RETURNS bool
      LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
    $func$
    DECLARE
    BEGIN
       PERFORM unnest(arr1) EXCEPT ALL SELECT unnest(arr2);
    

    RETURN NOT FOUND;
    END
    $func$;

    If every single letter is covered in the second term, no row is returned, and FOUND is false. So return NOT FOUND.

    I chose LANGUAGE plpgsql because the function cannot be "inlined" anyway, so plpgsql might be faster. You can test the equivalent alternative with LANGUAGE plpgsql:

    CREATE OR REPLACE FUNCTION f_arr_is_contained_sql(arr1 text[], arr2 text[])
      RETURNS bool
      LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
    'SELECT NOT EXISTS (SELECT unnest (arr1) EXCEPT ALL SELECT unnest (arr2))';
    

    The function cannot use any indexes, though, which would lead to expensive sequential scans over the whole table.

    Combine both to be fast and exact:

    SELECT reference, word, spelling
    FROM   word_mash_dictionary
    WHERE  spelling 

    db<>fiddle https://dbfiddle.uk/?rdbms=postgres_14&fiddle=451858f812f6ab1f3fc6b8731d6b702a

    The first predicate finds all matches (and possibly some false positives) quickly with index support; the second predicate weeds out the (few!) false positives.

    Aside, word and spelling should probably be declared NOT NULL.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2