Create an index for fields within an array of composite type



  • My question is a follow-up to the question answered here:
    https://stackoverflow.com/a/15041094/994263

    Considering a table with an array column of a composite type, is it possible with PostgreSQL to index the column to be able to search for rows containing an array entry matching some arbitrary predicate.

    Here is a fiddle to start off the problem: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=cbed38d77f5fb7e2bc3d14605e74d464

    We have the following schema:

    CREATE TYPE complex AS (
        r       double precision,
        i       double precision
    );
    

    CREATE TABLE tbl2 (tbl2_id serial, co complex[]);

    INSERT INTO tbl2(co)
    select array_agg((random()* 100, random()*100)::complex)
    from generate_series(1, 50000) i
    group by i % 10000;

    -- how to create an index on co[*].r basically?
    CREATE INDEX tbl2_co1_idx ON tbl2 (((co[1]).r)); -- note the parentheses!
    -- * this is only a single array entry's r values

    Is there a mechanism to do an indexed lookup for queries such as this:

    SELECT * FROM
       (SELECT *,
               generate_subscripts(co, 1) AS s
        FROM tbl2) AS foo
    WHERE (co[s].r) BETWEEN 9.65 and 9.67;
    

    The rationale behind this could be to have items such as polygons for example with a small number of points (x,y) and to then lookup easily which polygons are out of bounds. It is a more NoSQL-like approach, which would be great if it is doable without resorting to jsonb.



  • It is possible to create an index for fields within an array of composite type, but the applicable https://www.postgresql.org/docs/current/indexes-types.html#INDEXES-TYPES-GIN for arrays is GIN. And https://www.postgresql.org/docs/current/gin-builtin-opclasses.html that would work for your query predicate WHERE (co[*].r) BETWEEN 9.65 and 9.67. So not possible after all - with current Postgres 14, and probably future versions as well.

    The problem is merely academic, though, because you wouldn't do that to begin with. The solution is a proper relational design.

    Whenever you find yourself creating a table with an array of composite type, cease and desist. With a probability bordering on certainty, you should create a separate table instead, https://en.wikipedia.org/wiki/Database_normalization your design at least that much.

    Then everything becomes rather simple. Could look like this:

    -- parent table optional, not needed for demo
    CREATE TABLE tbl2_parent (tbl2_id serial PRIMARY KEY);  -- more attributes?
    

    -- random test data
    INSERT INTO tbl2_parent
    SELECT FROM generate_series(1, 1000) t;

    -- main table
    CREATE TABLE tbl2(
    tbl2_id int NOT NULL -- REFERENCES tbl2_parent
    , idx int NOT NULL
    , l float8 NOT NULL
    , r float8 NOT NULL
    , PRIMARY KEY (tbl2_id, idx)
    );

    -- random test data
    INSERT INTO tbl2 (tbl2_id, idx, l, r)
    SELECT t
    , i
    , random() * 100
    , random() * 100
    FROM generate_series(1, 1000) t, generate_series(1, 5) i
    ORDER BY 1, 2;

    Now, the index is simple:

    CREATE INDEX tbl2_r_idx ON tbl2 (r, tbl2_id);
    

    Any B-tree index with leading r does the job. Minor additional optimizations depend on the complete use case.

    And the query is something like:

    SELECT DISTINCT tbl2_id
    FROM   tbl2
    WHERE  r BETWEEN 9.65 and 9.67;
    

    Or, if you need more than distinct IDs:

    EXPLAIN ANALYZE
    SELECT *
    FROM   tbl2_parent p
    WHERE  EXISTS (
       SELECT FROM tbl2 t
       WHERE  t.r BETWEEN 9.65 AND 9.67
       AND    t.tbl2_id = p.tbl2_id
       );
    

    db<>fiddle https://dbfiddle.uk/?rdbms=postgres_14&fiddle=fd7ebbc39240674113d2475252cc7758

    Either query can use very fast index-only scans with tables that are vacuumed enough, or at least index scans.

    Asides

    For your original scenario, you could just use the built-in type point. It consists of two float8 quantities, exactly like your custom row type. Reference first and second number with a subscript like:

    SELECT ('(7,8)'::point)[0];  -- 7
    

    See:

    • https://stackoverflow.com/a/8150944/939860

    Your query in your original test case can be rewritten as:

    SELECT *
    FROM   tbl2 t
    WHERE  EXISTS (
       SELECT FROM unnest(t.co) elem
       WHERE (elem.r) BETWEEN 9.65 and 9.67
       );
    

    Reading your rationale, though, you might consider https://postgis.net/ first, or at least mainline https://www.postgresql.org/docs/current/datatype-geometric.html .




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2