Do I need statistics in addition to an index?



  • Suppose I have a table with images that are supposed to go through several steps:

    CREATE TABLE images (filename text, extracted bool, cropped bool, resized bool);
    

    INSERT INTO images (filename, extracted, cropped, resized)
    VALUES
    ('foo', false, false, false),
    ('bar', true, false, false),
    ('baz', true, true, false),
    ('qux', true, true, true);

    At some point I have a query to find all images that are cropped but still need to be resized:

    SELECT count(*) FROM images WHERE cropped AND NOT resized;
    

    Now I believe the best way to make that query fast is a partial index:

    CREATE INDEX ON images (cropped, resized) WHERE (cropped AND NOT resized);
    

    I'd make it partial because cropped AND NOT resized is a relatively rare state while there might be millions of images that are already fully processed and also millions that are not cropped yet.

    My question is now, do I need statistics in addition to the index?

    One of these?

    CREATE STATISTICS stat1 (dependencies) ON cropped, resized FROM images;
    CREATE STATISTICS stat2 (ndistinct) ON cropped, resized FROM images;
    CREATE STATISTICS stat3 (mcv) ON cropped, resized FROM images;
    ANALYZE images;
    

    I found chapter https://www.postgresql.org/docs/current/planner-stats-details.html which I had previously missed (or rather conflated with https://www.postgresql.org/docs/current/planner-stats.html ), but it only talks about how statistics are turned into row estimates. What is unclear to me is how indexes are chosen, given that there are apparently no statistics about indexes.



  • You are substantially over-thinking this. Your query is very simple, and there are only a few ways it could be executed. Whether it will return 7000 rows or 2000 rows, it doesn't matter because either way the index will appear to be better than the meager alternatives.

    If you really do want to run a wider variety of queries which have more opportunities to make the wrong planner choice, it might be important to include the extended statistics of the mcv variety.

    Your two examples are utterly mismatched. The table of counts in your question would lead to vastly different row estimates than are shown in your answer. It would give around 5,000,000 with no extended MCV stats, and around 1 with the extended stats. Certainly not 6872 vs 1782.




Suggested Topics