Optimizing expensive GROUP BY / ORDER BY query on 690,000 row table



  • I am writing a query to calculate hiscore rankings for a game. It performs a GROUP BY to aggregate hiscore statistics for individual player stats that are stored across 23 different rows (separated by different "skills").

    The query I came up with below works:

    SELECT * FROM (
        SELECT
            username,
            SUM(level) total_level,
            SUM(experience) total_experience,
            (SELECT rank FROM overall_hiscore_rankings WHERE username = skill_hiscore.username) rank
            FROM skill_hiscore GROUP BY username ORDER BY rank
         ) a WHERE rank >= 1 AND rank 

    Note: overall_hiscore_rankings is a materialized view and the filter at the end is my implementation of the seek method for pagination to prevent expensive LIMIT OFFSET pagination.

    I did not notice any problems with this until seeding about ~30,000 players worth of data into my hiscores schema. Because each player has 23 different rows as mentioned above, this leaves us with a table with 690,000 rows.

    This now gives us atrocious performance, literally taking my db 100 seconds to execute this query:

    Sort  (cost=35926899.17..35926907.50 rows=3330 width=35) (actual time=98599.592..98599.595 rows=25 loops=1)
      Sort Key: ((SubPlan 1))
      Sort Method: quicksort  Memory: 26kB
      ->  GroupAggregate  (cost=0.42..35926704.35 rows=3330 width=35) (actual time=149.023..98599.514 rows=25 loops=1)
            Group Key: skill_hiscore.username
            Filter: (((SubPlan 2) >= 1) AND ((SubPlan 3)   Index Scan using skill_hiscore_username on skill_hiscore  (cost=0.42..46884.88 rows=690063 width=19) (actual time=144.644..293.859 rows=690063 loops=1)
            SubPlan 1
              ->  Seq Scan on overall_hiscore_rankings  (cost=0.00..567.04 rows=1 width=8) (actual time=0.002..1.592 rows=1 loops=25)
                    Filter: ((username)::text = (skill_hiscore.username)::text)
                    Rows Removed by Filter: 30002
            SubPlan 2
              ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_1  (cost=0.00..567.04 rows=1 width=8) (actual time=0.817..1.633 rows=1 loops=30003)
                    Filter: ((username)::text = (skill_hiscore.username)::text)
                    Rows Removed by Filter: 30002
            SubPlan 3
              ->  Seq Scan on overall_hiscore_rankings overall_hiscore_rankings_2  (cost=0.00..567.04 rows=1 width=8) (actual time=0.816..1.633 rows=1 loops=30003)
                    Filter: ((username)::text = (skill_hiscore.username)::text)
                    Rows Removed by Filter: 30002
    Planning Time: 2.820 ms
    JIT:
      Functions: 21
      Options: Inlining true, Optimization true, Expressions true, Deforming true
      Timing: Generation 1.248 ms, Inlining 44.936 ms, Optimization 60.995 ms, Emission 38.520 ms, Total 145.700 ms
    Execution Time: 98623.550 ms
    

    Clearly, the ORDER BY and GROUP BY is causing the largest performance drain, but I am not sure how to rewrite this.

    Nothing pops out at my that indicates this query would take 100 seconds - I have never seen a query perform so poorly, and frankly trying to optimize this is out of my knowledge.

    This query is not backed by any indexes.
    My main question is this: Is the lack of an index on this query for the GROUP BY and ORDER BY data enough to make the runtime of this query 100 seconds?



  • The correlated subquery is a real drag on performance, and without need.
    The lack of indexes, too.

    Try instead:

    SELECT username
         , sum(s.level) AS total_level
         , sum(s.experience) AS total_experience
         , o.rank
    FROM   overall_hiscore_rankings o
    JOIN   skill_hiscore s USING (username)
    WHERE  o.rank >= 1
    AND    o.rank 

    If overall_hiscore_rankings.username is unique, you can just add o.rank to GROUP BY. I'd need table definitions to be sure ...

    Ideally, you have indexes like:

    CREATE INDEX ON overall_hiscore_rankings (rank, username);
    CREATE INDEX ON skill_hiscore (username) INCLUDE (level, experience);
    

    The covering index only makes sense if your table is vacuumed enough. Else make it on just (username) - the index skill_hiscore_username in your query plan seems to cover that. See:

    • https://dba.stackexchange.com/questions/93158/how-to-speed-up-select-distinct/93159#93159
    • https://dba.stackexchange.com/questions/190156/can-postgres-use-an-index-only-scan-for-this-query-with-joined-tables/190158#190158

    Be sure to have autovacuum running, or run VACUUM ANALYZE manually once after creating your test case to get current column statistics and update the visibility map.

    BTW, you mention "23 different rows (separated by different skills)", but WHERE rank >= 1 AND rank seems to select 25 of them. Did you mean WHERE rank > 1 AND rank < 25?




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2