UPDATE incredibly slow on small PostgreSQL table - EXPLAIN ANALYZE causes it to hang



  • I'm trying to debug a very slow UPDATE query in one of my tables.

    The table name is beepers. It has 2486 rows and weighs 1204 MB. It has 18 columns and two indices: one on id and the other on friendly_name (a string with a legible name, e.g. "correct staple").

    Here's what I see through Postico:

    CREATE UNIQUE INDEX beepers_pkey ON beepers(id int4_ops);
    CREATE INDEX ix_beepers_friendly_name ON beepers(friendly_name text_ops);
    

    I have a couple hundred beepers out in the field, and every 30s each beeper reaches out to my server to record when it was last active.

    The query being executed behind the scenes is:

    UPDATE beepers SET updated_at=NOW()::timestamp, last_active=NOW()::timestamp WHERE beepers.id = 399;
    

    Note that this query does not update either of the two indexed fields.
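
    Since neither indexed column changes, these should normally be HOT (heap-only tuple) updates that don't touch the indexes at all. As a sanity check, pg_stat_user_tables can show whether that's actually happening; roughly (just a sketch, using the same db / Flask-SQLAlchemy engine shown further down):

    from sqlalchemy import text
    import db  # the app's Flask-SQLAlchemy wrapper used throughout this post

    # Compare total updates vs. heap-only (HOT) updates, plus the dead-tuple count.
    sql = text("""
        SELECT n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup
        FROM pg_stat_user_tables
        WHERE relname = 'beepers';
    """)
    print(dict(db.engine.execute(sql).first()))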

    --

    I'm using Python and Flask-SQLAlchemy to interact with this DB (hosted on Google Cloud SQL, running PostgreSQL 9.6).

    Most of the time, this UPDATE happens in under 1 second.

    However, it regularly takes over 20 seconds, sometimes even more than 100 seconds.

    If I try running this query back-to-back in a little script, I see a large spike in Lock Wait according to Google Cloud SQL's Query Insights tab.
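
    When one of the slow runs is actually in flight, it should be possible to see who is blocking whom straight from pg_stat_activity (pg_blocking_pids() exists as of PostgreSQL 9.6). A rough sketch, same engine setup as below:

    from sqlalchemy import text
    import db

    # List sessions currently waiting on a lock, the PIDs blocking them,
    # how long their transaction has been open, and what they are running.
    sql = text("""
        SELECT pid,
               pg_blocking_pids(pid) AS blocked_by,
               state,
               wait_event_type,
               wait_event,
               now() - xact_start AS xact_age,
               query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock';
    """)
    for row in db.engine.execute(sql):
        print(dict(row))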

    I tried running the query with EXPLAIN ANALYZE at the front, but I got the strangest behavior!

    import db
    from sqlalchemy import text
    

    sql = text("UPDATE beepers SET updated_at=NOW()::timestamp, last_active=NOW()::timestamp WHERE beepers.id = 399;")
    db.engine.execute(sql)

    Querying the DB shows that the row updated! This is expected.
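
    ("Querying the DB" here just means a quick check along these lines, reusing the engine from above:)

    check = text("SELECT updated_at, last_active FROM beepers WHERE id = 399;")
    print(db.engine.execute(check).first())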

    Now let's add "EXPLAIN ANALYZE" to the front.

    sql = text("EXPLAIN ANALYZE UPDATE beepers SET updated_at=NOW()::timestamp, db.engine.execute(sql)
    db.engine.execute(sql)
    db.engine.execute(sql)

    I can run this over and over, but querying the DB shows that the row does NOT update!

    If I now change the query back to remove the "EXPLAIN ANALYZE"...

    sql = text("UPDATE beepers SET updated_at=NOW()::timestamp, last_active=NOW()::timestamp WHERE beepers.id = 399;")
    db.engine.execute(sql)

    Python completely hangs!

    For some reason the queries with EXPLAIN ANALYZE don't actually perform the UPDATE, and the next call (minus the EXPLAIN) completely hangs!

    How can I figure out what's causing this strangeness?
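
    One experiment I can think of: maybe db.engine.execute only auto-commits statements it recognizes as plain DML, so the UPDATE inside the EXPLAIN ANALYZE runs but never gets committed, and the open transaction keeps holding the row lock. To test that (just a sketch), I can run the EXPLAIN ANALYZE inside an explicit transaction, read the plan, and commit it myself:

    from sqlalchemy import text
    import db

    sql = text("EXPLAIN ANALYZE UPDATE beepers SET updated_at=NOW()::timestamp, last_active=NOW()::timestamp WHERE beepers.id = 399;")

    # Run the EXPLAIN ANALYZE inside an explicit transaction, print the plan, and commit,
    # so the UPDATE it executes is persisted and its row lock is released.
    with db.engine.connect() as conn:
        trans = conn.begin()
        for (plan_line,) in conn.execute(sql):
            print(plan_line)
        trans.commit()

    If the row persists this way and nothing hangs afterwards, that would point at the client/transaction side rather than at PostgreSQL itself.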

    When I do get some output from EXPLAIN ANALYZE, I never catch the super slow run:

    Update on public.beepers  (cost=8.41..12.43 rows=1 width=587) (actual time=0.395..0.395 rows=0 loops=1)
      Buffers: shared hit=21 dirtied=1
      ->  Bitmap Heap Scan on public.beepers  (cost=8.41..12.43 rows=1 width=587) (actual time=0.013..0.015 rows=1 loops=1)
            Output: created_at, '2022-02-07 17:53:58.657941'::timestamp without time zone, id, friendly_name, '2022-02-07 17:53:58.652552'::timestamp without time zone, last_ip, ctid ...
            Recheck Cond: (beepers.id = 399)
            Heap Blocks: exact=1
            Buffers: shared hit=5
            ->  Bitmap Index Scan on beepers_pkey  (cost=0.00..8.41 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
                  Index Cond: (beepers.id = 399)
                  Buffers: shared hit=4
    Planning time: 0.348 ms
    Execution time: 0.465 ms
    

    UPDATE 1

    I've run the query from this Stack Exchange answer: https://dba.stackexchange.com/questions/23879/measure-the-size-of-a-postgresql-table-row before and after running VACUUM (ANALYZE, FULL).

    Before:

                  metric               |   bytes    | bytes_pretty | bytes_per_row
    -----------------------------------+------------+--------------+---------------
     core_relation_size                | 1262665728 | 1204 MB      |        507910
     visibility_map                    |      40960 | 40 kB        |            16
     free_space_map                    |     327680 | 320 kB       |           131
     table_size_incl_toast             | 5456994304 | 5204 MB      |       2195090
     indexes_size                      |   60055552 | 57 MB        |         24157
     total_size_incl_toast_and_indexes | 5517049856 | 5261 MB      |       2219247
     live_rows_in_text_representation  |    7994740 | 7807 kB      |          3215
     ------------------------------    |            |              |
     row_count                         |       2486 |              |
     live_tuples                       |       2338 |              |
     dead_tuples                       |        457 |              |
    

    After:

                  metric               |  bytes  | bytes_pretty | bytes_per_row
    -----------------------------------+---------+--------------+---------------
     core_relation_size                | 2359296 | 2304 kB      |           949
     visibility_map                    |       0 | 0 bytes      |             0
     free_space_map                    |       0 | 0 bytes      |             0
     table_size_incl_toast             | 7708672 | 7528 kB      |          3100
     indexes_size                      |  196608 | 192 kB       |            79
     total_size_incl_toast_and_indexes | 7905280 | 7720 kB      |          3179
     live_rows_in_text_representation  | 7994727 | 7807 kB      |          3215
     ------------------------------    |         |              |
     row_count                         |    2486 |              |
     live_tuples                       |    2486 |              |
     dead_tuples                       |      36 |              |
    

    Hot dang. Somehow 1.2GB shrank to 2.3MB.

    This strikes me as kind of strange, because in the logs I see an "automatic vacuum" running regularly with the following output:

    LOG:  automatic vacuum of table "mydb.public.beepers": index scans: 0
    pages: 0 removed, 154134 remain, 2 skipped due to pins, 152707 skipped frozen
    tuples: 0 removed, 11085 remain, 6406 are dead but not yet removable
    buffer usage: 16379 hits, 0 misses, 1 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.040 MB/s
    system usage: CPU 0.00s/0.01u sec elapsed 0.19 sec
    
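
    To see what autovacuum has actually been doing to this table, the stats view can be checked directly (a sketch, same setup as above). The "6406 are dead but not yet removable" part presumably means some transaction stays open long enough that vacuum can't clean up behind it:

    from sqlalchemy import text
    import db

    # When did (auto)vacuum last run on beepers, how often, and how many dead tuples remain?
    sql = text("""
        SELECT last_vacuum, last_autovacuum, vacuum_count, autovacuum_count,
               n_live_tup, n_dead_tup
        FROM pg_stat_user_tables
        WHERE relname = 'beepers';
    """)
    print(dict(db.engine.execute(sql).first()))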

    In any case, I think I've found what's going on: lots of lock waiting. The logs look something like this:

    [23086]: [1-1] LOG: process 23086 still waiting for ShareLock on transaction 2432622359 after 1000.156 ms
    [23086]: [2-1] DETAIL: Process holding the lock: 12946. Wait queue: 23086.
    [23086]: [3-1] CONTEXT: while updating tuple (17,4) in relation "beepers"
    [23086]: [4-1] STATEMENT: UPDATE beepers SET updated_at='2022-02-09T02:13:05.882135'::timestamp, last_active='2022-02-09T02:13:05.881215'::timestamp WHERE beepers.id = 1154
    

    [20888]: [2-1] DETAIL: Process holding the lock: 23086. Wait queue: 20888.
    [22744]: [2-1] DETAIL: Process holding the lock: 23086. Wait queue: 20888, 22744.
    [22743]: [2-1] DETAIL: Process holding the lock: 23086. Wait queue: 20888, 22744, 22743.
    ...
    [22527]: [2-1] DETAIL: Process holding the lock: 23086. Wait queue: 20888, 22744, 22743, 22931, 22902, 21294, 22693, 23277, 23276, 22695, 23275, 22243, 22692, 21005, 22653, 21898, 21734, 22527.

    [23086]: [5-1] LOG: process 23086 acquired ShareLock on transaction 2432622359 after 238962.352 ms
    [23086]: [6-1] CONTEXT: while updating tuple (17,4) in relation "beepers"
    [23086]: [7-1] STATEMENT: UPDATE beepers SET updated_at='2022-02-09T02:13:05.882135'::timestamp, last_active='2022-02-09T02:13:05.881215'::timestamp WHERE beepers.id = 1154
    [23086]: [8-1] LOG: duration: 238963.367 ms statement: UPDATE beepers SET updated_at='2022-02-09T02:13:05.882135'::timestamp, last_active='2022-02-09T02:13:05.881215'::timestamp WHERE beepers.id = 1154

    In this case, process 12946 held the ShareLock on transaction 2432622359 for a long time. Once 12946 let go of it, 23086 acquired it and the backlog of waiting processes could clear. But in all the logs put out by process 12946, I never once see it querying or touching the beepers table. Why would that process be blocking updates to beepers?
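
    Next time I catch a blocker like 12946 in the act, I plan to pull every lock that PID holds and what its session is doing, roughly like this (12946 is just the PID from the excerpt above):

    from sqlalchemy import text
    import db

    # For a given blocking backend: every lock it holds, plus its session state and query.
    sql = text("""
        SELECT l.locktype, l.relation::regclass AS relation, l.transactionid,
               l.mode, l.granted,
               a.state, a.xact_start, a.query
        FROM pg_locks l
        JOIN pg_stat_activity a ON a.pid = l.pid
        WHERE l.pid = :blocker;
    """)
    for row in db.engine.execute(sql, blocker=12946):
        print(dict(row))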



  • A 1+ GB size for a table with ~2k rows means either it has huge text columns or it is highly bloated.

    Why don't you try running VACUUM (ANALYZE, FULL) against the database in question (if you haven't already done that) and re-check the size of your table?

    VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.

    Reference: https://www.postgresql.org/docs/current/sql-vacuum.html
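
    If it helps, here's roughly how that can be run from the Flask-SQLAlchemy setup you describe (just a sketch; VACUUM cannot run inside a transaction block, so the connection has to be switched to autocommit first):

    from sqlalchemy import text
    import db

    with db.engine.connect() as conn:
        # VACUUM can't run inside a transaction block, so use an autocommit connection.
        conn = conn.execution_options(isolation_level="AUTOCOMMIT")
        conn.execute(text("VACUUM (ANALYZE, FULL) beepers;"))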



