MySQL Database locks up - trying to explain spike in writes



  • Trying to explain this symptom

    Our RDS MySQL database (no fancy custom parameter settings) has been going down periodically. Our team believes this is due to a high number of concurrent connections coming from a load-balanced PHP application with a high number of concurrent workers, which we have since alleviated by reducing the worker count to stay under the theoretical max connections for 4 GB of RAM.
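
    (For reference, this is roughly the sizing check I mean. I believe RDS derives the default max_connections from instance memory, so ~4 GB works out to roughly 300-odd connections, but the commands below are illustrative rather than our exact numbers.)

        -- Configured limit vs. what we actually reach
        SHOW GLOBAL VARIABLES LIKE 'max_connections';
        SHOW GLOBAL STATUS LIKE 'Max_used_connections';   -- high-water mark since last restart
        SHOW GLOBAL STATUS LIKE 'Threads_connected';       -- connections open right now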

    However, I am struggling to explain an additional symptom: a large spike in the innodb_data_writes and innodb_rows_inserted metrics during these outages. This is leading some on the team to believe that an errant job or malicious attack is hitting the database, although we cannot find any of this supposed data in the database tables at all.

    My position is that this is related to buffer pool writes; the counter-argument is that these metrics track filesystem writes and row updates, not memory caches. I've countered that a low-memory environment would lead to buffers being flushed to disk (i.e. swap insanity), and that disk space usage shows the data is gone after a restart, but then it circles back to "this is a database metric, not an OS metric". And back and forth it goes...
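
    (For the record, the counters in dispute can be sampled directly; a minimal way to watch them during an incident, using only stock MySQL status variables:)

        -- Writes hitting the data files vs. dirty pages being flushed from the buffer pool.
        -- Sampling these a minute apart should show whether a "write spike" is flushing of
        -- already-dirty pages or genuinely new row activity.
        SHOW GLOBAL STATUS LIKE 'Innodb_data_writes';
        SHOW GLOBAL STATUS LIKE 'Innodb_rows_inserted';
        SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
        SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed';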

    So I'm happy to start from scratch for my own learning - what is happening here?


    The database goes down and completely hangs; metrics show a high number of connections to the database. PHP exceptions all show "max connection limits reached" (or words to that effect). The database requires a restart, after which services resume and disk space usage resets.

    [Image: all monitoring]

    During this time - high spike in writes and updates.

    Edit: more images

    [Images: spike - innodb stats]

    Edit:

    The previous graphs may have been averaging metrics over larger increments. I have set these graphs to bin statistics at 60-second intervals. You can clearly see connections hitting the limit here.



    My position - the spike is a result, not the cause, of the downtime. The database needs to be upgraded with more RAM (or given a read replica).

    Their position - the spike is the cause of the downtime, so we should find the cause of the spike.

    I'll note again that, as of writing, the issue has not recurred since we reduced our concurrent worker count. But I'm very happy to be wrong here if it means solving this for everyone and learning something new.

    Edit: Status and Variables here https://gist.github.com/darylteo/f2681c99b4e0c5eafcb31d95297a5b36

    Edit: looking at the error logs from RDS, there are hundreds of these errors

    A long semaphore wait:
    --Thread 22545149187840 has waited at srv0tmp.h line 182 for 303 seconds the semaphore:
    Mutex at 0x1481625f79c0, Mutex TEMP_POOL_MANAGER created srv0tmp.cc:184, lock var 1
    
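    As far as I can tell, the same waits also show up live in the SEMAPHORES section of SHOW ENGINE INNODB STATUS, and since srv0tmp.cc appears to be the session temporary-tablespace pool, I'm also watching the temp-table counters (commands below are just what I'm checking, not a diagnosis):

        -- SEMAPHORES section lists threads currently waiting on InnoDB mutexes/latches
        SHOW ENGINE INNODB STATUS\G

        -- The TEMP_POOL_MANAGER mutex seems to guard session temporary tablespaces,
        -- so heavy on-disk temp-table creation is one thing worth ruling out
        SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
        SHOW GLOBAL STATUS LIKE 'Created_tmp_tables';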

    Edit: adding more dashboard metrics: the request-count pattern remains the same, but after the changes we no longer reach the same level of database connections. So far, no database outage or service downtime since.




  • The first set of graphs implies a backup on odd-numbered dates.

    Yes, throttle the number of clients you run; this avoids flooding MySQL. MySQL gives every connection equal access to resources, so when it is flooded, all of them take a long time to finish.
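
    A quick way to see the flooding: if Threads_running stays well above the number of CPU cores, everything is stumbling over everything else. (Rough rule of thumb, not a hard limit.)

        -- Connections that exist vs. connections actively executing right now
        SHOW GLOBAL STATUS LIKE 'Threads_connected';
        SHOW GLOBAL STATUS LIKE 'Threads_running';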

    You say "workers". Do you have some kind of "queue". Be aware that if you launch workers "too fast", they could fail to finish before new ones are added to the mix. This can especially happen if you are using "corn" and one instance is not finishing before the next instance is launched.

    Use the slow query log (see http://mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog ) to locate the "worst" queries -- namely the ones to optimize first (via better indexes, etc).
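
    If the slowlog is not already on, something like this (on RDS these normally go in the DB parameter group rather than SET GLOBAL):

        SET GLOBAL slow_query_log = 'ON';
        SET GLOBAL long_query_time = 1;      -- log anything slower than 1 second
        SET GLOBAL log_queries_not_using_indexes = 'ON';   -- optional; can be noisy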

    "innob_db_data_writes and innodb_rows_inserted metrics" - Sounds like inefficient queries; the slowlog should help in diagnosing.

    "filesystem writes and row updates, and not memory caches" -- Yeah. High "reads" might indicate the need for more RAM.

    "swap insanity" -- Probably RDS is carefully tuned to always avoid swapping. (The analysis, above, will help see it that is the case.) The "innodb%" metrics do not include swapping.

    "disk space usage shows that after reset, the data is gone" -- Again, this would show up as more innodb reads, not writes.

    Provide the GLOBAL STATUS and VARIABLES for analysis as mentioned here: http://mysql.rjweb.org/doc.php/mysql_analysis#tuning
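
    (Concretely, that is the full output of the following, ideally after the server has been up at least a day so the counters are meaningful:)

        SHOW GLOBAL STATUS;
        SHOW GLOBAL VARIABLES;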



