Database becoming unresponsive after time (4 to 24 hours)
-
Since a couple of days my mariadb database randomly becomes unresponsive after random intervals (which have been between 4 and 20 hours).
The database comes back to life when rebooting the ct.
And after some time these queries start to fail withMySqlException (0x80004005): The Command Timeout expired before the operation completed
.
I have also noticed that sometimes restarting the mariadb-service also gets stuck.Below is all information I have on the issue:
System Info
Ubuntu: Ubuntu 22.04 LTS
Mariadb: 10.6.7
Proxmox: 7.2-3Fixing attemps
Fresh install (vm) importing old data (only copying the database
fivem
) using mysqldumpmysqlcheck --all-databases --auto-repair
(while mysql is working. When it is in the unresponsive state, there's no output.
Proxmox
This is a screenshot of all information on the proxmox CT: https://i.imgur.com/nR6n3nt.png
Data and structure
It happens in a database called fivem which has multiple tables. Notably only the bigger ones are stuck (meaning that
select * from TABLE_NAME
will not finish for a long time).Phpmyadmin table (content): https://i.imgur.com/cUdhvs1.png
Phpmyadmin table (size and "is hung"): https://i.imgur.com/C6gxn1N.pngMariaDB Config
I have done these modification to the default config:
skip-name-resolve max_allowed_packet = 1G max_connections = 100000 table_cache = 640 thread_handling = pool-of-threads innodb_buffer_pool_size = 48103633715
INNODB
SHOW ENGINE INNODB STATUS;
was also unresponsive.Processlist
Shows multiple queries on mostly the states
commit
andsending data
(somecommits
).Hastebin-Link to full processlist: https://www.toptal.com/developers/hastebin/sapajapohi.sql
Syslog
Shows a lot of
Got an error reading communication packets
and(This connection closed normally without authentication
.
Note that these happen while mysql is operational too.Some examples: https://www.toptal.com/developers/hastebin/rewozofana.yaml
Mariadb service
Output while starting: https://www.toptal.com/developers/hastebin/uracoqutab.yaml
Restarting the mariadb service also did not work. The last warning it shows are: https://www.toptal.com/developers/hastebin/usewerugut.less
Conclusion
If anyone has ideas of what the issue might be or just ideas on how to further debug this I would really appreciate it.
Edit / Updates
- max_connections has been lowered to 250 but will need to be increased
- innodb_buffer has also been lowered
- It has become apparent that shortly after (or right before) the unresponsivness a lot of Disk IO is observed (20x the usual amount)
- mysql status shows some warnings the quickly increase in size ( https://i.imgur.com/TxMxC35.png ). I am not sure if these could be a cause and how to fix them for now
- I noticed that restarting the service (mariadb) occasionally got it stuck at
Waiting to flush the buffer pool
with the last messages being
May 19 08:11:06 database systemd[1]: Stopping MariaDB 10.6.7 database server... May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: FTS optimize thread exiting. May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Starting shutdown... May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Buffer pool(s) dump completed at 220519 8:11:06
-
I have probably found the cause and it's a weird edge case.
I have the cts (from proxmox) running on lvm instead of lvm-thin. And had a backup job using snapshot for the database every 4 hours. My guess is that since snapshot was unavailable it froze the ct for the backup and somehow mariadb did not like that.
I have now changed that to a backup once a day with the shutdown option.