Mysql 5.6 Memory usage keeps increasing till it is killed by OOM killer



  • we have a production mysql 5.6.46, which is under heavy load (2000 Queries per second). The memory allocation to Mysql keeps on increasing over a period of time (sometimes in 1 day, sometime in 1 week) and the server gets auto restarted. Operating System is Ubuntu 20:04 LTS.

    All the tables are Innodb tables. Total Ram on the server is 512 GB. InnoDB Buffer is set to 350 GB (we have tried changing it 400 GB, 300 GB, doesnt seem to affect the issue). Overall index size is close to 300 GB. Data size is 1.5 TB

    Last we added a swap space to stop OOM Killer from killing the server, but even the additional 500 GB of swap becomes full over few additional hours/days and the server crashes (Also when the swapping takes place, the queries hangs for a few seconds). The time taken to crash is also not consistent, it has crashed in 12-14 hours on few occasions and ran OK for a week.

    I am pretty sure it is either a memory leak in this mysql version, or we are running some crazy queries which should not be running. The application is pretty old and heavy and while a new complete revamp on the way, this needs to run for another 6-8 months. Checking every query is going to be challenge.

    Here are the server variables and global status, if you can help on what configurations we can experiment with, it would be a great help.

    https://docs.google.com/spreadsheets/d/e/2PACX-1vSeGErEVF-bqHAf5XPRhWrCqpBO6CgeSeG4Cat_UxZ6wPrzrOI-DNQ0MnFTklom4WG4syI2cqZEMt1g/pubhtml



  • Analysis of GLOBAL STATUS and VARIABLES:

    Observations:

    • Version: 5.6.46-log
    • 512 GB of RAM
    • Uptime = 03:30:58; Please rerun SHOW GLOBAL STATUS after several hours.
    • 1930 QPS

    The More Important Issues:

    If "transparent huge pages" is turned on in the OS, see if turning it off will prevent the OOMs. (There is some chatter about this as a problem.)

    After 3.5 hours, the buffer pool was not filled up. I suggest lowering innodb_buffer_pool_size to 280G as a way to at least delay OOM. (Just a delay, not a cure.)

    table_open_cache is now 20000 is much higher than Opened_tables; recommend lowering to 3000. Also, I suggest table_open_cache_instances = 16

    innodb_purge_threads = 4

    host_cache_size = 1000 -- This might help with the high Max_used_connections.

    Do you have some kind of "connection pooling" turned on? Is it set too high?

    The "binlog" is turned on (log_bin); is it being used for Replication? Or something else?

    If you are having issues with "slow queries" there are several settings that need changing to turn on the slowlog to help locate such queries.

    tmp_table_size is currently only 1MB. That is abnormally small, especially for a machine as big as yours. It may be slowing down some querys. Recommend 16M.

    Details and other observations:

    ( table_open_cache ) = 20,000 -- Number of table descriptors to cache -- Several hundred is usually good.

    ( innodb_buffer_pool_instances ) = 64 -- For large RAM, consider using 1-16 buffer pool instances, not allowing less than 1GB each. Also, not more than, say, twice the number of CPU cores. -- Recommend no more than 16. (Beginning to go away in 10.5)

    ( innodb_lru_scan_depth * innodb_buffer_pool_instances ) = 1,024 * 64 = 65,536 -- A metric of CPU usage. -- Lower either number.

    ( innodb_lru_scan_depth ) = 1,024 -- "InnoDB: page_cleaner: 1000ms intended loop took ..." may be fixed by lowering lru_scan_depth

    ( Innodb_buffer_pool_pages_free / Innodb_buffer_pool_pages_total ) = 7,598,545 / 22937568 = 33.1% -- Pct of buffer_pool currently not in use -- innodb_buffer_pool_size (now 375809638400) is bigger than necessary?

    ( Com_rollback ) = 59,051 / 12658 = 4.7 /sec -- ROLLBACKs in InnoDB. -- An excessive frequency of rollbacks may indicate inefficient app logic.

    ( Handler_rollback ) = 31,472 / 12658 = 2.5 /sec -- Why so many rollbacks?

    ( Innodb_row_lock_time_max ) = 51,005 -- Max time to lock a row (millisec) -- Possibly conflicting queries; possibly table scans.

    ( innodb_flush_neighbors ) = 1 -- A minor optimization when writing blocks to disk. -- Use 0 for SSD drives; 1 for HDD.

    ( innodb_strict_mode ) = innodb_strict_mode = OFF -- ON would catch some subtle errors earlier. -- OFF leaves some warnings as warnings; ON makes them errors.

    ( sync_binlog ) = 0 -- Use 1 for added security, at some cost of I/O =1 may lead to lots of "query end"; =0 may lead to "binlog at impossible position" and lose transactions in a crash, but is faster. 0 is OK for Galera.

    ( innodb_adaptive_hash_index ) = innodb_adaptive_hash_index = ON -- Usually should be ON. -- There are cases where OFF is better. See also innodb_adaptive_hash_index_parts (after 5.7.9) and innodb_adaptive_hash_index_partitions (MariaDB and Percona). ON has been implicated in rare crashes (bug 73890). 10.5.0 decided to default OFF.

    ( innodb_print_all_deadlocks ) = innodb_print_all_deadlocks = OFF -- Whether to log all Deadlocks. -- If you are plagued with Deadlocks, turn this on. Caution: If you have lots of deadlocks, this may write a lot to disk.

    ( innodb_purge_threads ) = 1 -- Number of threads to clean up history list. -- If you have a lot of writes, recommend 4 in versions 5.6 and 10.0 or later.

    ( max_connections ) = 2,000 -- Maximum number of connections (threads). Impacts various allocations. -- If max_connections (now 2000) is too high and various memory settings are high, you could run out of RAM.

    ( local_infile ) = local_infile = ON -- local_infile (now ON) = ON is a potential security issue

    ( Created_tmp_tables ) = 2,256,582 / 12658 = 178 /sec -- Frequency of creating "temp" tables as part of complex SELECTs.

    ( Created_tmp_disk_tables ) = 19,445 / 12658 = 1.5 /sec -- Frequency of creating disk "temp" tables as part of complex SELECTs -- increase tmp_table_size (now 1048576) and max_heap_table_size (now 16777216). Check the rules for temp tables on when MEMORY is used instead of MyISAM. Perhaps minor schema or query changes can avoid MyISAM. Better indexes and reformulation of queries are more likely to help.

    ( Com_rollback / (Com_commit + Com_rollback) ) = 59,051 / (294527 + 59051) = 16.7% -- Rollback : Commit ratio -- Rollbacks are costly; change app logic

    ( Com_insert + Com_delete + Com_delete_multi + Com_replace + Com_update + Com_update_multi ) = (635420 + 10508 + 0 + 0 + 2733673 + 0) / 12658 = 266 /sec -- writes/sec -- 50 writes/sec + log flushes will probably max out I/O write capacity of normal drives

    ( binlog_error_action ) = binlog_error_action = IGNORE_ERROR -- What to do if the binlog cannot be written. -- IGNORE_ERROR is the default for backwards compatibility, but ABORT_SERVER is recommended.

    ( binlog_format ) = binlog_format = STATEMENT -- STATEMENT/ROW/MIXED. -- ROW is preferred by 5.7 (10.3)

    ( simplified_binlog_gtid_recovery ) = 0 -- Percona: Perform a faster crash recovery. -- Set to 1 to speed up binlog recovery with/without GTIDs after a crash.

    ( slow_query_log ) = slow_query_log = OFF -- Whether to log slow queries. (5.1.12)

    ( long_query_time ) = 10 -- Cutoff (Seconds) for defining a "slow" query. -- Suggest 2

    ( log_slow_slave_statements ) = log_slow_slave_statements = OFF -- (5.6.11, 5.7.1) By default, replicated statements won't show up in the slowlog; this causes them to show. -- It can be helpful in the slowlog to see writes that could be interfering with Replica reads.

    ( Max_used_connections ) = 831 -- High-water mark for connections -- Lots of inactive connections is OK; over 100 active connections is likely to be a problem. Max_used_connections (now 831) does not distinguish them; Threads_running (now 9) is instantaneous.

    ( Max_used_connections / host_cache_size ) = 831 / 703 = 118.2% -- Increase host_cache_size (now 703)

    ( Aborted_clients ) = 5,129 / 12658 = 0.41 /sec -- Threads bumped due to wait_timeout -- Increase wait_timeout (now 28800); be nice, use disconnect

    ( thread_cache_size / Max_used_connections ) = 1,000 / 831 = 120.3% -- There is no advantage in having the thread cache bigger than your likely number of connections. Wasting space is the disadvantage.

    Abnormally small:

    Com_show_tables = 0
    Handler_read_rnd_next / Handler_read_rnd = 2.91
    Innodb_log_writes / Innodb_log_write_requests = 0.05%
    innodb_lru_scan_depth / innodb_io_capacity = 0.341
    min(max_heap_table_size, tmp_table_size) = 1MB
    query_cache_limit = 0
    

    Abnormally large:

    ( Innodb_pages_read + Innodb_pages_written ) / Uptime = 1,385
    Binlog_cache_use = 173 /sec
    Bytes_sent = 6391896 /sec
    Com_kill = 0.48 /sec
    Com_set_option = 716 /sec
    Com_show_charsets = 0.25 /sec
    Com_show_collations = 0.25 /sec
    Com_update = 215 /sec
    Created_tmp_files = 0.32 /sec
    Handler_delete = 1387 /sec
    Handler_read_key = 171995 /sec
    Handler_read_next = 9216407 /sec
    Handler_read_prev = 41634 /sec
    Handler_read_rnd = 162780 /sec
    Innodb_buffer_pool_bytes_data = 19093702 /sec
    Innodb_buffer_pool_pages_data = 1.48e+7
    Innodb_buffer_pool_pages_dirty = 198,092
    Innodb_buffer_pool_pages_free = 7.6e+6
    Innodb_buffer_pool_pages_misc = 587,553
    Innodb_buffer_pool_pages_total = 2.29e+7
    Innodb_buffer_pool_read_ahead = 44 /sec
    Innodb_buffer_pool_read_requests = 3098244 /sec
    Innodb_buffer_pool_write_requests = 74119 /sec
    Innodb_buffer_pool_write_requests / Innodb_buffer_pool_pages_flushed = 321
    Innodb_data_read = 18940790 /sec
    Innodb_data_reads = 1154 /sec
    Innodb_data_writes - Innodb_log_writes - Innodb_dblwr_writes = 230 /sec
    Innodb_data_written = 8925503 /sec
    Innodb_dblwr_pages_written = 230 /sec
    Innodb_log_write_requests = 3442 /sec
    Innodb_os_log_written / (Uptime / 3600) / innodb_log_files_in_group = 2,359.1MB
    Innodb_pages_created = 11 /sec
    Innodb_pages_read = 1154 /sec
    Innodb_pages_read + Innodb_pages_written = 1385 /sec
    Innodb_pages_written = 230 /sec
    Innodb_rows_deleted = 1387 /sec
    Innodb_rows_deleted + Innodb_rows_inserted = 1435 /sec
    Innodb_rows_read = 9861927 /sec
    Innodb_rows_updated = 372 /sec
    Select_range = 73 /sec
    Sort_range = 50 /sec
    Threads_cached = 687
    innodb_open_files = 20,000
    innodb_read_io_threads = 64
    innodb_write_io_threads = 64
    optimizer_trace_offset = --1
    performance_schema_max_cond_instances = 28,100
    performance_schema_max_file_instances = 92,308
    performance_schema_max_socket_instances = 4,020
    performance_schema_max_thread_instances = 4,100
    table_definition_cache = 10,000
    

    Abnormal strings:

    optimizer_trace_features = greedy_search=on, range_optimizer=on, dynamic_range=on, repeated_subselect=on
    slave_rows_search_algorithms = TABLE_SCAN,INDEX_SCAN
    



Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2