I have been struggling to work out what could be happening here but for many months, since upgrading to mysql 5.6 we seem to be experiencing random lag spikes in page loads which has been tracked back to mysql and the innodb_buffer_pool_size setting specifically.
The server is as follows;
- CentOS 6 64bit cpanel/WHM
- Dual Intel Xeon E5-2620 – Hex Core 2.0GHz, 15MB Cache, HyperThreaded [12 cores, 24 threads]
- 24GB RAM
- 2 x 240GB SSD
- Apache 2.4
- PHP 5.5 (fcgid handler)
- mysql 5.6
We are running a Magento site that gets about 1,500 visitors a day with a DB size of about 5GB. Nothing major by any stretch.
Using Newrelic we can see that we can comfortably allocate 12GB of ram to innodb_buffer_pool_size. However, about 24 hours after doing this we start to get random load spikes where pages may take 10s of seconds to minutes to load, however, if you hit refresh in the middle of such a load spike the page will load in under a second as per normal.
This problem began especially apparent when we tried to use php-fpm as the php handler so as to properly utilise opcache. When we tried this the PHP processes would get backed up during a mysql lock and processes numbers would ramp up until the site crashed. It became so unstable that we had to revert back to using fcgid which just stalls rather than crashes.
Looking at Newrelic we can see the load time spikes in mysql however, the server memory and CPU usage is still no where near exhausted.
Oddly enough, leaving innodb_buffer_pool_size to defaults seemed to make the problem go away for at least a week or so. Once the mysql ram usage starts to hit the 900mb to 1gb range the spikes start to happen again and we have to restart mysql. Then we are good for another week or so.
With the default mysql settings Newrelic shows about 80% of the ram is free.
Below is our current my.cnf; (as you can see we have commented out the innodb buffer lines and all logging)
[mysql] # CLIENT # port = 3306 socket = /var/lib/mysql/mysql.sock [mysqld] # GENERAL # user = mysql default-storage-engine = InnoDB socket = /var/lib/mysql/mysql.sock # MyISAM # key_buffer_size = 64M myisam_recover_options = FORCE,BACKUP # SAFETY # max_allowed_packet = 16M max_connect_errors = 1000000 skip-name-resolve innodb = FORCE # DATA STORAGE # datadir = /var/lib/mysql/ # BINARY LOGGING # #log-bin = /var/lib/mysql/mysql-bin #expire_logs_days = 14 #sync_binlog = 1 # CACHES AND LIMITS # wait_timeout = 300 query_cache_type = 0 query_cache_size = 0 max_connections = 500 # INNODB # innodb_log_file_size = 256M # if changing, stop database, remove old log files, then start! innodb_file_per_table = 1 #innodb_buffer_pool_size = 12G #innodb_buffer_pool_instances = 12 # LOGGING # #log-error = /var/lib/mysql/mysql-error.log #log-queries-not-using-indexes = 1 #slow-query-log = 1 #slow_query_log_file = /var/lib/mysql/mysql-slow.log
We have gone to our hosting providers and asked them for help. Initially they told me to raise the innodb_buffer_pool_instances as they thought their may have been some lock contention. I set it from 12 to 24 and the problem persisted.
Then they ran a memtest to check for issues with the ram and found nothing. Eventually they gave up and told us to consult a DB expert.
I just can't work it out. Any help would be greatly appreciated.
************ UPDATE 1 *************
So I have been digging into this issue further and further and have made some new discoveries.
I un-commented the innodb_buffer_pool_size = 12G and began monitoring the output of SHOW PROCESSLIST. I noticed, during lag times, that queries, not any in particular, would be stuck in a "Writing to net" state for sometimes minutes at a time. Those same queries would execute in milliseconds from, cli or mysql workbench.
In order to try an ascertain more data about what was going on with the server at the time of these slow writing to net states I installed Percona tools and set up a pt-stalk daemon to monitor the server and trigger a collect whenever a query sat in a writing to net state for longer than 5 seconds.
Looking at the pt-stalk output files, what seemed very interesting was the output in the opentables1 and opentables2 files which always says…
2015_12_06_05_01_04 Too many open tables: 2135
The number of open tables varies, but seems to always be well above 1000. Initially I took this as an error message and raised the open files ulimit of the server which did not fix the problem. Then I found this bug report https://bugs.launchpad.net/percona-toolkit/+bug/1307377 which explains that this is not an error but a warning pt-stalk fires when there are more than 1000 open tables.
One other thing pt-stalk was able to tell me was the exact time the lags started to occur. I notice that the slow writing to net states always starts occurring as soon as Mysql hits the 4.8 to 5GB ram usage. 4.8G happens to be the size of the tmp disk partition cpanel has set up. Although this partition seems to have plenty of free space, only ~380mb being used at any time, my gut feeling tells me I should try and increase it to see if that helps.
That will be my next step and I will report the results back here.
If there is anything else that anybody thinks I should try given the above information please do share.