Morgan gives a hint in his comment that InnoDB is constantly checking for corrupted pages by computing checksums on the pages it reads. If InnoDB finds a checksum mismatch, it will deliberately crash the server.
If you want to speed that process up (instead of waiting for InnoDB to read the corrupted page), you can use innochecksum:
Because checksum mismatches will cause InnoDB to deliberately shut down a running server, it can be preferable to use this tool rather than waiting for a server in production usage to encounter the damaged pages.
An interesting caveat:
innochecksum cannot be used on tablespace files that the server already has open. For such files, you should use CHECK TABLE to check tables within the tablespace.
So yes, for an online table, CHECK TABLE is probably the tool (or, as pointed out in another answer, mysqlcheck if you want to do more than a single database at a time).
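For example, a minimal sketch of both online options (the database and table names here are hypothetical):
# check a single table while the server is running
mysql -e "CHECK TABLE mydb.mytable;"
# or sweep every table in every database
mysqlcheck --all-databases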
If you can shut down your database, you can force checksum verification with innochecksum.
Anecdotal:
On an InnoDB tablespace of 29GB (with innodb_file_per_table=1), this script took about 2 minutes:
#!/bin/bash
# verify the checksums of every per-table tablespace file
# (the server must be shut down first)
for i in /var/lib/mysql/*/*.ibd
do
    innochecksum -v "$i"
done
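If you would rather stop at the first problem, here is a variant of the same loop (assuming, as the manual suggests, that innochecksum exits non-zero when it finds a mismatch):
#!/bin/bash
# abort on the first corrupted tablespace
for i in /var/lib/mysql/*/*.ibd
do
    innochecksum -v "$i" || { echo "checksum mismatch in $i"; exit 1; }
done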
As a bonus, since you are running Percona Server, they implemented a new method for fast InnoDB checksums. I've never used it, but it might speed up the process.
I would suggest that even though your innodb_buffer_pool_size is undoubtedly too large for the available memory (192GB would be the theoretical max "sane" value), your thread_cache_size is likely what's occasionally pushing you over the edge.
There's no rational justification for such a large value, particularly with max connections limited to 1200.
I will speculate that shrinking the buffer pool size may lengthen the interval between OOM events, and you definitely need to do that... but reducing the thread_cache_size value may be necessary to eliminate OOM events altogether.
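A minimal my.cnf sketch of that direction (the numbers are hypothetical and must be sized to your actual RAM and workload; note that changing the buffer pool size requires a restart on versions before 5.7):
# my.cnf (illustrative values only)
[mysqld]
# leave real headroom below physical memory for per-connection buffers
innodb_buffer_pool_size = 160G
# a small cache is usually enough; watch Threads_created (see below)
thread_cache_size = 16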
Individual client threads can have essentially unbounded memory growth, depending on what's needed to execute any particular query. (That's why the memory usage calculation formulas you'll see that multiply by max connections are essentially worthless for calculating meaningful numbers -- they only give you the least-worst-case scenario).
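For context, the kind of formula being dismissed looks roughly like this (the variable names are real per-session buffers, but the total is only a floor, not a ceiling):
# max_memory ≈ innodb_buffer_pool_size
#   + max_connections * (sort_buffer_size + join_buffer_size
#       + read_buffer_size + read_rnd_buffer_size + thread_stack)
# real per-connection usage can grow far beyond these fixed buffers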
With the thread cache at 600, you are leaving a door open for hundreds of memory-intensive queries to leave behind cached threads holding on to memory that may not be cleaned up for a long time, if ever. It's not technically a "leak," but it behaves similarly.
Decrease this variable toward the smallest value at which the Threads_created counter (from SHOW STATUS LIKE 'threads_created';) still stops growing somewhat consistently during moderate to high traffic periods, as sketched below... unless you have an incredibly bursty demand for concurrent connections, you only need a very small number of threads in the thread cache.
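A rough way to observe and adjust this online (the value 16 is purely illustrative):
# if this counter stays flat under normal load, the cache is big enough
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_created';"
# thread_cache_size is dynamic, so no restart is needed
mysql -e "SET GLOBAL thread_cache_size = 16;"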
The only circumstance I recall where I needed this value to be larger was an application that, on start/restart, attempted to make 48 new client connections in parallel at essentially the same time. (Yes, they had their reasons.) Without some number of cached threads available, the server couldn't reliably establish that many new connections so quickly.
And that's what the thread cache does -- reduces the resource cost of new connections to the server, at the expense of maintaining the old ones.
Note, the query cache is also probably too large for good performance. It's essentially unrelated to the OOM issue since it doesn't grow, but it's still excessive: the query cache is a choke point for every query, and a larger cache takes more processing time to manage.
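A quick sketch for checking and shrinking it (the 64M figure is an assumption, not a recommendation from this answer):
# inspect the current setting
mysql -e "SHOW VARIABLES LIKE 'query_cache_size';"
# query_cache_size is dynamic; 67108864 bytes = 64M (illustrative)
mysql -e "SET GLOBAL query_cache_size = 67108864;"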
Best Answer
Well, I've tested this and the answer is yes. One small thing to keep in mind: in RHEL 6 the default filesystem is EXT4, while in RHEL 7 it is XFS, so you will be mounting an EXT4 filesystem on a RHEL 7 machine.
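For what it's worth, that mount is straightforward, since RHEL 7 still ships full ext4 support (the device name and mount point below are hypothetical):
# device and mount point are hypothetical
mkdir -p /mnt/mysql-data
mount -t ext4 /dev/sdb1 /mnt/mysql-data
# or make it persistent in /etc/fstab:
# /dev/sdb1  /mnt/mysql-data  ext4  defaults  0 2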