Mysql – Optimizing large MySQL SELECT WHERE IN clauses

database-designMySQLperformance

How can the following SELECT WHERE IN case be optimized?

I have a table with over 100 million rows with only 3 columns. The primary key (col1) is a 127 varchar. I am performing a SELECT col1 WHERE col1 IN (…) where the IN clause is 5,000 strings. I am just trying to see which of the 5,000 strings are in the DB as the primary key.

With a dedicated server and InnoDB table, the query takes anywhere from 3 to 10 seconds which is unacceptable. I don't believe 100 million+ rows should be too difficult for MySQL to do a SELECT from, even when selecting 5k rows, but perhaps I am wrong?

What could be done to optimize this? I have read a little about FULLTEXT keys – would those be better since the key is a 127 varchar? Or would some type of JOIN or UNION speed up this query over a large IN clause?

Any help would be appreciated! Thanks!

—-EDIT—-

SHOW ENGINE INNODB STATUS;

| InnoDB |      | 
=====================================
2014-07-14 10:59:19 2bf5cf25700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 5 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 6664 srv_active, 0 srv_shutdown, 142740 srv_idle
srv_master_thread log flush and writes: 149372
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 417120
OS WAIT ARRAY INFO: signal count 449454
Mutex spin waits 323558, rounds 2089912, OS waits 48403
RW-shared spins 49101, rounds 462555, OS waits 12976
RW-excl spins 406820, rounds 11261153, OS waits 350839
Spin rounds per wait: 6.46 mutex, 9.42 RW-shared, 27.68 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 21503
Purge done for trx's n:o < 21472 undo n:o < 0 state: running but idle
History list length 641
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 15895, OS thread handle 0x2bf5cf25700, query id 399305 localhost root init
SHOW ENGINE INNODB STATUS
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1203330 OS file reads, 2141172 OS file writes, 78570 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5, seg size 7, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 276671, node heap has 52 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 117811837521
Log flushed up to   117811837521
Pages flushed up to 117811837521
Last checkpoint at  117811837521
0 pending log writes, 0 pending chkp writes
21279 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 720503
Buffer pool size   8191
Free buffers       1024
Database pages     7115
Old database pages 2606
Modified db pages  0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4741, not young 1045985243
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1202370, created 1138775, written 2068616
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 7115, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Main thread process no. 23063, id 3020409116416, state: sleeping
Number of rows inserted 113995729, updated 144489445, deleted 0, read 171054938
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s

Best Answer

The following is a long shot, as we do not know anything about your hardware, InnoDB configuration, and query specifics, but I bet you are using the wrong tool for the job (InnoDB Engine).

What you are trying to achieve is creating a very heavy index (up to 127 characters, which may take -this is a broad approximation- 127*3 bytes per entry), which is created using the only method available for InnoDB, a B+Tree. Also, as rows are clusterized around the primary key, the whole row is actually on the index, and accessing the primary key means accessing the page with the whole row content.

In short, you have a unique index, which contains your whole table, and which should fit more or less on memory (not necessarily all, but in this case your working set seems to be most of your the rows). How big is your InnoDB buffer pool? How is your buffer pool hit ratio? You can check both parameters with SHOW ENGINE INNODB STATUS. My bet is that your buffer pool is too small or even that your you do not have enough physical memory to hold your working set. In both cases, this may be forcing InnoDB to perform IOPS for every query. You may think that you should not need to have everything cache for everything to work well, and you should be right. But for your particular workload (large PKs), InnoDB is not the best engine. A hash index, available in other RDBMS and MySQL engines, would probably be smaller and faster, but it is not supported by InnoDB. Additionally, IN + list of values with a huge number of rows may not be the most optimal way of querying (at MySQL level), but it should be certainly faster than doing the queries individually.

  • Make sure that the query planner (you can check it with EXPLAIN) is using the range JOIN type, and not doing full table scans.
  • After that, the first thing I would recommend you to tune your InnoDB buffer to reduce InnoDB cache misses.
  • The next thing to try is to emulate a hash index by creating a small secondary index. This method is explained in the book "High Performance MySQL". This way, only the small secondary index would be cached on memory and it may better fit your physical memory.
  • If these do not work, before changing technology, I would recommend you to try to use a different engine with works well with key-value datasets. Maybe TokuDB could be better to handle this? Also, the memcached interface integrated into MySQL/InnoDB 5.6 could be another solution? Multi-get seems to fit your solution very well.
  • At last, if your load is mostly reads, you could try external technology, like full text search engines -as you mention-, but be careful as those kind of pieces of software tend to rely on fuzzy search, and may omit some results and they tend to not be fully ACID compliant (those are things that they have to sacrifice in exchange for query speed).