How can the following SELECT WHERE IN case be optimized?
I have a table with over 100 million rows with only 3 columns. The primary key (col1) is a 127 varchar. I am performing a SELECT col1 WHERE col1 IN (…) where the IN clause is 5,000 strings. I am just trying to see which of the 5,000 strings are in the DB as the primary key.
With a dedicated server and InnoDB table, the query takes anywhere from 3 to 10 seconds which is unacceptable. I don't believe 100 million+ rows should be too difficult for MySQL to do a SELECT from, even when selecting 5k rows, but perhaps I am wrong?
What could be done to optimize this? I have read a little about FULLTEXT keys – would those be better since the key is a 127 varchar? Or would some type of JOIN or UNION speed up this query over a large IN clause?
Any help would be appreciated! Thanks!
—-EDIT—-
SHOW ENGINE INNODB STATUS;
| InnoDB | |
=====================================
2014-07-14 10:59:19 2bf5cf25700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 5 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 6664 srv_active, 0 srv_shutdown, 142740 srv_idle
srv_master_thread log flush and writes: 149372
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 417120
OS WAIT ARRAY INFO: signal count 449454
Mutex spin waits 323558, rounds 2089912, OS waits 48403
RW-shared spins 49101, rounds 462555, OS waits 12976
RW-excl spins 406820, rounds 11261153, OS waits 350839
Spin rounds per wait: 6.46 mutex, 9.42 RW-shared, 27.68 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 21503
Purge done for trx's n:o < 21472 undo n:o < 0 state: running but idle
History list length 641
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 15895, OS thread handle 0x2bf5cf25700, query id 399305 localhost root init
SHOW ENGINE INNODB STATUS
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1203330 OS file reads, 2141172 OS file writes, 78570 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5, seg size 7, 0 merges
merged operations:
insert 0, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 276671, node heap has 52 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 117811837521
Log flushed up to 117811837521
Pages flushed up to 117811837521
Last checkpoint at 117811837521
0 pending log writes, 0 pending chkp writes
21279 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137363456; in additional pool allocated 0
Dictionary memory allocated 720503
Buffer pool size 8191
Free buffers 1024
Database pages 7115
Old database pages 2606
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4741, not young 1045985243
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1202370, created 1138775, written 2068616
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 7115, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Main thread process no. 23063, id 3020409116416, state: sleeping
Number of rows inserted 113995729, updated 144489445, deleted 0, read 171054938
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Best Answer
The following is a long shot, as we do not know anything about your hardware, InnoDB configuration, and query specifics, but I bet you are using the wrong tool for the job (InnoDB Engine).
What you are trying to achieve is creating a very heavy index (up to 127 characters, which may take -this is a broad approximation- 127*3 bytes per entry), which is created using the only method available for InnoDB, a B+Tree. Also, as rows are clusterized around the primary key, the whole row is actually on the index, and accessing the primary key means accessing the page with the whole row content.
In short, you have a unique index, which contains your whole table, and which should fit more or less on memory (not necessarily all, but in this case your working set seems to be most of your the rows). How big is your InnoDB buffer pool? How is your buffer pool hit ratio? You can check both parameters with
SHOW ENGINE INNODB STATUS
. My bet is that your buffer pool is too small or even that your you do not have enough physical memory to hold your working set. In both cases, this may be forcing InnoDB to perform IOPS for every query. You may think that you should not need to have everything cache for everything to work well, and you should be right. But for your particular workload (large PKs), InnoDB is not the best engine. A hash index, available in other RDBMS and MySQL engines, would probably be smaller and faster, but it is not supported by InnoDB. Additionally,IN + list of values
with a huge number of rows may not be the most optimal way of querying (at MySQL level), but it should be certainly faster than doing the queries individually.EXPLAIN
) is using therange
JOIN type, and not doing full table scans.