Say I have roughly 200 million records and have an indexed column user_id
on a product
table. If I select back 1 million records by doing the following query :
select * from product where user_id = 1
(Assuming myuser_id
is indexed.)
Would the other 199 million rows' column data affect the efficiency of my query ?
Example :
- Consider 199 million rows were completely populated (containing blobs / varchars)
- Consider 199 million rows were only partially populated, and carried a lot of null values
I've always understood this as the query would perform the same in either scenario – since user_id
is indexed and could bypass traversing through all the data – but I'd like to get confirmation on this.
If there's another question / resource that answers this – feel free to redirect me elsewhere. Just looking for some confirmation (and an explanation if my assumption incorrect).
Thanks.
Best Answer
Summary: It depends on
user_id
.Details:
For smaller numbers, I would simply say "No". But you describe a very large table -- possibly large enough to have issues with caching.
Let me start by explaining a few things...
user_id = 1
is a tiny percentage of the table, so it is very likely thatINDEX(user_id)
will be used.user_id = 1
; scan forward until that no longer holds true (1M rows as you say). For each of those index rows, reach into the data BTree to get*
(using thePRIMARY KEY
).user_id
, that will be upwards of 1M lookups in the data. InnoDB blocks are 16KB (by default). So this might involve 16GB of data being loaded.SELECT *
, specify only what you want -- this is an example of a significant speedup.NULLs
, then that changes how many rows are packed in each 16KB block. This may (but probably won't) mean that you won't need a full 1M blocks for the desired rows. That is, the caching in the buffer_pool may be useful.innodb_buffer_pool_size
? If it is much less than 32GB, then there will be a lot of I/O. If it a not more than 32GB, then a second run of that same query will find everything in cache, and run a lot faster.(The
EXPLAIN
will point out whether the index will be used, but not the rest of the details.)