MySQL – Is query efficiency affected by data not relevant to the query


Say I have roughly 200 million records in a product table with an indexed column user_id. If I select back 1 million records by running the following query:

  • select * from product where user_id = 1 (Assuming my user_id is indexed.)

Would the column data in the other 199 million rows affect the efficiency of my query?

Example :

  1. Consider that the other 199 million rows are fully populated (containing blobs / varchars)
  2. Consider that the other 199 million rows are only partially populated and carry a lot of NULL values

I've always understood that the query would perform the same in either scenario – since user_id is indexed, the query can avoid traversing all of the data – but I'd like to get confirmation on this.

If there's another question / resource that answers this, feel free to redirect me elsewhere. Just looking for some confirmation (and an explanation if my assumption is incorrect).

Thanks.

Best Answer

Summary: It depends on:

  • The percentage of rows being fetched (at least as estimated by the imprecise statistics)
  • The size of the buffer_pool (relative to the size of the table)
  • The ENGINE
  • How many off-record blobs, etc., you have
  • Whether the data is somewhat ordered by user_id

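Several of these factors can be checked directly. A quick sketch, assuming the database is named mydb (substitute your own schema and table names):

```sql
-- Check the ENGINE and the on-disk size of the table and its indexes
SELECT engine, table_rows, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'mydb' AND table_name = 'product';

-- Check the buffer pool size (in bytes)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
```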
Details:

For smaller numbers, I would simply say "No". But you describe a very large table -- possibly large enough to have issues with caching.

Let me start by explaining a few things...

  • user_id = 1 is a tiny percentage of the table, so it is very likely that INDEX(user_id) will be used.
  • I am assuming you are using InnoDB.
  • The lookup will work this way: Reach into the index for the first index row with user_id = 1; scan forward until that condition no longer holds (1M rows, as you say). For each of those index rows, reach into the data BTree (via the PRIMARY KEY) to get *.
  • If the data is not effectively sorted by user_id, that will be upwards of 1M lookups in the data. InnoDB blocks are 16KB (by default). So this might involve 16GB of data being loaded.
  • You mention blobs, etc. They are probably stored 'off record', that is, in some other block. Maybe another 16GB to be fetched? If you don't need the blobs, don't say SELECT *, specify only what you want -- this is an example of a significant speedup.
  • If the other rows are full of NULLs, then that changes how many rows are packed in each 16KB block. This may (but probably won't) mean that you won't need a full 1M blocks for the desired rows. That is, the caching in the buffer_pool may be useful.
  • What is the value of innodb_buffer_pool_size? If it is much less than 32GB, there will be a lot of I/O. If it is at least 32GB, a second run of that same query will find everything in cache and run a lot faster.
  • If it weren't a "tiny percentage", but instead over about 20%, the Optimizer would probably decide that a table scan is cheaper than bouncing back and forth between the index's BTree and the data BTree.
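To illustrate the "don't say SELECT *" point, a sketch (the column names id, name, price are placeholders for whatever columns you actually need):

```sql
-- Fetches every column, including off-record blob pages:
SELECT * FROM product WHERE user_id = 1;

-- Fetches only the named columns; the off-record blob pages
-- are never read, which can cut the I/O roughly in half here:
SELECT id, name, price FROM product WHERE user_id = 1;
```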

(EXPLAIN will point out whether the index will be used, but not the rest of these details.)
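For example (the index name shown under key will be whatever your index on user_id is actually named):

```sql
EXPLAIN SELECT * FROM product WHERE user_id = 1;
-- In the output, look at:
--   type = ref and key = <your index name>  -> the index is being used
--   rows                                    -> the optimizer's row estimate
--   type = ALL                              -> a full table scan was chosen instead
```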