I have around 500 GB of data in one table of MySQL which has around 5 billion records. It has around 15 columns. It has index on all the required columns. When I do select * from big_table where index_column = some_value
it takes couple of minutes to return the data.
I am not sure how indexing works internally here.Here is my understanding
- DB vendor will bring indexed column complete(not just data in where clause) data in memory first
- Find the values(in where clause) under data fetched in step_1 to get the record row location in actual table.
- Then another IO call will go to disk to get all required data based on row location fetched in step_2
Is that correct ? I am not sure on first step i.e. whether complete data of indexed column is fetched in memory or values are searched on disk itself without bringing the complete data in memory ?
Best Answer
This is an example for Db2, so a number of details will differ if you use another DBMS, but in general, it will look pretty much the same. Lets create a sample table:
So, we have a table with 100000 rows. Let's examine the unique index, before we add any data we will have just the root node:
NLEVELS is the height of the tree, and NLEAF are the number of leaf-pages. After adding data:
So we have a tree that in ASCII-art looks something like:
The 206 leaf-pages address all data in the table. If we look at the plan for a query like:
it will look like:
But what information did we touch to achieve this?
POOL_DATA_P_READS (12) is the number of pages that could not be found in memory and had to be read from disk. In any normal situation, 1 logical read results in 0 or 1 physical reads.
POOL_INDEX_L_READS (66) is the number of pages read from memory for index
So, even with this minimal amount of data in the table we only touch a fraction of pages (66) compared with the total number of pages (206+intermediate pages) for indexes.
In general, a read from the index will be a walk down the tree choosing a path at each new level.