Mysql – Does a unique clustered index give array-like lookup performance

clustered-indexindexindex-tuningMySQL

Suppose I have a MySQL table with a unique integer index id that is auto-incremented for each row.

id | PlanetName
---+-----------
 0 | Mercury
 1 | Venus
 2 | Neptune

Suppose that I am primarily interested in optimizing queries that look up rows by the unique index. For example:

SELECT PlanetName FROM Planets WHERE id = 2

If I were programming in another language and I had the data loaded up into an array, then it would be very quick to perform these lookups by pointer arithmetic. I would like to be able to get the same sort of performance when I carry out the queries in MySQL.

My thinking is:

a clustered index means the data is stored sequentially (i.e., very similarly to how a primitive array is stored in memory)
a unique integer index ensures that SQL does not have to worry about duplicate rows

So if I create a unique clustered index for id, does that mean that MySQL will carry out a pointer arithmetic-type lookup?

Best Answer

Yes and no.

An "array" in a programming language does some trivial address arithmetic to quickly locate the Nth entry in the array.

In MySQL, most indexes are built as B+Trees. (See Wikipedia) This structure is more complex than an array, but still the best available. WHERE id=2 requires drilling down a "tree" of nodes to locate the item with "2" in the "id" column.

A BTree has several advantages over a simple array. And these advantages are necessary for the generality of database operations.

Database tables are designed to handle an arbitrarily large number of items. Arrays are limited to what can fit in RAM. This effectively prevents the address arithmetic of arrays, and forces some other implementation.
You can DELETE * FROM t WHERE id=2. (Arrays don't allow holes; BTrees do.) Note that it prevents the use of "address arithmetic" to locate a record.
A BTree index can use a string for lookup -- WHERE name = 'Venus'. And this works as easily and almost as fast as using numbers.
Since BTrees are "blocks"; they get scattered around and they could become empty. This leads to the overhead of maintenance of the tree. Don't worry about this; BTrees are still very fast, on average.
"Clustering" in this context implies that ids 1,2,3,... are "consecutive" and "adjacent" in some sense. (Or "1,3,...", if id=2 has been deleted.) In practice, there are about 100 consecutive values sitting in one B+Tree block. This allows for "range" to be very efficient. Example: WHERE name BETWEEN 'Mars' AND 'Venus' would be store alphabetically.
When a 'range' query runs off the end of a B+Tree block, the next block is linked from it. (This is the "+".)
Technically, array lookup is O(1) and a BTree lookup is O(logN). But the log is not that big a number -- A BTree for a million rows is about 3 levels deep; for a trillion rows, only about 6 levels deep. That is, looking up a row in a trillion-row table is only about twice as slow as in a million-row table.
MySQL (InnoDB specifically) requires a unique PRIMARY KEY and clusters it with the data.
Hence, a lookup by the PK (WHERE id=2, if id is the PK) is one drill-down of one BTree to get to the entire row.
A secondary index (not the PK) is implemented as a B+Tree with the key column(s) plus the column(s) of the PK.
Hence, lookup by a secondary key (SELECT * FROM t WHERE name='Venus' with INDEX(name)) is a little more complex. First drill down the name index to find the id, then drill down the PK+data BTree to find the entire row.
Prevention of duplicates -- This happens when INSERTing or UPDATEing a row. Effectively, it looks up the row via the PRIMARY KEY (if given -- cf AUTO_INCREMENT) and via every UNIQUE index. If any of those are a match, you get an error (unless doing INSERT IGNORE). Otherwise, the PK's BTree and the Unique BTrees are poised to take the new/modified row. Dealing with potential dups is not free, but cheap.
For a BETWEEN, the first row is accessed (O(logN)), then each subsequent row is O(1).

So, yes, an array lookup takes nanoseconds, which is faster than a database lookup, which takes microseconds (if cached in RAM), or milliseconds (if I/O is needed). That is the price you have to pay for unlimited size and many more features.

Related Solutions

Sql-server – Clustered Index Help, where did i go wrong

Are there exceptions to the above clustered index rules?

The general guidance for selecting a clustered index is good, but there are sometimes additional considerations to be taken into account. Occasionally, these extra factors can be more important than the general 'rules'.

Your scenario is somewhat "special" in that you have a very wide table, with a set of queries that request a (presumably) unpredictable set of columns, although the query predicates are normally the same.

The original clustered index arrangement was likely expensive for operations that change data, since changing any part of the clustering key means changing all nonclustered indexes as well. In addition, physically moving whole rows is costly in terms of log generation, especially where a page split occurs.

That said, once a clustered index has been modified and split a fair bit, it will contain quite a bit of free space, making future movements less intensive, just as if a sensible fill factor had been set in the first place, and maintained as part of normal index maintenance.

It seems likely your production system had fallen into a sort of equilibrium, where page splitting occurred at a steady, reasonable rate. Nonclustered index maintenance would still be relatively expensive, but it seems that was not a dominant factor.

The crucial advantages of the original indexing arrangement were:

The clustered index (Company, Region, FlagA, FlagB, FlagC) matches the predicate of:
```
SELECT {unpredictable column list}
FROM table 
WHERE Company = 'company' 
AND Region = 'region' 
AND flagA= 'A'
AND flagB = 'B'
AND flagC = 'C';
```
...while at the same time providing access to whichever columns are listed in the select list.
Queries of the form:
```
SELECT * 
FROM table 
WHERE ID = 1234;
```
...were adequately supported by the nonclustered primary key. This query always returns a single row, so only a single lookup via the clustered index is required after the row is located in the nonclustered index.

Changing to a clustered index on ID, and a nonclustered index on (Company, Region, FlagA, FlagB, FlagC) makes the second query a little more efficient (eliminating one lookup per query), but makes the first query much less efficient (replacing zero lookups with ~5000).

Moreover, it is very possible the optimizer might choose not to use the nonclustered index at all, estimating that a full scan of the table would be cheaper than the 5000 lookups.

You're probably better off leaving the two main indexes as they are for the time being, while you sort out the 40-odd nonclustered indexes, and analyse the workload for the minimal set of covering nonclustered indexes that would be required. Once that data is available, you will be in a better position to consider fundamental indexing changes.

You may also want to check the existing monitoring and maintenance routines for this table. In many scenarios, the 'rule-breaking' clustered index will be just fine, if the table is given a fill factor that just prevents significant page splitting before the next maintenance window comes around.

Mysql – What indexes should one use for this two-table setup

Specific comments

PRIMARY KEY (id),
UNIQUE KEY id_type (id,type)
) ENGINE=InnoDB

The UNIQUE key is totally useless. InnoDB "clusters" the data with the PRIMARY KEY. That is looking up id lets you immediately find the rest of the columns, including type. Also, a PK is a UNIQUE key, so the unique key adds no useful constraint.

WHERE id = X

The PK takes care of that. (Both SELECT and DELETE)

PRIMARY KEY (relationship_id,site_id,content_id),
KEY relationship (relationship_id)

Similarly, the KEY is useless, DROP it.

The table seems a bit strange. Is it describing relationships between sites and contents? And there can be multiple relationships between each pair?

The PK for that table is fine for most queries. But...

SELECT DISTINCT relationhip_id
FROM content_relations
WHERE site_id = X;

will do a "table scan". Since you say it has "very low" frequency, that may be OK. But if you decide to speed it up, add

INDEX(site_id, relationship_id)

INNER JOIN content_relations cr ON r.id = cr.relationship_id
WHERE cr.site_id = X
  AND cr.content_id = Y
  AND r.type = Z

If the optimizer starts with cr,

cr: INDEX(site_id, content_id) -- in either order
r:  the existing PK will be used

would be optimal.

If it starts with r:

r:  INDEX(type, id) -- in this order  (fixed)
cr: the existing PK will be used

Side notes:

I see no reason for adding id AUTO_INCREMENT to the relations table
Do you really need BIGINT (8 bytes), almost no one exceeds INT UNSIGNED (4 bytes, max of 4 billion) for ids.
Why LIMIT 1? You don't care which one? (Perhaps you want an ORDER BY? Caution: That would add wrinkles to the optimal indexes!)
Is relationships just a mapping from an id to a number? Don't bother. Get rid of the table, and simply use type in other tables.
More on creating optimal indexes.

Structure of InnoDB Indexes

A table is a BTree, ordered by the PRIMARY KEY. A leaf node contains all the columns of the table.

A secondary index (UNIQUE or not) is a BTree, ordered by the secondary key. A leaf node contains the columns of the PRIMARY KEY. This implies that, when you look up a row via a secondary key, it first drills down the secondary BTree, then drills down the Primary BTree.

A "covering index" contains all the columns needed for the SELECT, hence there is no need to "reach over" into the data.

UNIQUE is a regular index (secondary index arranged in a BTree), plus a uniqueness constraint. When INSERTing, the constraint is checked. (There are other minor aspects of UNIQUE.)

Back to your question about needing (or not) UNIQUE(id,type): PRIMARY KEY(id) says there is a BTree ordered by the unique column id. Adding UNIQUE (id, type) would build another BTree, containing only id and type (the secondary columns) and nothing extra to handle the PK. Performing SELECT type FROM .. WHERE id=123 will drill down a BTree to find 123 and find type right there. This statement applies whether it uses the PK or the secondary key. Hence, I say that the secondary key cannot justify its existence.

As for "INDEX(site_id,content_id) although relationship_id ...", note that I said that the secondary key will have the columns of the PK. So, it is effectively INDEX(site_id, content_id, relationship_id) Explicitly listing all 3 columns has the advantage of making it more obvious that you have a "covering" index. Note: The order of the columns in a composite index does matter.

Best Answer

Related Solutions

Sql-server – Clustered Index Help, where did i go wrong

Mysql – What indexes should one use for this two-table setup

Related Question