Are there exceptions to the above clustered index rules?
The general guidance for selecting a clustered index is good, but there are sometimes additional considerations to be taken into account. Occasionally, these extra factors can be more important than the general 'rules'.
Your scenario is somewhat "special" in that you have a very wide table, with a set of queries that request a (presumably) unpredictable set of columns, although the query predicates are normally the same.
The original clustered index arrangement was likely expensive for operations that change data, since changing any part of the clustering key means changing all nonclustered indexes as well. In addition, physically moving whole rows is costly in terms of log generation, especially where a page split occurs.
That said, once a clustered index has been modified and split a fair bit, it will contain quite a bit of free space, making future movements less intensive, just as if a sensible fill factor had been set in the first place, and maintained as part of normal index maintenance.
It seems likely your production system had fallen into a sort of equilibrium, where page splitting occurred at a steady, reasonable rate. Nonclustered index maintenance would still be relatively expensive, but it seems that was not a dominant factor.
The crucial advantages of the original indexing arrangement were:
The clustered index (Company, Region, FlagA, FlagB, FlagC) matches the predicate of:
SELECT {unpredictable column list}
FROM table
WHERE Company = 'company'
AND Region = 'region'
AND flagA= 'A'
AND flagB = 'B'
AND flagC = 'C';
...while at the same time providing access to whichever columns are listed in the select list.
Queries of the form:
SELECT *
FROM table
WHERE ID = 1234;
...were adequately supported by the nonclustered primary key. This query always returns a single row, so only a single lookup via the clustered index is required after the row is located in the nonclustered index.
Changing to a clustered index on ID
, and a nonclustered index on (Company, Region, FlagA, FlagB, FlagC) makes the second query a little more efficient (eliminating one lookup per query), but makes the first query much less efficient (replacing zero lookups with ~5000).
Moreover, it is very possible the optimizer might choose not to use the nonclustered index at all, estimating that a full scan of the table would be cheaper than the 5000 lookups.
You're probably better off leaving the two main indexes as they are for the time being, while you sort out the 40-odd nonclustered indexes, and analyse the workload for the minimal set of covering nonclustered indexes that would be required. Once that data is available, you will be in a better position to consider fundamental indexing changes.
You may also want to check the existing monitoring and maintenance routines for this table. In many scenarios, the 'rule-breaking' clustered index will be just fine, if the table is given a fill factor that just prevents significant page splitting before the next maintenance window comes around.
Specific comments
PRIMARY KEY (id),
UNIQUE KEY id_type (id,type)
) ENGINE=InnoDB
The UNIQUE
key is totally useless. InnoDB "clusters" the data with the PRIMARY KEY. That is looking up id
lets you immediately find the rest of the columns, including type
. Also, a PK is a UNIQUE key, so the unique key adds no useful constraint.
WHERE id = X
The PK takes care of that. (Both SELECT
and DELETE
)
PRIMARY KEY (relationship_id,site_id,content_id),
KEY relationship (relationship_id)
Similarly, the KEY is useless, DROP it.
The table seems a bit strange. Is it describing relationships between sites and contents? And there can be multiple relationships between each pair?
The PK for that table is fine for most queries. But...
SELECT DISTINCT relationhip_id
FROM content_relations
WHERE site_id = X;
will do a "table scan". Since you say it has "very low" frequency, that may be OK. But if you decide to speed it up, add
INDEX(site_id, relationship_id)
INNER JOIN content_relations cr ON r.id = cr.relationship_id
WHERE cr.site_id = X
AND cr.content_id = Y
AND r.type = Z
If the optimizer starts with cr
,
cr: INDEX(site_id, content_id) -- in either order
r: the existing PK will be used
would be optimal.
If it starts with r
:
r: INDEX(type, id) -- in this order (fixed)
cr: the existing PK will be used
Side notes:
- I see no reason for adding
id AUTO_INCREMENT
to the relations table
- Do you really need
BIGINT
(8 bytes), almost no one exceeds INT UNSIGNED
(4 bytes, max of 4 billion) for ids.
- Why
LIMIT 1
? You don't care which one? (Perhaps you want an ORDER BY
? Caution: That would add wrinkles to the optimal indexes!)
- Is
relationships
just a mapping from an id to a number? Don't bother. Get rid of the table, and simply use type
in other tables.
- More on creating optimal indexes.
Structure of InnoDB Indexes
A table is a BTree, ordered by the PRIMARY KEY
. A leaf node contains all the columns of the table.
A secondary index (UNIQUE
or not) is a BTree, ordered by the secondary key. A leaf node contains the columns of the PRIMARY KEY
. This implies that, when you look up a row via a secondary key, it first drills down the secondary BTree, then drills down the Primary BTree.
A "covering index" contains all the columns needed for the SELECT
, hence there is no need to "reach over" into the data.
UNIQUE
is a regular index (secondary index arranged in a BTree), plus a uniqueness constraint. When INSERTing
, the constraint is checked. (There are other minor aspects of UNIQUE
.)
Back to your question about needing (or not) UNIQUE(id,type)
: PRIMARY KEY(id)
says there is a BTree ordered by the unique column id
. Adding UNIQUE (id, type)
would build another BTree, containing only id
and type
(the secondary columns) and nothing extra to handle the PK. Performing SELECT type FROM .. WHERE id=123
will drill down a BTree to find 123
and find type
right there. This statement applies whether it uses the PK or the secondary key. Hence, I say that the secondary key cannot justify its existence.
As for "INDEX(site_id,content_id) although relationship_id ...", note that I said that the secondary key will have the columns of the PK. So, it is effectively INDEX(site_id, content_id, relationship_id)
Explicitly listing all 3 columns has the advantage of making it more obvious that you have a "covering" index. Note: The order of the columns in a composite index does matter.
Best Answer
Yes and no.
An "array" in a programming language does some trivial address arithmetic to quickly locate the Nth entry in the array.
In MySQL, most indexes are built as B+Trees. (See Wikipedia) This structure is more complex than an array, but still the best available.
WHERE id=2
requires drilling down a "tree" of nodes to locate the item with "2" in the "id" column.A BTree has several advantages over a simple array. And these advantages are necessary for the generality of database operations.
DELETE * FROM t WHERE id=2
. (Arrays don't allow holes; BTrees do.) Note that it prevents the use of "address arithmetic" to locate a record.WHERE name = 'Venus'
. And this works as easily and almost as fast as using numbers.WHERE name BETWEEN 'Mars' AND 'Venus'
would be store alphabetically.PRIMARY KEY
and clusters it with the data.WHERE id=2
, ifid
is the PK) is one drill-down of one BTree to get to the entire row.SELECT * FROM t WHERE name='Venus'
withINDEX(name)
) is a little more complex. First drill down thename
index to find theid
, then drill down the PK+data BTree to find the entire row.INSERTing
orUPDATEing
a row. Effectively, it looks up the row via thePRIMARY KEY
(if given -- cfAUTO_INCREMENT
) and via everyUNIQUE
index. If any of those are a match, you get an error (unless doingINSERT IGNORE
). Otherwise, the PK's BTree and the Unique BTrees are poised to take the new/modified row. Dealing with potential dups is not free, but cheap.BETWEEN
, the first row is accessed (O(logN)), then each subsequent row is O(1).So, yes, an array lookup takes nanoseconds, which is faster than a database lookup, which takes microseconds (if cached in RAM), or milliseconds (if I/O is needed). That is the price you have to pay for unlimited size and many more features.