Mysql – What indexes should one use for this two-table setup

indexindex-tuningMySQLmysql-5.0

I have the following two-table setup:

CREATE TABLE relationships (
    id bigint(20) NOT NULL AUTO_INCREMENT,
    type varchar(20) COLLATE utf8mb4_unicode_ci NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY id_type (id,type)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE content_relations (
    relationship_id bigint(20) NOT NULL,
    site_id bigint(20) NOT NULL,
    content_id bigint(20) NOT NULL,
    PRIMARY KEY (relationship_id,site_id,content_id),
    KEY relationship (relationship_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

The currently used indexes are included, but that's exactly my question:

Given the above tables and the following queries, what indexes should one use (instead)?


For the first table only, the following queries are performed (each with the given frequency):

SELECT type
FROM relationships
WHERE id = X;

Frequency: high


DELETE
FROM relationships
WHERE id = X;

Frequency: low



For the second table only, the following queries are performed (each with the given frequency):

SELECT site_id, content_id
FROM content_relations
WHERE relationship_id = X;

Frequency: high


DELETE
FROM content_relations
WHERE relationship_id = X
    AND site_id = Y;

Frequency: moderate


DELETE
FROM content_relations
WHERE relationship_id = X;

Frequency: low


SELECT DISTINCT relationhip_id
FROM content_relations
WHERE site_id = X;

Frequency: very low



Lastly, there is the following query using both tables:

SELECT r.id
FROM relationships r
INNER JOIN content_relations cr ON r.id = cr.relationship_id
WHERE cr.site_id = X
    AND cr.content_id = Y
    AND r.type = Z
LIMIT 1;

Frequency: very high



The first table (i.e., relationships) has an auto-incremented (and thus unique) column id, so there is the PRIMARY KEY (id). I guess that's okay. 😉
Both queries using this table only include just the primary key's column, so that's very good already.
Then there is the query using both tables, which also includes the type column. Does it make sense to use the unique composite KEY (id,type) here? Or is it redundant, or at least not of great advantage? Please keep in mind that there are just a few different values for the type column.

The second table (i.e., content_relations) currently has a PRIMARY KEY that consists of the (unique) combination of all three columns. This key is used for the last query using both tables, and also serves as constraint, because for every relationship(_id) and site(_id) there can only be one (or no) content element with a specific ID.
There are two queries using just the relationship_id column in the WHERE clause, that's why there is the KEY (relationship_id).
Then there is one query that also includes the site_id column. Should one therefore add the composite KEY (relationship_id,site_id)?
I suppose there is no need for the KEY (site_id), which would only be used by a single, very infrequent query, right?

Note: The KEY (relationship_id) is basically a foreign key to the relationships table. However, as I am working in an environment that allows for MySQL versions/engines that do not include foreign key constraints, I cannot use a real FOREIGN KEY, and all its benefits. But this shouldn't be relevant for my question(s), right?


// EDIT: Here is a little more context.
I didn't put it in the question before just because it was already pretty bulky. Since an answerer asked for more, I will do this now, though. Be prepared, however, it's quite a lot information.

The tables are about relationships between content elements (CE). Each CE has an ID and a type. For each type, there is only one CE with a specific ID. There may, however, be CEs A and B (with types A and B), and both have the same ID.

Relations are allowed between two or more CEs that A) have the same type, and B) are from different sites (represented by their unique ID). That's why the type column is part of the relationships table (rather than being a column in the content_relations table).

Sites are stored in another table, which I cannot alter in any way. I just use the sites' unique IDs, and that's all.

The CEs are stored in type-specific tables, which I also cannot alter in any way. I just use a CE's ID and its type, and thus have a unique identifier.

The AUTO_INCREMENT for the id column in the relationships table is used, because there are lots of relationships for a specific type, and I need a unique identifier for every relationship. When inserting a new relationship for a given type, I just have to provide the type, and automatically get a unique id. That's what AUTO_INCREMENT is for, isn't it? Or did I misunderstand you, @Rick James?

I guess I don't really need bigint. As I reference other tables' columns, however, and as these columns are defined as bigint(20), I thought it okay, if not wise, to use the same definition.

There are two usages of LIMIT 1. For the first one, there shouldn't even be more than one entry (because I query for id and compare type, and (id,type) is unique). So yes, I could/should remove this, thanks. The second time, however, there might be several entries – all with the same relationship_id. That's why I happily stop if I got one result.

Is there anything else I should provide you with?


Here is a visual representation of what this is all about:

visual representation

The individual figures represent individual content elements, while their individual type is represented by their individual shape, and their ID is written inside. Each color represents a relationship (i.e., a group of related content elements). As you can see, for any color (i.e., relationship), the shape (i.e., type) is the same, while there is not more than one figure (i.e., content element) from each site.

I hope this makes it more clear – and not more confusing.

Best Answer

Specific comments

PRIMARY KEY (id),
UNIQUE KEY id_type (id,type)
) ENGINE=InnoDB

The UNIQUE key is totally useless. InnoDB "clusters" the data with the PRIMARY KEY. That is looking up id lets you immediately find the rest of the columns, including type. Also, a PK is a UNIQUE key, so the unique key adds no useful constraint.

WHERE id = X

The PK takes care of that. (Both SELECT and DELETE)

PRIMARY KEY (relationship_id,site_id,content_id),
KEY relationship (relationship_id)

Similarly, the KEY is useless, DROP it.

The table seems a bit strange. Is it describing relationships between sites and contents? And there can be multiple relationships between each pair?

The PK for that table is fine for most queries. But...

SELECT DISTINCT relationhip_id
FROM content_relations
WHERE site_id = X;

will do a "table scan". Since you say it has "very low" frequency, that may be OK. But if you decide to speed it up, add

INDEX(site_id, relationship_id)

INNER JOIN content_relations cr ON r.id = cr.relationship_id
WHERE cr.site_id = X
  AND cr.content_id = Y
  AND r.type = Z

If the optimizer starts with cr,

cr: INDEX(site_id, content_id) -- in either order
r:  the existing PK will be used

would be optimal.

If it starts with r:

r:  INDEX(type, id) -- in this order  (fixed)
cr: the existing PK will be used

Side notes:

  • I see no reason for adding id AUTO_INCREMENT to the relations table
  • Do you really need BIGINT (8 bytes), almost no one exceeds INT UNSIGNED (4 bytes, max of 4 billion) for ids.
  • Why LIMIT 1? You don't care which one? (Perhaps you want an ORDER BY? Caution: That would add wrinkles to the optimal indexes!)
  • Is relationships just a mapping from an id to a number? Don't bother. Get rid of the table, and simply use type in other tables.
  • More on creating optimal indexes.

Structure of InnoDB Indexes

A table is a BTree, ordered by the PRIMARY KEY. A leaf node contains all the columns of the table.

A secondary index (UNIQUE or not) is a BTree, ordered by the secondary key. A leaf node contains the columns of the PRIMARY KEY. This implies that, when you look up a row via a secondary key, it first drills down the secondary BTree, then drills down the Primary BTree.

A "covering index" contains all the columns needed for the SELECT, hence there is no need to "reach over" into the data.

UNIQUE is a regular index (secondary index arranged in a BTree), plus a uniqueness constraint. When INSERTing, the constraint is checked. (There are other minor aspects of UNIQUE.)

Back to your question about needing (or not) UNIQUE(id,type): PRIMARY KEY(id) says there is a BTree ordered by the unique column id. Adding UNIQUE (id, type) would build another BTree, containing only id and type (the secondary columns) and nothing extra to handle the PK. Performing SELECT type FROM .. WHERE id=123 will drill down a BTree to find 123 and find type right there. This statement applies whether it uses the PK or the secondary key. Hence, I say that the secondary key cannot justify its existence.

As for "INDEX(site_id,content_id) although relationship_id ...", note that I said that the secondary key will have the columns of the PK. So, it is effectively INDEX(site_id, content_id, relationship_id) Explicitly listing all 3 columns has the advantage of making it more obvious that you have a "covering" index. Note: The order of the columns in a composite index does matter.