MySQL Create Index Semantics: Clustered vs. Nonclustered

Tags: clustering, index, mysql

I'm new to databases and specifically MySQL, so pardon me if this is a simple question.

I'm trying to make sense of what happens in MySQL when I call CREATE INDEX. I know that primary keys are unique and correspond to a clustered index, but what happens if I have a table with no primary key defined and I create an index? I know that by default the data structure is a B-tree, but it's not clear whether this corresponds to a clustered or nonclustered index in the typical sense.

Specifically, I have a table that has four integer columns:

sales(item_id, time, price, location_id)

The only column that has unique values is time. If I create this table and then later run CREATE INDEX dx ON sales(item_id); what happens?

From here I'm led to believe that when I create the table without a primary key, because time is unique, that is the clustered index, and so I'm creating a non-clustered index on item_id.

My questions are: is this correct? If so how can I create a clustered index on a non-unique column in MySQL?

Best Answer

(This assumes you are using `ENGINE=InnoDB`.)

Your table must have a primary key. You have 3 choices:

(preferred) You explicitly provide such.
There is a UNIQUE index whose columns are all NOT NULL; InnoDB uses the first such index as the clustered index. (Sloppy; just make it the PK.)
(not a good option) A hidden PK will be provided.

The PK will be clustered and unique. That is the only choice for a PK in MySQL.

No other index can be clustered. Again, that is by definition.
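As a rough illustration of the three choices above, the following sketch creates one table per case and then looks at what InnoDB built. The table names (sales_explicit, sales_unique, sales_hidden) and the index name uq_time are hypothetical, the NOT NULL constraints are an assumption about the asker's data, and the metadata query assumes MySQL 8.0, where the relevant tables are INFORMATION_SCHEMA.INNODB_TABLES and INNODB_INDEXES (in 5.7 they are named INNODB_SYS_TABLES and INNODB_SYS_INDEXES). Querying them requires the PROCESS privilege.

```sql
-- Choice 1: explicit PRIMARY KEY; rows are clustered on time.
CREATE TABLE sales_explicit (
  item_id     INT NOT NULL,
  time        INT NOT NULL,
  price       INT NOT NULL,
  location_id INT NOT NULL,
  PRIMARY KEY (time)
) ENGINE=InnoDB;

-- Choice 2: no PRIMARY KEY, but a declared UNIQUE index on a NOT NULL column.
-- InnoDB promotes that index to be the clustered index.
CREATE TABLE sales_unique (
  item_id     INT NOT NULL,
  time        INT NOT NULL,
  price       INT NOT NULL,
  location_id INT NOT NULL,
  UNIQUE KEY uq_time (time)
) ENGINE=InnoDB;

-- Choice 3: neither. The values in time may happen to be unique, but without
-- a declared UNIQUE index InnoDB cannot know that, so it clusters on a hidden
-- row id instead.
CREATE TABLE sales_hidden (
  item_id     INT NOT NULL,
  time        INT NOT NULL,
  price       INT NOT NULL,
  location_id INT NOT NULL
) ENGINE=InnoDB;

-- The statement from the question: this is always a secondary index.
CREATE INDEX dx ON sales_hidden (item_id);

-- Inspect the indexes InnoDB created for the three tables.
SELECT t.NAME AS table_name, i.NAME AS index_name, i.TYPE
FROM INFORMATION_SCHEMA.INNODB_INDEXES AS i
JOIN INFORMATION_SCHEMA.INNODB_TABLES  AS t ON t.TABLE_ID = i.TABLE_ID
WHERE t.NAME LIKE '%/sales%';
```

For sales_hidden, the hidden clustered index should be listed under the name GEN_CLUST_INDEX next to dx. The meaning of the TYPE values is described in the MySQL documentation for INNODB_INDEXES.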

There are 3 choices for structure of an index:

(the most common) B+Tree. (Everything else in this answer is B+Tree)
FULLTEXT -- for searching for words in text
SPATIAL -- for 2-dimensional searches, such as geographic

sales(item_id, time, price, location_id)

Perhaps you need PRIMARY KEY(item_id, time). But that assumes you will never have two sales for the same item at exactly the same time.

So, it might be safer to have a 5th column:

id INT UNSIGNED AUTO_INCREMENT NOT NULL,
PRIMARY KEY(id)
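For context, here is a minimal sketch of how that fragment might sit in a full table definition. The column types follow the question (all integers); the NOT NULL constraints and the sample values are assumptions made for illustration.

```sql
CREATE TABLE sales (
  id          INT UNSIGNED AUTO_INCREMENT NOT NULL,
  item_id     INT NOT NULL,
  time        INT NOT NULL,
  price       INT NOT NULL,
  location_id INT NOT NULL,
  PRIMARY KEY (id),      -- clustered: rows are stored in id order
  INDEX dx (item_id)     -- secondary (nonclustered) index from the question
) ENGINE=InnoDB;

-- Two sales of the same item at the same time no longer collide,
-- because id distinguishes the rows.
INSERT INTO sales (item_id, time, price, location_id)
VALUES (42, 1000, 5, 1),
       (42, 1000, 5, 2);
```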

Let's see the SELECTs; from them we can decide what secondary index(es) you need. Or do it yourself here.

My questions are: is this correct? If so how can I create a clustered index on a non-unique column in MySQL?

That is a trick that very few people have discovered. Once you have something that is unique (such as the 5th column, above), do this instead of the PK above:

PRIMARY KEY(time, id),
INDEX(id)

Now, even if time is not unique, the PK is clustered and unique because id is tacked on. AUTO_INCREMENT requires nothing more than being the first column in some index.
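Written out in full, the trick might look like the sketch below. The table name sales_by_time is hypothetical (chosen so it does not clash with any existing sales table), and the NOT NULL constraints and the range values in the query are assumptions.

```sql
CREATE TABLE sales_by_time (
  id          INT UNSIGNED AUTO_INCREMENT NOT NULL,
  item_id     INT NOT NULL,
  time        INT NOT NULL,
  price       INT NOT NULL,
  location_id INT NOT NULL,
  PRIMARY KEY (time, id),  -- clustered on time; id makes the key unique
  INDEX (id)               -- satisfies AUTO_INCREMENT's "first column in some index" rule
) ENGINE=InnoDB;

-- Rows for a time range are stored next to each other in the clustered index,
-- so a range scan like this reads them in PK order.
SELECT item_id, price, location_id
FROM sales_by_time
WHERE time BETWEEN 1000 AND 2000;
```

The same pattern should apply to the column from the question: PRIMARY KEY(item_id, id) plus INDEX(id) would cluster the rows by the non-unique item_id.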

Related Solutions

Sql-server – Clustered vs Nonclustered Index

Since we are talking about the clustered index, just because you defined the CI key column as ID, you still have the DeletedDate data in the leaf data pages of the index. That's the nature of the clustered index: It is the table data.

Because you are typically having queries that look like:

select *
from YourTable
where DeletedDate is null;

You will likely benefit from a filtered index.

create nonclustered index IX_YourFilteredNci
on YourTable(<Key Columns Here>)
where DeletedDate is null;
go

I didn't explicitly put the key columns here (and nonkey columns through the use of the INCLUDE clause) because you didn't publish the DDL of your table.

As in my comment above to your question, the choice of key columns (not just columns, but also the order of the columns) will largely depend on your workload and the typical queries that would be using this index.

If you are looking to cover your query(ies), then you would need to ensure that the index satisfies all of the data required of the query(ies). Not to mention, if you have other WHERE clauses (besides your NULL check on DeletedDate) or joins to consider, then the order of your key columns can be the deciding factor between a scan and a seek. Even though the index is filtered, the penalty of a scan could be considerable, depending on how much data the index holds.

Sql-server – Meaning of “nonclustered located on primary”

This is the name of the filegroup or partition scheme that the index is created on. This can be specified when creating an index with a second ON clause.

The sp_help procedure calls sp_helpindex, which retrieves the name from sys.data_spaces.
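A short T-SQL sketch of both points, assuming a hypothetical table dbo.YourTable with a column SomeColumn (neither name comes from the original question): the first statement places an index on the PRIMARY filegroup using the second ON clause, and the query reads the same name back from sys.data_spaces.

```sql
-- First ON names the table and key columns;
-- second ON names the filegroup (or partition scheme).
CREATE NONCLUSTERED INDEX IX_YourTable_SomeColumn
ON dbo.YourTable (SomeColumn)
ON [PRIMARY];

-- Show where each index on the table is located.
SELECT i.name      AS index_name,
       i.type_desc AS index_type,
       ds.name     AS data_space_name
FROM sys.indexes AS i
JOIN sys.data_spaces AS ds
  ON ds.data_space_id = i.data_space_id
WHERE i.object_id = OBJECT_ID(N'dbo.YourTable');
```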

The primary filegroup contains the primary data file and any other files not specifically assigned to another filegroup. All pages for the system tables are allocated in the primary filegroup.

More info about Files and Filegroups here
