MySQL – Add index to low cardinality text column

indexMySQLperformance

I'm working on project where an enum column is being converted into a text column (I cannot change this). The cardinality of the column is low (7 unique values). Would I gain a performance increase by adding a 10-15 char index, or is the cardinality low enough where the index would result in diminishing returns?

Best Answer

The size or datatype of the column is irrelevant. It is the unique values that matter. If you only have 7 unique values that means 14.286% of the rows have to be considered.

Instead of giving the MySQL Query Optimizer the stress of figuring out that out, you should partitioning the table by hash:

CREATE TABLE mytable
(
    id INT NOT NULL AUTO_INCREMENT,
    ...
    ...
    myenum INT NOT NULL,
    PRIMARY KEY (id)
)
PARTITION BY HASH( myenum )
PARTITIONS 7;

No need to have the myenum in any indexes. Leave it to the MySQL Query Optimizer to search the correct partition should any SELECT query have a WHERE clause that includes AND myenum = ....

If you ever have to increase the number of unique values, you will have to increase the number of partitions.

Give it a Try !!!

UPDATE 2013-10-24 17:57

As I said in the comments, you should partition by the enum with the highest cardinality.

What about the other enums? DO NOT INDEX THE ENUM BY THEMSELVES !!!

If your SELECT queries include WHERE enum2... AND enum3=...`` AND enum4=...`, you should think about making compound indexes of enums.

For example, if you have enum2, enum3, and enum4, you could make compound indexes like these:

ALTER TABLE mytable ADD INDEX (enum2,enum3,enum4);
ALTER TABLE mytable ADD INDEX (enum3,enum4);

Which order should you choose?

CARDINALITY(enum2) > CARDINALITY(enum3)
CARDINALITY(enum3) > CARDINALITY(enum4)

CAVEAT : Again, I like to emphasize, if you partition by enum1, there is no need to index on enum1.

Related Solutions

MySQL VARCHAR vs TEXT – Differences Between VARCHAR and TEXT Data Types

divided linked to some info that explains the basic issue (there's performance differences), but it's not simple enough to say that one's always better than the other. (otherwise, there'd be no reason to have both.) Also, in MyISM, the 64k max size for VARCHAR isn't per field -- it's per record.

Basically, there's 4 ways to store strings in database records:

fixed length
C-style strings (marked with a NULL or similar character at the end of the string)
Pascal style strings (a few bytes to indicate length, then the string)
Pointers (store the string somewhere else)

MyISM uses something similar to #3 for VARCHAR, and a hybrid approach for TEXT where it stores the beginning of the string in the record, then rest of the string somewhere else. InnoDB is similar for VARCHAR, but stores the complete TEXT field outside of the record.

With 1&4, the stuff in the record is always the same length, so it's easier to skip over if you don't need the string, but need stuff after it. Both #2 and #3 aren't too bad for short strings ... #2 has to keep looking for the marker, while #3 can skip ahead ... as the strings get longer, #2 gets worse for this particular use case.

If you actually need to read the string, #4 is slower, as you have to read the record, then read the string which might be stored elsewhere on the disk, depending on just how that database handles it. #1 is always pretty straightforward, and again you run into similar issues where for #2 gets worse the longer the string is, while #3 is a little worse than #2 for very small strings, but better as it gets longer.

Then there's storage requirements ... #1 is always a fixed length, so it might have bloat if most strings aren't the max length. #2 has 1 extra byte; #3 typically has 2 extra bytes if max length = 255, 4 extra bytes if a 64k max. #4 has the pointer length, plus the rules for #3 typically.

For the specific implementations within MySQL 5.1, the docs for MyISM state:

Support for a true VARCHAR type; a VARCHAR column starts with a length stored in one or two bytes.
Tables with VARCHAR columns may have fixed or dynamic row length.
The sum of the lengths of the VARCHAR and CHAR columns in a table may be up to 64KB.

While for InnoDB :

The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 15 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector. The variable-length part of the header also contains the lengths of variable-length columns. Each length takes one or two bytes, depending on the maximum length of the column. If all columns in the index are NOT NULL and have a fixed length, the record header has no variable-length part.
For each non-NULL variable-length field, the record header contains the length of the column in one or two bytes. Two bytes will only be needed if part of the column is stored externally in overflow pages or the maximum length exceeds 255 bytes and the actual length exceeds 127 bytes. For an externally stored column, the two-byte length indicates the length of the internally stored part plus the 20-byte pointer to the externally stored part. The internal part is 768 bytes, so the length is 768+20. The 20-byte pointer stores the true length of the column.

...

as with so many other things when dealing with databases, if you're not sure what's best for your needs, try benchmarking it with similar data & usage, and see how they behave.

PostgreSQL Text Pattern Ops Index – Why Index text_pattern_ops on a Text Column?

The documentation often gives you an answer to such questions. Like in this case, too:

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale. As an example, you might index a varchar column like this:
CREATE INDEX test_index ON test_table (col varchar_pattern_ops);
Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index. Such queries cannot use the xxx_pattern_ops operator classes. (Ordinary equality comparisons can use these operator classes, however.) It is possible to create multiple indexes on the same column with different operator classes.

The documentation goes on to say:

If you do use the C locale, you do not need the xxx_pattern_ops operator classes, because an index with the default operator class is usable for pattern-matching queries in the C locale.

You can check your locale as follows (it is likely to be UTF8 rather than "C"):

postgres=> show lc_collate;
 lc_collate
-------------
 en_GB.UTF-8