MySQL Large Table Indexes – Recommended length of index

indexMySQLperformance

I am working with a table that has a couple hundred million rows of data. It uses varchar(32) UUID's as it's keys both primary and foreign (there are two). I was wondering with this size of table, how much of each field do I need to index to keep good performance but not index every character of those three fields.

For example if I set an index length of 24 characters, will I see any noticeable difference in performance? I was hoping there might be some kind of algorithm out there that can be used to estimate performance based on length of field and length of index.

Currently the table is ~100GB in size, and the index is taking up over half of that space. I would like to recover some of that space, and reduce the growth rate of the table.

Best Answer

This is not an answer to your question (how to cut index length), but an alternative to reach the same objective: use less space in your DB.

If you want to optimize storage you should not store UUIDs as varchar(32) (nor character(32)).

You can store them as binary(16) with no loss of information, and save (at least) 50% of space. A UUID contains 128 bits of information. If stored as a text representing the hexadecimal value, you use 4 bits of information out of every character, which employs 8 bits assuming the text is encoded as ASCII or latin1. So, you waste half your available bits. Storing the information in binary format, you waste none.

My response is heavily inspired in Storing UUID Values in MySQL Tables, which I would recommend you read.

The following example shows you how to store the same UUID data in the two different formats:

 CREATE TABLE t
 (
      uuid_a character(32) PRIMARY KEY
     ,uuid_b binary(16) UNIQUE KEY
 ) ;

Both uuid_a and uuid_b will be used to store the same UUID, encoded in two different ways.

Assuming we have a table called generator_64k with 64k of data... we fill t with lots of uuid() using:

 -- We fill it with random data
 INSERT INTO 
     t 
 SELECT
      replace(u,'-','')             -- text version
     ,unhex(replace(u,'-',''))      -- binary version
 FROM
     (SELECT 
         uuid() AS u 
      FROM 
         generator_64k
     ) AS s0 ;

... and we add also a well known value:

 -- We fill it with one known piece of data
 INSERT INTO 
     t 
 SELECT
      replace(u,'-','')             -- text version
     ,unhex(replace(u,'-',''))      -- binary version
 FROM
     (SELECT
         'aab5d5fd-70c1-11e5-a4fb-b026b977eb28' AS u
     ) AS s0 ;

You can then query the table and look for the known value using either of the representations:

 -- Encoded as character(32)
 SELECT
     uuid_a, hex(uuid_b)
 FROM 
     t
 WHERE
     uuid_a = replace('aab5d5fd-70c1-11e5-a4fb-b026b977eb28', '-', '') ;

 -- Encoded as binary(16)
 SELECT
     uuid_a, hex(uuid_b)
 FROM 
     t
 WHERE
     uuid_b = unhex(replace('aab5d5fd-70c1-11e5-a4fb-b026b977eb28', '-', '')) ;

You get, in both cases, the same result:

 uuid_a                          | hex(uuid_b)                    
 :------------------------------- | :-------------------------------
 aab5d5fd70c111e5a4fbb026b977eb28 | AAB5D5FD70C111E5A4FBB026B977EB28
 

In practice, you would only use uuid_b in your table.

By doing this, you halve the amount of space used to store your UUIDs (both in table and indexes), without having to resort to using a substring in your index (which can affect your performance, and still not save that much).

See all the code at dbfiddle here