MySQL – How to predict index size

Tags: index-tuning, MySQL, mysql-5.7

I have a table with ~100M records of unique URLs stored as VARCHAR(1023), like /t/some_page_name_..., with an average length of 51 characters.

I'm trying to choose the best way to apply a unique index to the table, so I need to work out which column would give the smallest unique index:

VARCHAR(1023) (url) - data size ~52 bytes, index size - ??? bytes
VARCHAR(64) (SHA2(url, 256)) - data size - 65 bytes, index size - ??? bytes
BINARY(32) (UNHEX(SHA2(url, 256))) - data size - 32 bytes, index size - ??? bytes
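For context, here is how I computed the data sizes (a sketch only, ignoring InnoDB record overhead; it assumes the hex column uses a single-byte charset such as ascii, so its length prefix is 1 byte, while VARCHAR(1023) needs a 2-byte prefix):

```python
import hashlib

url = "/t/some_page_name_" + "x" * 33  # hypothetical 51-char ASCII URL

# VARCHAR(1023): max byte length > 255, so a 2-byte length prefix
varchar_url = len(url.encode("utf-8")) + 2

# VARCHAR(64) holding SHA2(url, 256) as hex: 64 ASCII chars plus a
# 1-byte length prefix (max byte length <= 255 in a 1-byte charset)
hex_digest = hashlib.sha256(url.encode()).hexdigest()
varchar_hex = len(hex_digest) + 1

# BINARY(32) holding UNHEX(SHA2(url, 256)): fixed 32 bytes, no prefix
raw_digest = hashlib.sha256(url.encode()).digest()
binary_raw = len(raw_digest)

print(varchar_url, varchar_hex, binary_raw)  # 53 65 32
```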

Can you help me?
Thanks

Best Answer

First approximation: An InnoDB table or index will be 2-3 times as large as you would expect from adding up the column sizes.

I get 53 bytes for the VARCHAR(1023) column with an average of 51 characters: a VARCHAR with a non-small maximum needs 2 bytes for its length prefix. This also assumes the URLs contain only ASCII characters; with utf8mb4, for example, each Chinese character takes 3-4 bytes. Index size must be computed in bytes, not characters.
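The character-vs-byte distinction is easy to check: MySQL's utf8mb4 stores UTF-8 bytes, so multibyte characters inflate the key size (the URLs below are made up for illustration):

```python
# Character count vs byte count under UTF-8 (what utf8mb4 stores).
ascii_url = "/t/some_page"
cjk_url = "/t/一些页面"  # hypothetical URL containing Chinese characters

print(len(ascii_url), len(ascii_url.encode("utf-8")))  # 12 12
print(len(cjk_url), len(cjk_url.encode("utf-8")))      # 7 15
```

Four CJK characters at 3 bytes each turn a 7-character string into 15 bytes of key.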

For your 3 examples, there won't be a lot of difference, since most of the 2x-3x comes from BTree overhead, padding, row overhead, etc.
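Applying the 2x-3x rule of thumb to the question's 100M rows gives ballpark totals (a back-of-envelope sketch using the per-row byte counts above, not a promise):

```python
ROWS = 100_000_000
OVERHEAD = (2.0, 3.0)  # rule-of-thumb multiplier for BTree/row overhead

candidates = [("VARCHAR(1023) url", 53),
              ("VARCHAR(64) hex sha256", 65),
              ("BINARY(32) raw sha256", 32)]

for name, per_row in candidates:
    lo, hi = (ROWS * per_row * f / 2**30 for f in OVERHEAD)
    print(f"{name}: roughly {lo:.1f} - {hi:.1f} GB")
```

All three land within a factor of two of each other, which is the point: the overhead dominates.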

Another thing that can make a significant difference in index size is whether the rows are inserted in sorted order or not: sorted inserts lead to well-packed BTree blocks; random inserts do not.

Also, your numbers are incomplete. There are three different cases: a MyISAM index, an InnoDB PRIMARY KEY, and an InnoDB secondary key.

For an InnoDB secondary key, the PRIMARY KEY columns are tacked onto each entry, so you need to add whatever size those column(s) are. Typical case: 4 bytes for an INT.
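So a secondary-index entry is roughly key plus PK. A sketch, assuming the typical 4-byte INT primary key:

```python
# Per-entry size of an InnoDB secondary index: the indexed column(s)
# plus a copy of the PRIMARY KEY (assumed here to be a 4-byte INT).
PK_BYTES = 4

entry_bytes = {
    "VARCHAR(1023) url": 53 + PK_BYTES,   # 57
    "VARCHAR(64) hex":   65 + PK_BYTES,   # 69
    "BINARY(32) raw":    32 + PK_BYTES,   # 36
}
print(entry_bytes)
```

BINARY(32) stays the cheapest, at ~36 bytes per entry before BTree overhead.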

Even if you get past all that, there will be anomalies. A 1-row table still allocates 16KB for each secondary index. At a certain table size, the "allocation unit" switches from 16KB to 4MB; after that, adding a row will rarely change the reported index size, but occasionally it will show a big jump.
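That allocation granularity means the reported size moves in steps. A sketch of the rounding, using the unit sizes named above:

```python
import math

def allocated(bytes_needed, unit):
    # File space is claimed in whole allocation units, so even a tiny
    # index occupies at least one full unit.
    return max(1, math.ceil(bytes_needed / unit)) * unit

PAGE = 16 * 1024          # small tables: 16KB pages
EXTENT = 4 * 1024 * 1024  # past a threshold: 4MB chunks

print(allocated(36, PAGE))            # one ~36-byte entry -> a full 16KB
print(allocated(50_000_000, EXTENT))  # ~50MB of entries -> 12 x 4MB
```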

Etc, etc.

An aside... Indexes on SHA2 (or other digests/hashes) are notoriously bad for performance on huge tables. The keys are effectively random, so inserts and lookups hit random blocks, which blows out the cache, which makes reads and writes I/O-bound, which is deadly for huge tables.
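The randomness is easy to see: consecutive, similar URLs hash to completely unrelated keys, so each insert lands in an unrelated part of the BTree (an illustrative sketch with made-up URLs):

```python
import hashlib

# Nearly identical URLs yield scattered digests, so a unique index on
# the hash is written to in effectively random key order.
urls = [f"/t/page_{i}" for i in range(5)]
digests = [hashlib.sha256(u.encode()).digest() for u in urls]

for u, d in zip(urls, digests):
    print(u, d.hex()[:12])  # leading bytes jump around arbitrarily
```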