MySQL – How to Store Numeric Value as External Key for Performance

Tags: index, mariadb, MySQL, performance, storage

This is something I was pondering recently while trying to determine how to store certain values. The best widespread example I can think of is the Federal Employer Identification Number, more commonly known as the tax ID, or the social security number for companies.

For various business / tax / legal reasons we need to store this number for some of our clients. It is a numeric 9-digit field, but it doesn't change and there are no calculations to be done with it; it is an external reference field that we have no control over. So should it be stored as char(9) or int(9)? What are the benefits of storing it one way or the other? What's the most efficient way, in terms of space usage (I'm assuming int) and in terms of performance (speed of lookups), assuming that we will create a unique index on this column either way?

For the purpose of this exercise the DB doesn't matter, but if you want to elaborate, I use a MariaDB cluster (MySQL).

Best Answer

Before starting, please note that external keys usually make bad primary keys: they are larger than needed, they can change, and some people may not have one or may have duplicated ones (even if legally that shouldn't happen).

Strictly speaking, the EIN is a code of 9 digits. As such, the technically correct type is a string. In particular, for MySQL (yes, the database and storage engine matter a lot, although not so much between MySQL and MariaDB), that should be (I'm assuming it is a NOT NULL field) a:

CHAR(9) charset ascii NOT NULL

That should take only 9 bytes of fixed space in most engines. Please remember to use ascii and not another charset, as a multi-byte charset such as utf8 or utf8mb4 may reserve up to 3 or 4 bytes per character.

The way to decide whether something that looks like a number should logically be an integer or a string is to check whether it will be used in arithmetic operations and how it must be ordered. If someone asks you to order the EINs, '000000009' should come before '100000000', so the leading zeros are important.
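To make the ordering point concrete, here is a minimal Python sketch (not MySQL itself) using made-up EIN values. It suggests that comparing the codes as strings only matches numeric order when every value is zero-padded to the same width, and that an integer representation has to restore the zeros on display.

```python
# Hypothetical EIN values, used only for illustration.
padded = ["100000000", "000000009", "012345678"]
unpadded = ["100000000", "9", "12345678"]

# Fixed-width, zero-padded strings sort in the same order as the numbers.
sorted_padded = sorted(padded)

# Without padding, strings are compared character by character,
# so "100000000" sorts before "12345678", and "9" comes last.
sorted_unpadded = sorted(unpadded)

# Converting to an integer drops the leading zeros; they have to be
# re-added when the value is displayed.
restored = f"{int('000000009'):09d}"

print(sorted_padded)
print(sorted_unpadded)
print(restored)
```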

Having said that, using something like

int(9) UNSIGNED ZEROFILL NOT NULL

would take less space (4 bytes), and the command-line client will still show the leading zeros:

mysql> SELECT * FROM ein;
+-----------+
| ein       |
+-----------+
| 123456789 |
| 000000001 |
+-----------+

One common misunderstanding of int(NUMBER) is that it limits the integer to at most 9 digits. That is incorrect: the number is only a display width. An int always takes 4 bytes and, in the case of an unsigned int, it accepts values from 0 to 4,294,967,295. Most applications ignore the display width anyway.
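As a quick check of those numbers, this small Python sketch computes the range of a 4-byte unsigned integer and compares it with the largest 9-digit value. The constant names are illustrative only.

```python
# A 4-byte unsigned integer has 32 bits.
UINT32_MAX = 2**32 - 1        # largest value an unsigned int column can hold
MAX_9_DIGITS = 10**9 - 1      # largest value that fits in 9 digits

# A 10-digit number still fits in 4 bytes, so the (9) in int(9)
# cannot be acting as a range limit.
ten_digit_value = 1_234_567_890
fits_in_uint32 = ten_digit_value <= UINT32_MAX
fits_in_9_digits = ten_digit_value <= MAX_9_DIGITS

print(UINT32_MAX, fits_in_uint32, fits_in_9_digits)
```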

Less space means fewer bytes read from and written to disk, and more free memory. And even if 4 versus 9 bytes may not seem like a lot, with 200 million records that is 1 GB less data (more if you take indexes into account). If you do not have a large number of records, that may not affect you much, although depending on the operations performed (which may involve converting the value to a character string), both approaches may have their downsides.
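The 1 GB figure can be reproduced with a back-of-the-envelope calculation. This Python sketch only counts the column itself and ignores per-row overhead and index copies, so treat it as a rough lower bound rather than a measurement.

```python
ROWS = 200_000_000
CHAR9_ASCII_BYTES = 9   # CHAR(9) with the ascii charset
INT_BYTES = 4           # INT, regardless of the display width

# Bytes saved by the column alone when choosing INT over CHAR(9).
saved_bytes = ROWS * (CHAR9_ASCII_BYTES - INT_BYTES)
saved_gb = saved_bytes / 10**9

print(saved_bytes, saved_gb)
```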

There is one last option, probably much worse in performance, but one that could help with consistency. MySQL has the NUMERIC/DECIMAL datatype, so you could also define it as:

ein decimal(9,0) zerofill unsigned not null

That should take only 4 bytes in MySQL, but it disallows both non-numeric characters (which char allows) and numbers greater than 999,999,999 (which int allows) or less than 0, so it may be a bit less error-prone. However, fixed-point arithmetic usually takes more CPU cycles than simple integer arithmetic.
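If the char option is chosen, the "non-number characters" gap can also be covered in the application. A minimal Python sketch, assuming the value arrives as a string without separators; `EIN_PATTERN` and `is_valid_ein` are hypothetical names, not part of MySQL or of any library.

```python
import re

# Exactly nine ASCII digits. [0-9] is used instead of \d so that
# non-ASCII digit characters are rejected as well.
EIN_PATTERN = re.compile(r"[0-9]{9}")

def is_valid_ein(code: str) -> bool:
    """Hypothetical helper: True if code is exactly nine ASCII digits."""
    return EIN_PATTERN.fullmatch(code) is not None

print(is_valid_ein("000000001"))
print(is_valid_ein("12-345678"))
```

Leading zeros pass the check, while values that are too short, too long, or contain separators do not.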

In general, performance issues can only be checked once you have the whole system running. Sometimes over-optimizing is a mistake if there are not that many records and it makes working with them more complicated. Stick with the easiest option at first and optimize later.