MySQL – Avoid foreign keys on an 800-GB database

database-design, MySQL

This question is for the experienced developers out there.

My friend recently told me that his employer has a legacy code base whose MySQL database has now grown to around 800 GB in size. What's interesting is that it has very few foreign key constraints, which has resulted in some duplicated and bad data, but since that (apparently) doesn't harm anything, nobody is overly concerned.

Now I can't imagine life without foreign keys, so when I asked him why there were no such checks, he replied by saying that a foreign key is another index that the database has to maintain, and takes more space.

Yes, space. So here's my question: Just how much more space can an index take? 10% of the DB? 20%? Even if his reasoning was wrong, there's no denying that foreign keys entail a performance penalty. Would you, as a developer, let go of foreign keys and instead rely on code to do proper housekeeping once the DB approached 1 TB in size?

Best Answer

...a foreign key is another index that the database has to maintain

I think your colleague may be laboring under the misconception that columns containing foreign keys must be indexed, which isn't always the case. (See ypercube's comment below.) Foreign keys are nothing more than values to which the schema gives special meaning. The decision to index such a column is the same as for any other: if it will frequently be the subject of selection criteria (i.e., part of a WHERE clause), index it according to your DBMS's preferred practices.
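A minimal sketch of the distinction, with hypothetical table and column names:

```sql
CREATE TABLE orders (
    id      BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    -- The constraint gives user_id special meaning; it is not itself
    -- an index. Whether an index follows is engine-specific: MySQL's
    -- InnoDB creates one on the referencing column automatically if
    -- none exists, but other DBMSs leave that decision to you, and
    -- the criteria are the same as for any other column.
    CONSTRAINT fk_orders_user
        FOREIGN KEY (user_id) REFERENCES users (id)
);
```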

Indexes do come into play on the parent table's primary key column, because that column meets the criteria described above. More on that in a minute.

...and takes more space.

Almost everything in this field is a space/time trade-off, and indexes are no exception: you spend more storage at write time to spend less time at read time. It would be very hard to argue that adding an index to a table doesn't consume additional space, because it does. But making those kinds of decisions without the context of the whole schema is failing to see the forest for the trees.
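You can watch the trade-off play out in MySQL itself by comparing query plans before and after adding an index (the orders table here is hypothetical):

```sql
-- Before: EXPLAIN typically reports a full table scan (type: ALL).
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- The index costs storage, plus extra work on every write that
-- touches the column...
ALTER TABLE orders ADD INDEX idx_orders_customer (customer_id);

-- ...and buys a cheap lookup at read time (type: ref).
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```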

As Codd described it, normalization was a very early application of what would become the DRY principle. Non-normalized data is inherently repetitive; normalized data is not. References in child tables are (necessarily) repetitive, but have the benefit of being smaller than the values they replace. (Normalization is more about preserving integrity than saving space, but that's another discussion.)

For example, a million copies of the 24-byte string 1600 Pennsylvania Avenue stored in a million rows of a non-normalized table occupy 24 MB. Move that string to a separate table of addresses (where it's stored exactly once), refer to it a million times using an eight-byte numeric foreign key, and the whole mess occupies 8 MB plus 24 bytes for the actual string. That's a space savings of about 67%.
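In schema terms, the move looks something like this (table and column names are hypothetical):

```sql
CREATE TABLE addresses (
    id     BIGINT PRIMARY KEY,      -- the eight-byte key
    street VARCHAR(255) NOT NULL    -- '1600 Pennsylvania Avenue', stored once
);

CREATE TABLE residents (
    id         BIGINT PRIMARY KEY,
    address_id BIGINT NOT NULL,     -- 8 bytes per row instead of 24
    CONSTRAINT fk_residents_address
        FOREIGN KEY (address_id) REFERENCES addresses (id)
);

-- A million residents rows: 1,000,000 x 8 bytes = 8 MB for the
-- references, plus the 24-byte string stored exactly once, versus
-- roughly 24 MB non-normalized.
```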

This foreign-key relationship performs better when the parent table of addresses is indexed by its primary key (which, being a primary key, it will be). That index is much smaller, because the N:1 nature of normalization means there are, on average, far fewer parent rows to index, and the index needs updating less often.

It's hard to say without seeing the schema, but a lack of normalization may be the reason the database grew to 800 GB in the first place.

Just how much more space can an index take?

That's a hard question to answer in general terms. The idea behind most forms of indexing is to let you traverse a small amount of data to locate something in a large corpus, so a side effect is that the data structures representing the index tend to end up much smaller than the data they index. That said, a poorly-designed or poorly-selected mechanism combined with pathologically bad data can produce indexes that are much larger than the original data. The opposite, more usual case is that the index is a small fraction of it. I've done a lot of work with full-text indexers and find they tend to chew up less than 5% of the original data, but that's just my use cases and may not apply to anyone else's.
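If you want a number for your own MySQL database rather than a generality, information_schema will give you an estimate per table (the schema name below is a placeholder):

```sql
-- DATA_LENGTH and INDEX_LENGTH are real information_schema columns;
-- for InnoDB they are estimates, not exact byte counts.
SELECT table_name,
       data_length,
       index_length,
       ROUND(index_length / data_length, 2) AS index_to_data_ratio
FROM information_schema.tables
WHERE table_schema = 'your_database'   -- placeholder
ORDER BY index_length DESC;
```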