MySQL – Avoid foreign keys on an 800-GB database

database-design, MySQL

This question is for the experienced developers out there.

My friend recently told me that his employer has a legacy code base whose MySQL database has now grown to around 800 GB in size. What's interesting is that it has very few foreign key constraints, which has resulted in some duplicated and bad data, but since that (apparently) doesn't harm anything, nobody is overly concerned.

Now I can't imagine life without foreign keys, so when I asked him why there were no such checks, he replied by saying that a foreign key is another index that the database has to maintain, and takes more space.

Yes, space. So here's my question: Just how much more space can an index take? 10% of the DB? 20%? Even if his reasoning was wrong, there's no denying that foreign keys entail a performance penalty. Would you, as a developer, let go of foreign keys and instead rely on code to do proper housekeeping once the DB approached 1 TB in size?

Best Answer

...a foreign key is another index that the database has to maintain

I think your colleague may be laboring under the misconception that columns containing foreign keys must be indexed, which isn't always the case. (See ypercube's comment below.) Foreign keys are nothing more than values to which the schema gives special meaning. The decision to index such a column is the same as for any other: if it will frequently be the subject of selection criteria (i.e., part of a WHERE clause), index it according to your DBMS's preferred practices.
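A minimal sketch of the distinction, with hypothetical table and column names:

```sql
CREATE TABLE orders (
    id      BIGINT PRIMARY KEY,
    user_id BIGINT NOT NULL,
    -- The constraint gives user_id special meaning; it is not itself
    -- an index. Whether an index follows is engine-specific: MySQL's
    -- InnoDB creates one on the referencing column automatically if
    -- none exists, but other DBMSs leave that decision to you, and
    -- the criteria are the same as for any other column.
    CONSTRAINT fk_orders_user
        FOREIGN KEY (user_id) REFERENCES users (id)
);
```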

Indexes do come into play on the parent table's primary key column, because that column meets the criteria described above. More on that in a minute.

...and takes more space.

Almost everything in this field is a space/time trade-off, and indexes are no exception: you spend more storage at write time to spend less time at read time. It would be very hard to argue that adding an index to a table doesn't consume additional space, because it does. But making those kinds of decisions without the context of the whole schema is failing to see the forest for the trees.
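You can watch the trade-off play out in MySQL itself by comparing query plans before and after adding an index (the orders table here is hypothetical):

```sql
-- Before: EXPLAIN typically reports a full table scan (type: ALL).
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- The index costs storage, plus extra work on every write that
-- touches the column...
ALTER TABLE orders ADD INDEX idx_orders_customer (customer_id);

-- ...and buys a cheap lookup at read time (type: ref).
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```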

As Codd described it, normalization was a very early application of what would become the DRY principle. Non-normalized data is inherently repetitive; normalized data is not. References in child tables are (necessarily) repetitive, but have the benefit of being smaller than the values they replace. (Normalization is more about preserving integrity than saving space, but that's another discussion.)

For example, a million copies of the 24-byte string 1600 Pennsylvania Avenue stored in a million rows of a non-normalized table occupy 24 MB. Move that string to a separate table of addresses (where it's stored exactly once), refer to it a million times using an eight-byte numeric foreign key, and the whole mess occupies 8 MB plus 24 bytes for the actual string. That's a space savings of about 67%.
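In schema terms, the move looks something like this (table and column names are hypothetical):

```sql
CREATE TABLE addresses (
    id     BIGINT PRIMARY KEY,      -- the eight-byte key
    street VARCHAR(255) NOT NULL    -- '1600 Pennsylvania Avenue', stored once
);

CREATE TABLE residents (
    id         BIGINT PRIMARY KEY,
    address_id BIGINT NOT NULL,     -- 8 bytes per row instead of 24
    CONSTRAINT fk_residents_address
        FOREIGN KEY (address_id) REFERENCES addresses (id)
);

-- A million residents rows: 1,000,000 x 8 bytes = 8 MB for the
-- references, plus the 24-byte string stored exactly once, versus
-- roughly 24 MB non-normalized.
```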

This foreign-key relationship performs better when the parent table of addresses is indexed by its primary key (which, being a primary key, it will be). That index is much smaller, because the N:1 nature of normalization means there are, on average, far fewer parent rows to index, and the index needs updating less often.

It's hard to say without seeing the schema, but a lack of normalization may be the reason the database grew to 800 GB in the first place.

Just how much more space can an index take?

That's a hard question to answer in general terms. The idea behind most forms of indexing is to let you traverse a small amount of data to locate something in a large corpus, so a side effect is that the data structures representing the index tend to end up much smaller than the data they index. That said, a poorly-designed or poorly-selected mechanism combined with pathologically bad data can produce indexes that are much larger than the original data. The opposite, more usual case is that the index is a small fraction of it. I've done a lot of work with full-text indexers and find they tend to chew up less than 5% of the original data, but that's just my use cases and may not apply to anyone else's.
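If you want a number for your own MySQL database rather than a generality, information_schema will give you an estimate per table (the schema name below is a placeholder):

```sql
-- DATA_LENGTH and INDEX_LENGTH are real information_schema columns;
-- for InnoDB they are estimates, not exact byte counts.
SELECT table_name,
       data_length,
       index_length,
       ROUND(index_length / data_length, 2) AS index_to_data_ratio
FROM information_schema.tables
WHERE table_schema = 'your_database'   -- placeholder
ORDER BY index_length DESC;
```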