Mysql – String Pooling for MySQL

innodbMySQLstoragevarchar

So I'm having this table that logs all errors happened in our production code. It looks like this:

id  | date | type         | message       | file         | line
----+------+--------------+---------------+--------------+-----------
INT | DATE | VARCHAR(128) | VARCHAR(4096) | VARCHAR(255) | SMALLINT

Now, in most cases errors will be reported via a simple HTTP-API, issued from the client's server running our software. Naturally, when something happens, it happens often. So most errors get logged between 10 and 200 times, in rare cases up to 1k times.

So I end up with many records having the same value for the message field (and for type and file as well). Since I could not find any details via Google, I'm now wondering whether it would be wise to manually store the strings in another table like this:

hash        | string
------------+----------
VARCHAR(32) | VARCHAR(4096)

Assuming I would hash the values of the three fields with MD5, I could shrink their space usage down to VARCHAR(34), using the scheme

l$I'm a literal value.             for fields < 32 chars
h$md5(field)                       for fields > 32 chars

I did some benchmarking and it seems to save about 15 to 20% from my error table. To handle the concurrency with transaction, my tables use InnoDB. What is the difference between MySQL VARCHAR and TEXT data types? points out that InnoDB will store string values always outside the normal table, but is it pooling or storing each string for itself?

Additionally, I plan on using PostgreSQL, too. rolleyes 🙂

Any suggestions? Is it wise to go the extra mile and do string pooling?

Best Answer

Small starting clarification: the article you linked to on InnoDB text/blob storage is a little out of date with MySQL 5.5, and the barracuda row format. This article is more up to date.

On to your question:

In the schema you have, each row will be less than ~8K, so you can guarantee across both antelope and barracuda row formats that all data will be stored in-line - not requiring external/overflow pages.

If you were to require overflow pages, they are never de-duplicated (which is what I would probably describe your 'pooling' mechanism as). Even worse than they are never de-duplicated, they are never shared...

If you could have a record too big to fit inline (~8K limit), each text/blob that needs to be moved out will take a minimum of a 16K page to itself.