MySQL Optimization – How to Optimize Storage for a Table

I'm using MySQL to store a basic table in this format:

    id      int(11)          // Auto-incrementing ID
    data    varchar(5120)    // Random input data, compressed by the application, not by MySQL.
    -----------------------
    Row size ø  916 B

That's kind of troubling, because right now I'm adding about 5,000 records each month, and I'd like to do better than averaging out at roughly 1 MB per 1,000 records.
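For concreteness, a minimal sketch of such a table definition (the table name mydata and the engine shown here are placeholders):

    CREATE TABLE mydata (
        id    INT(11)       NOT NULL AUTO_INCREMENT,
        data  VARCHAR(5120) NOT NULL,  -- pre-compressed by the application
        PRIMARY KEY (id)
    ) ENGINE=MyISAM;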

I set it up this way originally because I wanted to be able to capture the larger sets, but they're very infrequent, as you can see from this graph:

[Graph of the size distribution, tabulated below; the original post linked to a chart-API image]

count   n (data length = n × 128 bytes)
+-------+-------------------------------
1       28
1       26
1       24
2       22
8       21
4       20
13      19
12      18
16      17
27      16
43      15
58      14
69      13
114     12
184     11
262     10
399     9
588     8
807     7
1224    6
1245    5
546     4
73      3
9       2
6       1
1       0
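For reference, a distribution like the one above can be produced with a query along these lines (table and column names as assumed earlier):

    -- Bucket each row by its data length in 128-byte units and count the
    -- rows per bucket, largest buckets first.
    SELECT FLOOR(LENGTH(data) / 128) AS n,
           COUNT(*)                  AS count
    FROM mydata
    GROUP BY n
    ORDER BY n DESC;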

My main concern is how much space I've wasted just to accommodate the tail end of that chart, and what happens when I end up with even larger outliers? I'll have to bump the column size up again, even though roughly 80% of the data fits very nicely into a VARCHAR(1024) block, a fifth the size of what I'm using now.

So how should I have built this thing in the first place?

Best Answer

Since it is built already, let's look at what you have. MySQL offers an interesting way to plan column definitions from the data currently present. If the table name is mydata, try running this query:

    SELECT data FROM mydata PROCEDURE ANALYSE();

This will not dump your rows. It examines the column's contents, calculates statistics based on at most 256 distinct values (the default; you can specify a different limit), and recommends the column type the table should use.
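ANALYSE() also takes two optional arguments, max_elements and max_memory (the defaults are 256 and 8192); the values below are only illustrative:

    -- Consider at most 16 distinct values, using at most 2048 bytes of
    -- memory per column, when deciding whether to suggest an ENUM.
    SELECT data FROM mydata PROCEDURE ANALYSE(16, 2048);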

As long as the table is MyISAM, don't worry too much about the row size: the default row format is DYNAMIC, so a VARCHAR occupies only as much space as the data it actually holds. If the table is InnoDB (and I sure hope it is not), please make sure data is not part of the PRIMARY KEY. Your clustered index will grow like pouring undiluted Miracle-Gro in your backyard.
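If you are not sure which engine and row format the table actually uses, you can check (assuming the table is named mydata, as above):

    -- Look up the storage engine and row format for the table.
    SELECT ENGINE, ROW_FORMAT
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = 'mydata';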

You could break the data column into 128-byte chunks, run each chunk through MD5 (32-character output) or SHA1 (40-character output), concatenate those digests, and store the result. With MD5, that would save about 75% of the storage. Have fun trying to code that breakup; this was just a rambling suggestion off the top of my head.
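To illustrate the idea only (note the digests are one-way, so the original data could not be recovered from what you store), here is a sketch with a fixed number of chunks; variable-length data would need application-side looping:

    -- Hash up to three 128-byte chunks and concatenate the hex digests.
    -- MD5(NULL) is NULL and CONCAT_WS() skips NULLs, so chunks beyond the
    -- end of the data simply drop out.
    SELECT CONCAT_WS('',
        MD5(NULLIF(SUBSTRING(data,   1, 128), '')),
        MD5(NULLIF(SUBSTRING(data, 129, 128), '')),
        MD5(NULLIF(SUBSTRING(data, 257, 128), ''))
    ) AS chunk_digests
    FROM mydata;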

You may want to consider using Sphinx Indexing as well.