MySQL – BLObs Hitting Performance

blobMySQLmysql-5.6

One of my tables contains seven BLObs. Six of these refer to raw data and are typically 1.0 – 1.5Mb each. The seventh is the result of processing this data and can be nearly equal in size to the sum of the other BLObs but may be significant less if some of the data has been rejected for various reasons. The raw data was originally kept in a directory structure but there were some issues with logical naming of directories and also with someone changing some directory names. Therefore keeping all the data in the database would appear the correct way to do it. The raw data is needed in case a different algorithm is developed and it needs to be reprocessed. The problem is that as information is added the performance is degrading. If a copy is put on another server and the six raw data BLObs are deleted performance improves significantly. Research on the web suggests that the raw data is read even if it is not used in the query. Creating two tables, one for the raw data and the other for the result of processing, seems to be a solution but this would mean repeating some of the meta data which seems bad design. Is there a better way of tacking this?

Best Answer

Do not use ENGINE=MyISAM, that will lead to blobs being costly for any operation on the table. Instead, use InnoDB.

"Vertical partitioning" is generally not needed, because of the way InnoDB effectively does it for you....

With InnoDB, there are 4 ROW_FORMATs; let's try to pick the optimal one. In all cases, some, or all, of each blob is stored away from the row itself. More in a minute.

But first, let's check the queries. Do not use SELECT *; always specify only the necessary columns. This avoids reaching into other blocks to get the blobs.

The ROW_FORMATs are redundant, compact, dynamic, and compressed. For your case, I recommend either DYNAMIC or COMPRESSED. These store none of the blob in the record itself; instead they have a 20-byte pointer to where the blob is stored. Only if you need to touch the blob does it run off to that location.

COMPRESSED is the same as DYNAMIC, except for a complicated block-level compression technique. Most benchmarks I see say you get about 2x compression. This is not as good as you might get by compressing the blobs in the client. For example, most 'text' compresses 3x. But most image formats are already compressed, so compression is a waste.

So, I leave it to you to decide between dynamic and compressed. If in doubt, test with your data and your queries.

You should have innodb_file_per_table=1 when you CREATE the table. You may need BARRACUDA and you do need at least version 5.6; 5.7 would be better.

Two tables, one for raw, one for combined? Maybe. Given the benefits described above, there may not be any benefit in having two table. However, if there is a lot of activity in one column, it might be better to move it to a separate table (plus 1:1 id, etc). (This is especially important for things like "like/unlike" counters.)

Since you are working with bulky things, be careful that you do not have long-running transactions. Such can lead to timeouts, etc. Still, use BEGIN...COMMIT where appropriate.

Related Solutions

Mysql – Why is MySQL order by performance poor within a single partition of a partitioned table

Assuming that you range partition your table by the (auto incrementing) sequence and you limit your query by another value (day number), you might see, that a lot of partitions will be touched when you use:

EXPLAIN PARTITIONS SELECT * FROM table_name WHERE day_number=X ORDER BY...

The reason for that is, that there is no linkage between the partitioning attribute and the data filter. Since MySQL uses a different index for each partition, it needs to open all partitions to find all data matching day_number.

If you know the partition name, maybe by querying INFORMATION_SCHEMA.PARTITIONS before the real select, and you use MySQL-5.6+, you can use partition selection like

SELECT * FROM table_name PARTITION(partition_name) WHERE day_number=X ORDER BY...

But this is just chitchat based on a lot of assumtions. To give you a real answer, you need to post the query plan and more information about table structure - like Phil said before.

MySQL – Relational Schema Problem for Friendship and Messages with Big Data Manipulation

Check this out, and links within (1, 2). InnoDB has a limit of 4 billion tables. There is no compelling reason to use the file system to store data that is better stored in a database.

Take a look here to see that virtually every form of social networking software uses an underlying database - normally MySQL or PostgreSQL (which I would recommend). 2 million rows is very small in todays database terms - no need to have 40.000 tables (1 for each user either) - although you may beed to look at some form of partitioning - take a look at the answer here.

Anyway, with 40.000 users, shouldn't you perhaps be thinking of using more than 1 machine? 1 guy here tried putting > 32.000 schemas onto a MySQL instance - and couldn't (follow links there also). With 40K users, put 20K in a separate schema on each server if you want to - see here also. Implicit assumption - not all of your schemas will be very active simultaneously.

Databases are designed for searching and sorting and aggregating data - file systems aren't - are you going to grep huge files? Not that it's impossible, but your time is probably better used doing other things.

Final suggestion, download some of the Open Source stuff out there and check to see what they've done - keep what you like and throw away the rest.

[EDIT in response to OP's questions]

You asked:

The private messages aren't going to be searched i guess, is it better to keep the file system for this job?

I would still hold them in the database - you never know when you're going to need them - file systems aren't generally designed for lots and lots of small files. Plus, in a file system scenario, how do you store who is replying to whom? By having yet more files? Or by putting in a header? Uggghh! A BLOB for the message with fields for replying_to_message_ID &c. - much easier!

You also asked:

As for the friendship records i will use the database, will InnoDB make the difference regarding the maximum rows and the performance?

See the post by Bill Karwin here. Bill Karwin in training manager with Percona. Anybody with Percona is likely to know a thing or two about MySQL.

People argue about the performance of MyISAM v. InnoDB - Bill Karwin again. MyISAM is old and will soon be deprecated. Don't use unless you have a compelling reason to do so (my guess is that you won't).

Best Answer

Related Solutions

Mysql – Why is MySQL order by performance poor within a single partition of a partitioned table

MySQL – Relational Schema Problem for Friendship and Messages with Big Data Manipulation

Related Question