Mysql – What data is duplicated when MySQL/MariaDB BLOB columns are copied

blobbulkcopyinsertmariadbMySQL

Let table_1 be created as follows:

CREATE TABLE table_1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    some_blob BLOB
);

Let table_2 be created as follows:

CREATE TABLE table_2 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    some_blob BLOB
);

What I want to know is, after I run this table-copying query

INSERT INTO table_2 (id, some_blob) SELECT id, some_blob FROM table_1;

will the actual text within each some_blob field of the table_1 table be duplicated and stored on disk, or will the DB have only duplicated pointers to the disk locations containing the BLOB data?

One argument for why BLOB copying must involve the duplication of actual content reasons as follows:

Duplication of BLOB content is necessary because changes to BLOB data in table_1 should not also take place in table_2. If only the disk pointers were duplicated then content changes in one table would be reflected in the other table, which violates the properties of a correct copy operation.

Now I present an alternative method that the DB could implement to satisfy this copy operation. This alternative shows the above argument is not necessarily true. The DB could only duplicate disk pointers during the execution of the given INSERT statement, then whenever an UPDATE occurs which seeks to modify the BLOB data in one of the tables, the DB would only then allocate more space on disk to store the new data which is part of the UPDATE query. A BLOB data segment then is only deleted when there no longer exists any disk pointers to it, and a particular BLOB data segment could potentially have many disk pointers pointing to it.

So which of these strategies does MySQL/MariaDB use when executing the given INSERT statement, or does it use a different strategy?

Why I am asking this question

Currently I am running a couple of UPDATE queries which are copying large amounts of BLOB data from one table to another in the same database (over 10 million rows of BLOB data). The queries have been running for a while. I am curious about whether the performance is so slow because some of the columns I am comparing are poorly indexed, because these queries are literally copying over the content instead of disk pointers, or perhaps because of both of these reasons.

I use an INSERT in the question's example because this simplifies the concept of database internals that I am trying to understand.

Best Answer

Separate copy.

Copying just a disk pointer would necessitate "reference counting", locks in places where there are none now, etc.

You could implement such by keeping the BLOBs in a separate table and somehow link to them. You could use MD5 (or other digest) for de-dupping. However, I suspect the benefit would be outweighed by the complexity and potential for bugs.

"for a long time" -- If there are many rows being copied in the same transaction, note that there is extra effort to prepare for ROLLBACK in the case of a crash.