MySQL – Managing TEXT Disk Space

innodbMySQL

I have a plain text file with 3 million articles divided with \n.
The file is 58G

When I'm adding the articles to an InnoDB table one per row

  • INT id
  • TEXT article_content

the .ibd file of this table (innodb_file_per_table=1) takes up 98G

Can someone explain if such increase in disk space of a table comparing to a text file is normal ? Is there a way to store such content more efficiently ?

Best Answer

If the table is that simple (INT and TEXT), you cannot change the design any more.

Suppose your table is called mydb.myarticles. If you run this TableSize Query,

SELECT
    FORMAT(data_length,2)  / POWER(1024,1) DataPagesKB,
    FORMAT(index_length,2) / POWER(1024,1) IndexPagesKB,
    FORMAT(data_length,2)  / POWER(1024,2) DataPagesMB,
    FORMAT(index_length,2) / POWER(1024,2) IndexPagesMB,
    FORMAT(data_length,2)  / POWER(1024,3) DataPagesGB,
    FORMAT(index_length,2) / POWER(1024,3) IndexPagesGB,
    (data_length+index_length) DataAndIndexesInBytes
FROM information_schema.tables
WHERE table_schema='mydb' and table_name='myarticles'\G

You will see how much space the data pages are taking up. If you feel DataPagesGB is significantly lower than 98GB, then run:

OPTIMIZE TABLE mydb.myarticles;

which will actually execute

ALTER TABLE mydb.myarticles ENGINE=InnoDB;
ANALYZE TABLE mydb.myarticles;

This will shrink the table as much as possible. Then, run the TableSize Query again. Chances are, you may not get much shrinkage.

Keep this in mind: The default size of an InnoDB Page is 16KB (InnoDB_page_size). Any TEXT fields over 16KB will cause InnoDB to split the row up into two or more pages. In other words

  • Text field > 32 KB may need up to 3 pages for the one row
  • Text field > 48 KB may need up to 4 pages for the one row
  • ...
  • Text field > 16n KB may need up to (n+1) pages for the one row

At least one of those pages will have lots of wasted space.

CAVEAT

I would strongly advise against changing the InnoDB_page_size for compression purposes because I wrote a post back on Mar 02, 2012 on how InnoDB creates additional wasted space in the Buffer Pool changing to the Barracuda Format, requiring a much larger Buffer Pool. This would be in addition to the chaining of the increased number of InnoDB pages.

Your Original Question

Can someone explain if such increase in disk space of a table comparing to a text file is normal ? Is there a way to store such content more efficiently ?

Yes, it is normal for such an increase especially if many rows have TEXT values bigger than 16KB. If you add any additional indexes (such a FULLTEXT index or a prefix index), you will bloat the Index Pages.

You can estimate the bookkeeping info by subtracting DataAndIndexesInBytes from the size of the .ibd and dividing that difference by 3 million. That answer will give you the bookkeeping info needed per row.

SUGGESTIONS

  • Use a different character set and collation (One byte per character)
  • Limit the characters of the TEXT fields to the most commonly used characters
  • Avoid using multibyte characters
  • Use TEXT field lengths less than 16KB