Mysql – estimate the row count of a large thesql table using the disk space

MySQL

Sysadmin here (not a developer by trade), so go easy please 😉

I have a mysql table, on a linux server that uses innodb_file_per_table.

When I try to do select count(foo) from table; It doesn't finish. I gave it a good 5 minutes before interrupting.

The table.ibd file is about 43G (45176848384). Since the count() seemed like it was never going to finish, I decided to try the following to get an estimate of the row count:

I dumped the first 10,000 rows of the table and imported it to a local database that also has innodb_file_per_table and I hope all else equal.

That file, after I imported it, (with only 10,000 rows) is about 2M (2097152)

So:

45176848384/2097152
21542
21542*10000
215420000

Then, is it correct to assume that the ~43G table is around 215 million rows? Are there other considerations I'm overlooking?

Or is there a better way for me to determine how many rows are in this table?

Best Answer

You cannot make that kind of approximation work if the InnoDB table experiences DELETE and UPDATEs. With innodb_file_per_table enabled and a table called mydb.mytable, this is what you get

  • /var/lib/mysql/mydb/mytables.frm : Table Format Info
  • /var/lib/mysql/mydb/mytables.ibd : This file contains data pages and index pages making up the table and its indexes (Each page is 16KB).
    • There will be fragmentation amongst the data pages due to DELETEs and UPDATEs
    • There will be fragmentation amongst the index pages due to BTREE operations (node balancing, node splits, etc).

Even if you defragment the InnoDB with OPTIMIZE TABLE, there is no way of knowing if a single row occupies two or more pages.

ALTERNATIVE APPROACH

You could probably force the index statistics to give you a round figure. You could just run

SHOW INDEXES FROM mydb.mytable;

Look at the Cardinality column of the PRIMARY KEY. That will give you a very close guess at the rowcount. However, if you run SHOW INDEXES FROM mydb.mytable; multiple times on a large table, you get a different number each time. That's because InnoDB will scan the BTREE nodes and give a roundabout number. If you want to go by the Cardinality of the PRIMARY KEY as the rowcount, you should run this

SET GLOBAL innodb_stats_on_metadata = 0;

before running SHOW INDEXES FROM mydb.mytable;

Believe me, this is a much better guess that dumping a subset of the data and guessing from the filesize.