I already answered this on StackOverflow: https://stackoverflow.com/a/11636341/491757
If you want to defrag all your MyISAM tables, here is a shell script to do so...
#!/bin/bash
# Connection settings for the mysql client
MYSQL_USER=root
MYSQL_PASS=rootpassword
MYSQL_CONN="-u${MYSQL_USER} -p${MYSQL_PASS}"
# Build a query that emits one OPTIMIZE TABLE statement per MyISAM table,
# skipping the system schemas
SQL="SELECT CONCAT('OPTIMIZE TABLE ',table_schema,'.',table_name,';') "
SQL="${SQL} FROM information_schema.tables "
SQL="${SQL} WHERE engine='MyISAM' AND table_schema NOT IN "
SQL="${SQL} ('information_schema','performance_schema','mysql')"
# -A skips the tab-completion table scan, -N suppresses column headers, -e runs the query
mysql ${MYSQL_CONN} -ANe"${SQL}" > GlobalMyISAMOptimizeTable.sql
less GlobalMyISAMOptimizeTable.sql
Once you have visually verified the generated SQL, just run it
mysql ${MYSQL_CONN} < GlobalMyISAMOptimizeTable.sql
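If you do not need to limit the work to MyISAM tables, mysqlcheck offers a simpler (though less selective) alternative; its --optimize option issues OPTIMIZE TABLE for you:
# Optimizes every table in every database, regardless of storage engine
mysqlcheck ${MYSQL_CONN} --optimize --all-databases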
Give it a Try !!!
Let's break this question up into a few parts.
Q: I need to insert 1mm rows a day. Is that a lot?
Not really. 1mm divided by 24 hours divided by 60 minutes divided by 60 seconds gives you about 12 inserts per second. For a rough frame of reference, it's not unusual to see 1,000 inserts per second on typical commodity servers with no tuning.
Granted, your load won't be perfectly averaged out like that - you'll have bursts of load - but I wouldn't make database platform decisions based on less than 10k-20k inserts per second. Any platform out there will work fairly well.
Q: How should I structure the data?
Zoom out - don't think table, think databases. If you're going to be keeping this data permanently, and it's truly insert-only with no updates, then you probably want to start a new database for each time period. Your inserts may only go into one table in one database, but every year, create a new database (MyApp_2015) and seal the old 2014 data as read-only. You can stop backing it up (as long as you've still got one good final backup), stop doing index maintenance, statistics updates, etc.
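A minimal sketch of that yearly rollover, assuming hypothetical database names (MyApp_2014, MyApp_2015) and an application account (app_user) whose privileges were granted at the database level:
# Start the new year's database
mysql -u root -p -e "CREATE DATABASE MyApp_2015;"
# Seal last year's data: the app can still read it, but never write to it
mysql -u root -p -e "REVOKE INSERT, UPDATE, DELETE ON MyApp_2014.* FROM 'app_user'@'%';"
mysql -u root -p -e "GRANT SELECT ON MyApp_2014.* TO 'app_user'@'%';"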
The PHP will only ever have to know about the current database for inserts, making your design a lot easier. The archival process becomes a DBA task much later down the road as long as you go in knowing that there will be more than one database involved.
If you were doing more than 1,000 inserts per second sustained, and you wanted easier performance management, then I'd also suggest building sharding into the initial design regardless of the database platform. Don't get me wrong, any modern database can handle over 1,000 inserts per second, but designing sharding in now just gives you more flexibility later on. At 12 inserts per second, it's just not worth the design/testing hassle.
Q: How should I do reporting?
In an ideal world, reports would not be done against the live server. Run the reports against a restored or replicated copy of the database. This does two things: it reduces load on the live server, and it validates your backups, guaranteeing that you've got your valuable data elsewhere.
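A minimal sketch of the restore approach, with hypothetical host and database names (live-db, report-db, MyApp_2015) - refreshing the reporting copy from the nightly dump both offloads the reports and proves that the backup actually restores:
# Dump from the live server, compressed on the fly
mysqldump -h live-db -u root -p --single-transaction --databases MyApp_2015 | gzip > /backups/MyApp_2015.sql.gz
# Load it onto a separate reporting server (the dump includes CREATE DATABASE)
gunzip < /backups/MyApp_2015.sql.gz | mysql -h report-db -u root -p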
Best Answer
You are backing up the raw files, not a "dump" of the rows (via mysqldump or Xtrabackup), correct?
When you DELETE, the rows are held onto in case of a ROLLBACK (explicit or due to a crash). This may increase the file size.
Data (and index) files never shrink in MySQL without explicit action. In particular, there is no way to shrink ibdata1 without dumping and reloading.
So...
Plan A: Use mysqldump or xtrabackup to extract the rows. Then, if you like, send that through tar and gzip (or whatever).
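For instance, a sketch of Plan A with mysqldump piped straight into gzip (the output path is an assumption):
# A consistent logical dump of everything, compressed on the fly
mysqldump -u root -p --single-transaction --all-databases | gzip > /backups/full-$(date +%F).sql.gz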
Plan B: If the tables are "file_per_table", OPTIMIZE TABLE to shrink the .ibd files before the next dump. (You are stuck with the bloated ibdata1.)
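To check whether Plan B applies, and to rebuild one table, something like the following (mydb.mytable is a placeholder name):
# ON means each InnoDB table gets its own .ibd file
mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_file_per_table';"
# Rebuilds the table and returns the freed space to the filesystem
mysql -u root -p -e "OPTIMIZE TABLE mydb.mytable;"
Note that innodb_file_per_table only affects tables created while it is ON; existing tables keep the layout they were created with.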
Plan C: If not file_per_table, that is, if all the data is in ibdata1, do a one-time dump, rm ibdata1, reload. This is somewhat tricky and somewhat risky. Test before risking the production data.
If you will regularly be purging old data, consider using Partitioning as an efficient way to purge old data. If you intend to keep only 3 months' worth, then partition by week (14 partitions). Details here. It involves DROP PARTITION, which will not leave behind stuff like DELETE does.
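A minimal sketch of the weekly purge, assuming a RANGE-partitioned table with one partition per week (mydb.mytable and p_week01 are hypothetical names):
# Drops the oldest week instantly; unlike DELETE, it leaves no bloat behind
mysql -u root -p -e "ALTER TABLE mydb.mytable DROP PARTITION p_week01;"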
As for why the "steep increase" did not continue to happen... The deleted rows left "free" space in the tablespace; that was being reused. For now, your dumps will continue to be about the same size. (Note: gzip, alone, will fluctuate in size as the data changes, even if the input size seems to stay the same. This is a characteristic of compression algorithms.)