MySQL – Analyzing what needs to be archived


I'm not a DBA, I'm a DevOps guy trying to manage a MySQL database which is ever growing and desperately needs some archiving. My question is not how to archive. I think I have a pretty good idea and have SO for anything I might not know yet.

I'm trying to figure out "what" to archive. I don't have a clue what our apps need or how often they access the records. I can tell you our DB has 410 tables in it, and our largest table has 4 million records. I don't have the time to go through each and every one of them and ask how much it gets used or accessed, etc. I'm trying to trim down the size of our development environment databases: they're slow to refresh, and our RDS bill is massive. I know for certain our development environments do not need all this data. We also run a sanitizing script which rips out PII and other real-world sensitive information, and I don't think it's necessary to sanitize 8+ years of records just for a development environment database.

Goal: have a smaller development environment database to A) trim down the bill and B) reduce data refresh time.

Question: What are common things you look at when you are making decisions on what to archive, and how much to archive? Are there tools that help you identify good candidates? Are there common queries that you run to help you make this decision?

Best Answer

Typically these kinds of decisions are made together with product managers. They are the ones who can say things like "customers should be able to read information up to 2 years old, but not necessarily anything older than that".

MySQL (or any other DBMS, as far as I know) doesn't provide a way to find out when a row was last read.

But there are ways to find out related information. None of it will solve your problem by itself, but hopefully you'll find some useful indications.

  • Some tables have a column with the timestamp of the last update. Usually the default is CURRENT_TIMESTAMP (often with ON UPDATE CURRENT_TIMESTAMP), but check the column definition: if it has no default and isn't maintained automatically, I wouldn't consider it reliable, because you can't be sure that the application always updates it. The first sketch after this list shows one way to find such columns.
  • You can temporarily add AFTER UPDATE and AFTER DELETE triggers. The triggers can write the value of the primary key into a separate table, using INSERT IGNORE. Keep them going for some time, and then check which rows were updated/deleted and how old the oldest of them is. Writing such triggers by hand for 410 tables is impractical, but generating them can easily be automated (see the sketch after this list).
  • If you use Percona Server or MariaDB, you can enable the user statistics (userstat) feature. It tells you how many times each table and index is read/written. If you're lucky, you'll find out that some tables are never read and can be archived. Even if not, the number of rows read per hour could give you a hint (or not... this is highly dependent on your database logic, so I can't give more advice on this). A sketch of how to query it follows the list.
  • If you suspect that a portion of the rows is never read, try a trick: read them with the same SELECT three times. If the first query is noticeably slower than the second and third, those rows were not cached. Repeat the experiment on different days. It's a very empirical, unreliable test, so use it as a hint and nothing more (example below).
  • If the slow log is enabled, it logs the queries that take more than long_query_time seconds. You can set long_query_time=0 to log all queries (this is good practice for many reasons; see the settings sketched below). You can then grep for all queries that mention a certain table, script running them, and record the oldest row that gets read. Careful: replay only the SELECTs, not the INSERTs, UPDATEs, etc.
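
For the first point, here is a minimal sketch of a query that lists candidate "last updated" columns across a schema, so you can see which tables have one and whether it is filled in automatically. The schema name mydb is a placeholder.

    -- List timestamp/datetime columns, their defaults, and whether they are
    -- maintained automatically (EXTRA shows e.g. 'on update CURRENT_TIMESTAMP').
    SELECT TABLE_NAME,
           COLUMN_NAME,
           COLUMN_DEFAULT,
           EXTRA
    FROM   information_schema.COLUMNS
    WHERE  TABLE_SCHEMA = 'mydb'               -- placeholder schema name
      AND  DATA_TYPE IN ('timestamp', 'datetime')
    ORDER BY TABLE_NAME, COLUMN_NAME;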
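
For the trigger idea, a sketch for a single hypothetical table orders with an integer primary key id. The touched_rows table and the trigger names are made up for illustration; generating this DDL for every table is the part you would automate (information_schema.KEY_COLUMN_USAGE gives you each table's primary key columns).

    -- One shared tracking table; INSERT IGNORE keeps one row per touched PK.
    CREATE TABLE IF NOT EXISTS touched_rows (
      table_name VARCHAR(64) NOT NULL,
      pk_value   BIGINT      NOT NULL,
      PRIMARY KEY (table_name, pk_value)
    );

    -- Record the primary key of every updated row...
    CREATE TRIGGER orders_touch_upd AFTER UPDATE ON orders
    FOR EACH ROW
      INSERT IGNORE INTO touched_rows (table_name, pk_value)
      VALUES ('orders', OLD.id);

    -- ...and of every deleted row.
    CREATE TRIGGER orders_touch_del AFTER DELETE ON orders
    FOR EACH ROW
      INSERT IGNORE INTO touched_rows (table_name, pk_value)
      VALUES ('orders', OLD.id);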
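
For the user statistics feature, the sketch below assumes Percona Server or MariaDB, where the server variable is called userstat and the counters are exposed through information_schema.TABLE_STATISTICS; double-check the names against the documentation for your version.

    -- Start collecting per-table read/write counters (off by default).
    SET GLOBAL userstat = 1;

    -- After letting it run for a representative period, tables whose
    -- ROWS_READ stays at 0 are good archiving candidates.
    SELECT TABLE_SCHEMA, TABLE_NAME, ROWS_READ, ROWS_CHANGED
    FROM   information_schema.TABLE_STATISTICS
    ORDER BY ROWS_READ ASC;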
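
For the caching trick, a rough illustration on a hypothetical orders table with a created_at column. SQL_NO_CACHE only matters on MySQL 5.x, where it stops the query cache from answering instantly, so that the timing differences you see reflect the InnoDB buffer pool.

    -- Run this three times and compare wall-clock times: if the first run is
    -- much slower than the next two, these rows were probably not in memory.
    SELECT SQL_NO_CACHE COUNT(*)
    FROM   orders
    WHERE  created_at < '2015-01-01';   -- the slice you suspect is never read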
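
And for the slow log, the relevant settings look roughly like this. On RDS you would normally change them through the DB parameter group rather than SET GLOBAL, and a global long_query_time only affects connections opened after the change.

    SET GLOBAL slow_query_log  = 1;   -- enable the slow query log
    SET GLOBAL long_query_time = 0;   -- 0 seconds = log every query
    -- log_output decides whether entries go to a file or to the mysql.slow_log
    -- table; either way you can then filter for statements that mention a
    -- given table and replay only the SELECTs.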