What architecture is best to store a large 3-dimensional array in MySQL?

architecture, database-design, innodb, mysql

I am looking to create a large 3-dimensional database in MySQL. The structure will basically be a standard MySQL table, with a time component/dimension added. See the following analogy:

{ x, y, z } = { column, row, time }

The z dimension will be time, and I would like to store as much as possible. We are hoping for approximately 1TB of total stored data with modest performance. In other words, we are looking to take a snapshot of one MySQL table, every 30 seconds, for years. It will be the same table every time, about 10 columns {x} by 1000 rows {y} (approximately 50KB). So, if you will, we want to take a snapshot of a 50KB table every 30 seconds.

This creates the following problem: avoiding the need to store an endless number of tables. I have read in various posts on Stack Exchange that it is bad architecture to have millions of tables in a database, and that performance will suffer under such a design. So here are the two possible architectures I can think of:

  1. Create a new table for every snapshot, named with its epoch timestamp, endlessly creating millions of new tables (not good).

  2. Create one table with two columns: epoch_time and json. For every snapshot of the original table, every 30 seconds, serialize the entire table into a JSON string and store it in the json column. The result is a single table with millions of rows, each containing a JSON-serialized copy of the source table (see the sketch below).
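A minimal sketch of what option 2 could look like, assuming MySQL 5.7+ for the native JSON type (the table and column names are illustrative, not prescriptive):

    -- Hypothetical schema for option 2: one row per 30-second snapshot.
    CREATE TABLE snapshots (
        epoch_time INT UNSIGNED NOT NULL,  -- Unix timestamp of the snapshot
        json       JSON NOT NULL,          -- the entire ~50KB table, serialized
        PRIMARY KEY (epoch_time)
    ) ENGINE=InnoDB;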

Would number 2 be the best architecture? Is there a better way that I may be missing?

Best Answer

Deltas

The data you are "snapshotting" -- how often does it change?

I suggest looking into storing only the "deltas". When some piece of the snapshot does not change at all, the delta is empty, and you can store nothing.

For reconstructing a snapshot at some point in the past, the processing is costly -- you need to walk through the versions, applying the deltas as you go.

There are two ways to run the deltas -- forward or backward. Going 'forward', you start with an original (complete) snapshot, then apply deltas until you reach the desired time. Going 'backward', you start from the most recent snapshot, which is always complete, and 'subtract' off the changes until you reach the desired time.
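As a rough illustration, a delta table might record only the cells that changed between snapshots (all names here are assumptions for the sketch, not a prescribed schema):

    -- Hypothetical delta table: one row per changed cell per interval.
    CREATE TABLE deltas (
        epoch_time INT UNSIGNED NOT NULL,  -- when the change was observed
        row_id     INT UNSIGNED NOT NULL,  -- y: which of the ~1000 rows
        col_name   VARCHAR(64)  NOT NULL,  -- x: which of the ~10 columns
        new_value  VARCHAR(255) NULL,      -- the cell's value after the change
        PRIMARY KEY (epoch_time, row_id, col_name)
    ) ENGINE=InnoDB;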

Since you say "for years", it is probably wise to take a complete snapshot every, say, day. Then finding a particular 30-second snapshot won't involve more than 2,880 deltas (86,400 seconds per day / 30 seconds). This is a speed/space tradeoff -- full snapshots are bulky, but infrequent full snapshots lead to long reconstruction times.
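Reconstruction could then look something like the following, assuming a full_snapshots table alongside the deltas table sketched above (names are illustrative); the application layer replays the returned deltas, in order, onto the base snapshot:

    -- @target is the Unix timestamp to reconstruct.
    SET @target = UNIX_TIMESTAMP('2023-06-01 12:34:30');

    -- Fetch every delta after the latest full snapshot at or before @target;
    -- replay these on top of that full snapshot.
    SELECT epoch_time, row_id, col_name, new_value
    FROM deltas
    WHERE epoch_time > (SELECT MAX(epoch_time)
                        FROM full_snapshots
                        WHERE epoch_time <= @target)
      AND epoch_time <= @target
    ORDER BY epoch_time;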

Trigger

Rather than "snapshotting", use a TRIGGER to build an "audit trail". This is similar to the "deltas" I mentioned, but it is better in that it is continuous, not "every 30 seconds". The case I remember had over a billion rows in the audit trail; each row had (approximately) the timestamp, table name, PRIMARY KEY, and a compressed JSON blob of all the columns for that row. Your needs may be better served by some variant of that.
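A bare-bones sketch of that pattern, assuming the monitored table is called source_data with primary key id and columns col1 and col2 (all hypothetical names), and MySQL 5.7+ for JSON_OBJECT. The JSON here is stored uncompressed; compression could be added at the application layer, or via COMPRESS() into a BLOB column:

    -- Hypothetical audit table: one row per change to any monitored table.
    CREATE TABLE audit_trail (
        changed_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
        table_name VARCHAR(64)  NOT NULL,
        pk_value   INT UNSIGNED NOT NULL,
        row_json   JSON NOT NULL,          -- all columns of the changed row
        KEY (table_name, pk_value, changed_at)
    ) ENGINE=InnoDB;

    DELIMITER //
    CREATE TRIGGER source_data_audit
    AFTER UPDATE ON source_data
    FOR EACH ROW
    BEGIN
        INSERT INTO audit_trail (table_name, pk_value, row_json)
        VALUES ('source_data', NEW.id,
                JSON_OBJECT('id', NEW.id, 'col1', NEW.col1, 'col2', NEW.col2));
    END//
    DELIMITER ;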

Schema

Until I see the actual queries, I will advise against PARTITIONing since it is usually of no performance benefit.

One table size limit worth knowing: 64TB is the limit for a single non-partitioned InnoDB table.