Mysql – Data Warehouse: How to i query daily snapshots

database-designMySQL

I have some snapshots of a database that are not timeseries. For example:

  • Snapshot day 1:

    +----+---------------+------------+------------+        
    | ID |     Title     |  Category  |    Date    |
    +----+---------------+------------+------------+
    | 1  | My First Post | helloworld | 2015-01-01 |
    +----+---------------+------------+------------+
    
  • Snapshot day 2 (A new post is added today):

    +----+----------------+------------+------------+        
    | ID |      Title     |  Category  |    Date    |
    +----+----------------+------------+------------+
    | 1  | My first post  | helloworld | 2015-01-01 |
    | 2  | My second post | other      | 2015-01-02 |
    +----+----------------+------------+------------+
    
  • Snapshot day 3 (Post 2 is removed today):

    +----+---------------+------------+------------+        
    | ID |     Title     |  Category  |    Date    |
    +----+---------------+------------+------------+
    | 1  | My First Post | helloworld | 2015-01-01 |
    +----+---------------+------------+------------+
    

So between days, a row of the table may or not may be constant. Now, I need to be able to use a query like this:

SELECT category, COUNT(*) from day1.My_table group by category

This is for one table of one day. If we want to count the daily average of posts by category in a month we should do something like:

SELECT category, SUM(cnt) / 30 
from ( 
    SELECT category, COUNT(*) as cnt 
    from day1.My_table 
    group by category 
        UNION ALL SELECT category, COUNT(*) as cnt 
                  from day2.My_table 
                  group by category 
        UNION ALL ... 
        UNION ALL SELECT category, COUNT(*) as cnt 
                  from day30.My_table 
                  group by category
) group by category

Another example, the number of post published in a month:

SELECT COUNT(distinct id) 
from ( 
    SELECT id 
    from day1.My_table 
    UNION ALL ... 
    UNION ALL SELECT id 
              from day30.My_table
) 

Basically we would need to consider a weight. If we have day1.My_table and day5.My_table, every post that is in day1 and not in day5 will be counted as it was also in day 2,3,4. Every post that is day1 and day5 will count as if it is in every day of the month (= until the next snapshot).

So in case i would like to consider the average number of post per day of >=6 months a go, where I have just 1 snapshot, I would assign to that snapshot a weight of 30.

So, the average post published in a month for a range >= 6 months ago is:

SELECT category, SUM(cnt) / 30 
from ( 
    SELECT category, COUNT(*)*30 as cnt 
    from day1.My_table 
    group by category --- Note: I'm not considering the range defined from the user in this example.
) group by category;

As comment also stated, I would need to do a query like:

Select category, AVG(*) 
from [fromRange-toRange].MyTable; 

For an extreme solution, I'm considering the idea of implementing a metalanguage to let the future user (ex. marketng people) to do a query like this.

Do you think there is a way to accomplish this in Drill without the meta-language? I would do this using a recursive UDF but they can't return queries.

Every snapshot is large 250GB, and I want to be able to compare these dataset with other external data (I don't know beforehand the scheme of these dataset).

Is there a solution suitable for Apache Drill? Or is there another solution for this problem?

Also any meta-language or paper about this problem is appreciated.

Edit:
We haven't transactional data. We have data that change in time, and can be added or removed; for this reason we need everyday snapshots. Also we don't know beforehand the queries that will be performed, so we can't know what type of aggregation to be done.
Also every row has about 100 columns, and there are say 250GB per snapshot (Mysql tables).
We also need full-text search on this data on every row, on every possible day.

An example of search could be "How many post were about sometopic?" So it has to search all posts for the sometopic keyword. Every snapshot may or not have the same rows. Also two snapshots could have the same post, but slightly modified.

Best Answer

Let's think out of the box. Instead of having a "snapshot", let's have a "log". What you currently have is "current" state of things; adding a "log" would provide the "history", from which could be derived the 'lost' info.

One way to implement the log is to have a TRIGGER on INSERT or UPDATE of the table, and have the trigger write to the log file. This log will not be pleasant for the ad hoc queries, so have a nightly job (or maybe hourly) that summarizes the changes for the day -- net gain (or loss) of number of posts, etc. The "day2" info and the "last month" info can then be derived from this summary table quite quickly. Or perhaps a second level of summarization that declares what the state was for each day. I doubt if UNION would be needed. The "snapshot" would not be involved.