Mysql – Data Warehouse: How to i query daily snapshots

database-designMySQL

I have some snapshots of a database that are not timeseries. For example:

Snapshot day 1:

+----+---------------+------------+------------+        
| ID |     Title     |  Category  |    Date    |
+----+---------------+------------+------------+
| 1  | My First Post | helloworld | 2015-01-01 |
+----+---------------+------------+------------+

Snapshot day 2 (A new post is added today):

+----+----------------+------------+------------+        
| ID |      Title     |  Category  |    Date    |
+----+----------------+------------+------------+
| 1  | My first post  | helloworld | 2015-01-01 |
| 2  | My second post | other      | 2015-01-02 |
+----+----------------+------------+------------+

Snapshot day 3 (Post 2 is removed today):

+----+---------------+------------+------------+        
| ID |     Title     |  Category  |    Date    |
+----+---------------+------------+------------+
| 1  | My First Post | helloworld | 2015-01-01 |
+----+---------------+------------+------------+

So between days, a row of the table may or not may be constant. Now, I need to be able to use a query like this:

SELECT category, COUNT(*) from day1.My_table group by category

This is for one table of one day. If we want to count the daily average of posts by category in a month we should do something like:

SELECT category, SUM(cnt) / 30 
from ( 
    SELECT category, COUNT(*) as cnt 
    from day1.My_table 
    group by category 
        UNION ALL SELECT category, COUNT(*) as cnt 
                  from day2.My_table 
                  group by category 
        UNION ALL ... 
        UNION ALL SELECT category, COUNT(*) as cnt 
                  from day30.My_table 
                  group by category
) group by category

Another example, the number of post published in a month:

SELECT COUNT(distinct id) 
from ( 
    SELECT id 
    from day1.My_table 
    UNION ALL ... 
    UNION ALL SELECT id 
              from day30.My_table
)

Basically we would need to consider a weight. If we have day1.My_table and day5.My_table, every post that is in day1 and not in day5 will be counted as it was also in day 2,3,4. Every post that is day1 and day5 will count as if it is in every day of the month (= until the next snapshot).

So in case i would like to consider the average number of post per day of >=6 months a go, where I have just 1 snapshot, I would assign to that snapshot a weight of 30.

So, the average post published in a month for a range >= 6 months ago is:

SELECT category, SUM(cnt) / 30 
from ( 
    SELECT category, COUNT(*)*30 as cnt 
    from day1.My_table 
    group by category --- Note: I'm not considering the range defined from the user in this example.
) group by category;

As comment also stated, I would need to do a query like:

Select category, AVG(*) 
from [fromRange-toRange].MyTable;

For an extreme solution, I'm considering the idea of implementing a metalanguage to let the future user (ex. marketng people) to do a query like this.

Do you think there is a way to accomplish this in Drill without the meta-language? I would do this using a recursive UDF but they can't return queries.

Every snapshot is large 250GB, and I want to be able to compare these dataset with other external data (I don't know beforehand the scheme of these dataset).

Is there a solution suitable for Apache Drill? Or is there another solution for this problem?

Also any meta-language or paper about this problem is appreciated.

Edit:
We haven't transactional data. We have data that change in time, and can be added or removed; for this reason we need everyday snapshots. Also we don't know beforehand the queries that will be performed, so we can't know what type of aggregation to be done.
Also every row has about 100 columns, and there are say 250GB per snapshot (Mysql tables).
We also need full-text search on this data on every row, on every possible day.

An example of search could be "How many post were about sometopic?" So it has to search all posts for the sometopic keyword. Every snapshot may or not have the same rows. Also two snapshots could have the same post, but slightly modified.

Best Answer

Let's think out of the box. Instead of having a "snapshot", let's have a "log". What you currently have is "current" state of things; adding a "log" would provide the "history", from which could be derived the 'lost' info.

One way to implement the log is to have a TRIGGER on INSERT or UPDATE of the table, and have the trigger write to the log file. This log will not be pleasant for the ad hoc queries, so have a nightly job (or maybe hourly) that summarizes the changes for the day -- net gain (or loss) of number of posts, etc. The "day2" info and the "last month" info can then be derived from this summary table quite quickly. Or perhaps a second level of summarization that declares what the state was for each day. I doubt if UNION would be needed. The "snapshot" would not be involved.

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

I don't see a lot of opportunity for improvement.

The index you added was probably a big help, because it's being used for the range matching on the WHERE clause (type => range, key => tran_date), and it's being used as a covering index (extra => using index), avoiding the need to seek into the table to fetch the row data.

But since you're using functions to construct the financial_year value for the group by, both the "using filesort" and "using temporary" can't be avoided. But, those aren't the real problem. The real problem is that you're evaluating MONTH(tran_date) 346,485 times and YEAR(tran_date) at least that many times... ~700,000 function calls in one second doesn't seem too bad.

Plan B: I am definitely not a fan of storing redundant data, and I'm dead-set against making the application responsible for maintaining it... but one option I might be tempted to try would be to create a dashboard_stats_by_financial_year table, and use after-insert/update/delete triggers on the transactions1 table to manage keeping those stats current.

That option has a cost, of course -- adding to the amount of time it takes to update/insert/delete a transaction... but, waiting > 1200 milliseconds for stats for your dashboard is a cost, too. So it may come down to whether you want to pay for it now or pay for it later.

Mysql – Get week/month data

If you are looking to return the average val for each week and every month that has a date in the table, then you can implement the following.

Sample data:

CREATE TABLE yourtable (`day` datetime, `val` int);

INSERT INTO yourtable (`day`, `val`)
VALUES
    ('2012-01-01 00:00:00', 465),
    ('2012-01-02 00:00:00', 896),
    ('2012-08-15 00:00:00', 678),
    ('2012-09-01 00:00:00', 324),
    ('2012-12-02 00:00:00', 74),
    ('2012-12-03 00:00:00', 65),
    ('2012-12-04 00:00:00', 35);

Weekly Average:

To get the average by week, I would GROUP BY both the week number and the year, using the MySQL functions week() and year(). Then if you have data from multiple years the average will be shown by year as well (otherwise your values could be skewed). Your query would be similar to this:

select week(day) Week,
  year(day) Year,
  avg(val) WeekAvg
from yourtable
group by week(day), year(day)

See SQL Fiddle with Demo. The result is:

| WEEK | YEAR | WEEKAVG |
-------------------------
|    1 | 2012 |   680.5 |
|   33 | 2012 |     678 |
|   35 | 2012 |     324 |
|   49 | 2012 |      58 |

Monthly Average:

Now to get the average val by month, you can use the MySQL functions monthname() and year().

select monthname(day) Month, 
  year(day) Year, 
  avg(val) AvgValue
from yourtable
group by monthname(day), year(day)

See SQL Fiddle with Demo. The result is:

|     MONTH | YEAR | AVGVALUE |
-------------------------------
|    August | 2012 |      678 |
|  December | 2012 |       58 |
|   January | 2012 |    680.5 |
| September | 2012 |      324 |

Combined Result:

You did not specify if you want this data in the same query. But if you do, then you can easily use a UNION ALL to consolidate the data into one result set. In the UNION ALL query, I added a column to identify the data as either month or week (but you can remove that if needed):

select monthname(day) Month, 
  year(day) Year, 
  avg(val) AvgValue,
  'MonthAvg' Source
from yourtable
group by monthname(day), year(day)
union all
select week(day) Week,
  year(day) Year,
  avg(val) WeekAvg,
  'WeekAvg' Source
from yourtable
group by week(day), year(day)

See SQL Fiddle with Demo The UNION ALL will produce a combined result set similar to this:

|     MONTH | YEAR | AVGVALUE |   SOURCE |
------------------------------------------
|    August | 2012 |      678 | MonthAvg |
|  December | 2012 |       58 | MonthAvg |
|   January | 2012 |    680.5 | MonthAvg |
| September | 2012 |      324 | MonthAvg |
|         1 | 2012 |    680.5 |  WeekAvg |
|        33 | 2012 |      678 |  WeekAvg |
|        35 | 2012 |      324 |  WeekAvg |
|        49 | 2012 |       58 |  WeekAvg |

Best Answer

Related Solutions

MySQL optimization – year column grouping – using temporary table, filesort

Mysql – Get week/month data

Related Question