SQL Server – Best way of aggregating, storing and using data in SQL Server (triggers, scheduled jobs, SSAS?)

sql-server, sql-server-2008-r2

We have a table with spatial data and some attributes and need to aggregate it so it can be used in various queries. We have a mechanism to do the aggregation on-the-fly via CLR stored procedures, but I was wondering what the best way of caching this aggregated data in SQL Server 2008 R2 would be, as the aggregation is very slow.

So we have a table like this:

CREATE TABLE [dbo].[data] (   -- table name assumed for illustration
    [TableId] [int] NOT NULL,
    [FeatureId] [int] NOT NULL,
    [MeasureFrom] [float] NOT NULL,
    [MeasureTo] [float] NOT NULL,
    [Value] [smallint] NOT NULL,
    [Timestamp] [datetime2](7) NOT NULL
);

FeatureId, MeasureFrom and MeasureTo basically define an interval, and Value is the value for that interval. Note that FeatureId actually references real lines (shapes) from a different table. We currently have about 250k rows in this table, with overlapping intervals and various timestamps (historical records). Our aggregation process tries to find the latest Value across all intervals, so it needs to cut and join intervals based on timestamp while keeping the related value.

As the on-the-fly querying is slow and takes a lot of resources, we thought we might need to create another database to store the results of this aggregation. The source table is constantly updated (but not too often, once every couple of days).

What would be the best way of creating this database and keeping the aggregated values up to date? I can think of using triggers or a scheduled task. I have no experience with SSAS, but would that be more suitable?

Just to note, we have a few similar tables storing slightly different values (some with more than one Value column), hence the reason for a separate database rather than another table in the original database.

How fast would querying this separate database from the original database be, using cross-database joins?

Edited:
To demonstrate what our "aggregation" does, here is some sample data:

FeatureId | MeasureFrom | MeasureTo | Value | Timestamp
1         | 1           | 20        | 2     | 2015-01-01
1         | 5           | 15        | 3     | 2015-01-02
1         | 9           | 10        | 8     | 2015-01-03

And the result we are getting:

FeatureId | MeasureFrom | MeasureTo | Value | Timestamp
1         | 1           | 5         | 2     | 2015-01-01
1         | 5           | 9         | 3     | 2015-01-02
1         | 9           | 10        | 8     | 2015-01-03
1         | 10          | 15        | 3     | 2015-01-02
1         | 15          | 20        | 2     | 2015-01-01

As you can see, the result actually contains more rows than the original data. Basically, we need the latest Value at any point within the measured range. The latest value is determined based on the Timestamp column. When there is a gap between intervals, it should propagate to the resulting dataset.

The code is actually written in .NET and is quite complicated. The algorithm is as follows: get all data for the feature of interest, order the records by timestamp descending, and process them in a loop. Take the interval (MeasureFrom, MeasureTo), intersect it with the result built so far, and if you get some part that is not yet covered, add it to the result. Continue with the next record. The data represents surveys done regularly, so the further back you go in time, the more data is thrown away, because newer Values already cover the same intervals.

Surveys come in with random Timestamps, so the most recently received survey is not always the latest one. The same process is then repeated for other features, as you might be interested in getting values for a couple of features at once. And this is the result people would query often.
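For illustration only, here is a rough T-SQL sketch of this sweep for a single feature (our real implementation is the .NET code described above; the #covered temp table, the cursor and the variable names are just to show the logic, treating intervals as half-open):

-- Sweep the surveys of one feature from newest to oldest, keeping only
-- the parts of each interval that no newer survey already covers.
DECLARE @FeatureId int = 1;
DECLARE @mf float, @mt float, @v smallint, @ts datetime2(7);

CREATE TABLE #covered (
    MeasureFrom float NOT NULL,
    MeasureTo   float NOT NULL,
    Value       smallint NOT NULL,
    [Timestamp] datetime2(7) NOT NULL
);

DECLARE surveys CURSOR LOCAL FAST_FORWARD FOR
    SELECT MeasureFrom, MeasureTo, Value, [Timestamp]
    FROM data
    WHERE FeatureId = @FeatureId
    ORDER BY [Timestamp] DESC;    -- newest survey wins

OPEN surveys;
FETCH NEXT FROM surveys INTO @mf, @mt, @v, @ts;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO #covered (MeasureFrom, MeasureTo, Value, [Timestamp])
    SELECT g.GapFrom, g.GapTo, @v, @ts
    FROM (
        -- candidate gap starts: @mf itself plus the end of every covered
        -- piece lying inside the new interval
        SELECT p.pt AS GapFrom
             , ISNULL((SELECT MIN(c2.MeasureFrom) FROM #covered c2
                       WHERE c2.MeasureFrom >= p.pt AND c2.MeasureFrom < @mt), @mt) AS GapTo
        FROM (
            SELECT @mf AS pt
            UNION ALL
            SELECT MeasureTo FROM #covered WHERE MeasureTo > @mf AND MeasureTo < @mt
        ) p
        WHERE NOT EXISTS (SELECT 1 FROM #covered c
                          WHERE c.MeasureFrom <= p.pt AND c.MeasureTo > p.pt)
    ) g
    WHERE g.GapFrom < g.GapTo;

    FETCH NEXT FROM surveys INTO @mf, @mt, @v, @ts;
END;

CLOSE surveys;
DEALLOCATE surveys;

SELECT @FeatureId AS FeatureId, MeasureFrom, MeasureTo, Value, [Timestamp]
FROM #covered
ORDER BY MeasureFrom;

DROP TABLE #covered;

Running this against the three sample rows above produces exactly the five result rows shown.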

Ideally we would like to keep the aggregation result as close to being real-time as possible.

To explain the required algorithm a bit more, see the link posted by @dnoeth (in the comment below): http://sqlmag.com/sql-server/packing-intervals-priorities. It solves a similar problem, with the difference that it uses priorities instead of Timestamps, and the solution described there uses features from SQL Server 2012.

Best Answer

The aggregate table can probably stay in the same DB. You can however create it on a separate filegroup and disk.

Full update of a new aggregate table

One way of doing it using pure SQL and your sample:

WITH list ([Type], [FeatureId], [Measure], [Value], [Timestamp], [ID]) as (
    SELECT *
        , ROW_NUMBER() OVER(PARTITION BY [FeatureId] ORDER BY [Measure]) 
    FROM (
        SELECT 0, [FeatureId], [MeasureFrom], [Value], [Timestamp] FROM data
        UNION ALL
        SELECT 1, [FeatureId], [MeasureTo], [Value], [Timestamp] FROM data
    ) l([Type], [FeatureId], [Measure], [Value], [Timestamp])
)
SELECT l1.FeatureId, [MeasureFrom] = l1.Measure, [MeasureTo] = l2.Measure
    , [Value] = CASE WHEN l1.Type = 0 THEN l1.Value ELSE l2.Value END
    , [Timestamp] = CASE WHEN l1.Type = 0 THEN l1.Timestamp ELSE l2.Timestamp END
FROM list l1
INNER JOIN list l2 ON l1.FeatureId = l2.FeatureId AND l1.ID+1 = l2.ID
;

With this really small sample, I am not sure it covers all your needs. It may help to add a bigger sample with more data.

It does 4 table scans. Since you have 250k rows, it may not perform so well. It will probably be better to run it in batches of X consecutive FeatureIds.

This could be done using a job and an SSIS package, with either a single update or a batch update. You would have to truncate the aggregate table first.
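As a sketch, the refresh step could look like this, assuming the aggregate table is named data_agg (the name is made up here) and using a FeatureId range for the batch variant:

DECLARE @FirstId int = 1, @LastId int = 10000;  -- batch of consecutive FeatureIds

TRUNCATE TABLE data_agg;  -- with batching, DELETE the FeatureId range instead

WITH list ([Type], [FeatureId], [Measure], [Value], [Timestamp], [ID]) as (
    SELECT *
        , ROW_NUMBER() OVER(PARTITION BY [FeatureId] ORDER BY [Measure])
    FROM (
        SELECT 0, [FeatureId], [MeasureFrom], [Value], [Timestamp] FROM data
        WHERE [FeatureId] BETWEEN @FirstId AND @LastId
        UNION ALL
        SELECT 1, [FeatureId], [MeasureTo], [Value], [Timestamp] FROM data
        WHERE [FeatureId] BETWEEN @FirstId AND @LastId
    ) l([Type], [FeatureId], [Measure], [Value], [Timestamp])
)
INSERT INTO data_agg (FeatureId, MeasureFrom, MeasureTo, Value, [Timestamp])
SELECT l1.FeatureId, l1.Measure, l2.Measure
    , CASE WHEN l1.Type = 0 THEN l1.Value ELSE l2.Value END
    , CASE WHEN l1.Type = 0 THEN l1.Timestamp ELSE l2.Timestamp END
FROM list l1
INNER JOIN list l2 ON l1.FeatureId = l2.FeatureId AND l1.ID+1 = l2.ID
;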

Trigger on each new row in main table

For new rows, using a trigger, this query could be used:

WITH new([FeatureId], [MeasureFrom], [MeasureTo], [Value], [Timestamp]) as (
--    SELECT 1, 1, 20, 2, '2015-01-01'
--  SELECT 1, 5, 15, 3, '2015-01-02'
    SELECT 1, 9, 10, 8, '2015-01-03'
), gap([FeatureId], mn, mx) as (
    SELECT n.[FeatureId], mn.mn, mx.mx
    FROM new n
    CROSS APPLY (SELECT mn = MAX([MeasureFrom]) FROM data3 WHERE [FeatureId] = n.[FeatureId] AND [MeasureFrom] < n.MeasureFrom) mn
    CROSS APPLY (SELECT mx = MIN([MeasureTo]) FROM data3 WHERE [FeatureId] = n.[FeatureId] AND [MeasureTo] > n.MeasureTo) mx
), list (x,[FeatureId], [Measure], [Value], [Timestamp], [ID]) as (
    SELECT *
        , ROW_NUMBER() OVER(PARTITION BY [FeatureId] ORDER BY [Measure]) 
    FROM (
        SELECT 0, d.[FeatureId], d.[MeasureFrom], d.[Value], d.[Timestamp] 
        FROM data3 d
        INNER JOIN gap g ON d.[FeatureId] = g.[FeatureId] AND (d.MeasureFrom >= g.mn AND d.MeasureFrom < g.mx)
        UNION ALL
        SELECT 1, d.[FeatureId], d.[MeasureTo], d.[Value], d.[Timestamp] 
        FROM data3 d
        INNER JOIN gap g ON d.[FeatureId] = g.[FeatureId] AND d.MeasureTo = g.mx
        UNION ALL
        SELECT 2, [FeatureId], [MeasureFrom], [Value], [Timestamp] FROM new
        UNION ALL
        SELECT 3, [FeatureId], [MeasureTo], [Value], [Timestamp] FROM new
    ) l(x, [FeatureId], [Measure], [Value], [Timestamp])
) 
MERGE data3 AS target
USING (
    SELECT l1.FeatureId, [MeasureFrom] = l1.Measure, [MeasureTo] = l2.Measure
        , [Value] = COALESCE(d.[Value], l1.[Value])
        , [Timestamp] = COALESCE(d.[Timestamp], l1.[Timestamp])
    FROM list l1
    INNER JOIN list l2 ON l1.FeatureId = l2.FeatureId AND l1.ID+1 = l2.ID
    LEFT JOIN data3 d ON d.MeasureFrom = l1.Measure OR d.MeasureTo = l2.Measure
) as source(FeatureId, [MeasureFrom], [MeasureTo], [Value], [Timestamp])
ON (target.FeatureId = source.FeatureId AND target.[MeasureFrom] = source.[MeasureFrom])
WHEN MATCHED THEN
    UPDATE SET target.[MeasureTo] = source.[MeasureTo]
WHEN NOT MATCHED BY target THEN
    INSERT (FeatureId, [MeasureFrom], [MeasureTo], [Value], [Timestamp])
    VALUES (source.FeatureId, source.[MeasureFrom], source.[MeasureTo], source.[Value], source.[Timestamp])
;

In the trigger, you must replace the new CTE with the values of the newly inserted row and use it to merge into the aggregate table.
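For illustration, a skeleton of such a trigger could look like the following (the trigger and source table names are assumptions; also note the MERGE above is written for a single new row, so multi-row inserts would need row-by-row processing):

CREATE TRIGGER trg_data_after_insert
ON data
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    /* Paste the gap/list CTEs and the MERGE from the query above here,
       replacing the hard-coded new CTE with the freshly inserted rows:

       WITH new ([FeatureId], [MeasureFrom], [MeasureTo], [Value], [Timestamp]) AS (
           SELECT [FeatureId], [MeasureFrom], [MeasureTo], [Value], [Timestamp]
           FROM inserted
       ), gap (...) AS ( ... )
       MERGE data3 AS target ...
    */
END;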

Scheduled update of newly added rows

If you can find a way to get a list of all newly added rows to the main table since the last update of the aggregate table, you could schedule a job every x minutes or hours and only update what is necessary based on what has been recently added.
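One simple way to get that list is a watermark (a sketch; the AggregateWatermark table and the RowId column are assumptions, not part of the original schema). Since surveys arrive with out-of-order Timestamps, an ever-increasing column such as an IDENTITY or rowversion is a safer watermark than [Timestamp]:

-- One-time setup: remember how far the aggregation has processed.
-- Assumes the main table gets an ever-increasing IDENTITY column RowId.
CREATE TABLE AggregateWatermark (LastRowId bigint NOT NULL);
INSERT INTO AggregateWatermark (LastRowId) VALUES (0);

-- Scheduled job step: pick up only rows added since the last run.
DECLARE @from bigint, @to bigint;
SELECT @from = LastRowId FROM AggregateWatermark;
SELECT @to = ISNULL(MAX(RowId), @from) FROM data;

SELECT [FeatureId], [MeasureFrom], [MeasureTo], [Value], [Timestamp]
FROM data
WHERE RowId > @from AND RowId <= @to;
-- feed these rows through the same MERGE used by the trigger

UPDATE AggregateWatermark SET LastRowId = @to;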

The trigger query will work as well with scheduled updates.

This can be run as a scheduled job or within an SSIS package.