SQL Server – Group By hour over large dataset

query, sql-server

Using MS SQL 2008 I am selecting an averaged field from 2.5 million records, where each record represents one second. MyField is averaged per hour over those one-second records. Of course the server CPU hits 100% and the selection takes too long. I need to possibly save those averaged values so that SQL does not have to select all those records on each request. What can be done?

  SELECT DISTINCT
         CONVERT(VARCHAR, [timestamp], 1)+' '+ CAST(DATEPART(Hh,[timestamp]) as VARCHAR) AS TimeStampHour,
         MIN([timestamp]) as TimeStamp,
         AVG(MyField) As AvgField
    FROM MyData
   WHERE TimeStamp > '4/10/2011'
GROUP BY CONVERT(VARCHAR, [timestamp], 1)+' '+ CAST(DATEPART(Hh,[timestamp]) as VARCHAR)
ORDER BY TimeStamp

Best Answer

The part of the query that is maxing out the CPU for long periods is the functions in the GROUP BY clause, combined with the fact that the grouping will always require an unindexed sort in this instance. While an index on the timestamp field will help the initial filter, the grouping work still has to be performed on every row the filter matches. Using a more efficient function combination to do the same job, as Alex suggests, will speed this up, but a huge inefficiency remains: whatever function combination you use, the query planner cannot come up with anything that an index will help with, so it has to run through every row computing the grouping values first, and only then can it sort the data and compute the aggregates over the resulting groupings.

So the solution is to somehow make the process group by something it can use an index for, or otherwise remove the need to consider all the matching rows at once.

You could maintain an extra column on each row containing the time rounded down to the hour, and index that column for use in such queries. This denormalises your data, so it might feel "dirty", but it would work and would be cleaner than caching all the aggregates for future use (and keeping that cache up to date as the base data is altered). The extra column should be maintained by a trigger or be a persisted computed column, rather than being maintained by logic elsewhere, as this guarantees that every current and future piece of code that inserts data or updates the timestamp column on existing rows produces consistent data in the new column. You can still get the MIN(timestamp) out. The query will still walk all the matching rows (that cannot be avoided, obviously), but it can do so in index order, emitting a row for each grouping as it reaches the next value in the index, instead of having to remember the whole set of rows for an unindexed sort before the grouping/aggregation can be performed. It will also use far less memory, as it does not need to hold on to rows from prior grouping values in order to process the current one or the rest of them.

That method removes the need to find room in memory for the whole result set and perform an unindexed sort for the GROUP BY, and it moves the computation of the grouping values out of the big query (pushing that work onto the individual INSERTs/UPDATEs that produce the data). It should allow such queries to run acceptably without maintaining a separate store of the aggregated results.
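As a rough sketch of the persisted computed-column route (the column and index names below are illustrative, not part of your existing schema):

ALTER TABLE MyData
  ADD TimeStampHour AS DATEADD(HOUR, DATEDIFF(HOUR, 0, [timestamp]), 0) PERSISTED;

CREATE INDEX IX_MyData_TimeStampHour ON MyData (TimeStampHour);

-- The hourly aggregate can then group on the indexed column directly:
SELECT TimeStampHour
     , MIN([timestamp]) AS TimeStamp
     , AVG(MyField) AS AvgField
FROM MyData
WHERE TimeStampHour >= '4/10/2011'
GROUP BY TimeStampHour
ORDER BY TimeStampHour

With the index in place the grouping can be satisfied by walking IX_MyData_TimeStampHour in order rather than by sorting.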

A method that doesn't denormalise your data, but does still require extra structure, is to use a "time table": in this case, one containing one row per hour for the whole span of time you are likely to consider. This table would not consume an appreciable amount of space. To cover a timespan of 100 years, a table containing one row of two DATETIME values (the start and end of each hour, such as '2011-01-01 00:00:00.000' and '2011-01-01 00:59:59.997', the "997" being the largest number of milliseconds a DATETIME field can hold without rounding up to the next second), both of which form the clustered primary key, will take about 14 MB (8+8 bytes per row * 24 hours/day * 365.25 days/year * 100 years, plus a little overhead for the clustered index's tree structure, which won't be significant). With such a table you could do something like the following:

SELECT CONVERT(VARCHAR, tt.StartTime, 1)+' '+ CAST(DATEPART(Hh, tt.StartTime) AS VARCHAR) AS TimeStampHour
     , MIN(md.[timestamp]) AS TimeStamp
     , AVG(md.MyField) AS AvgField
FROM TimeRangeByHours tt
INNER JOIN MyData md ON md.[timestamp] BETWEEN tt.StartTime AND tt.EndTime
WHERE tt.StartTime > '4/10/2011'
GROUP BY tt.StartTime
ORDER BY tt.StartTime

This means the query planner can arrange for the index on MyData.TimeStamp to be used. The query planner should be bright enough to work out that it can walk down the time table in step with the index on MyData.TimeStamp, again emitting one row per grouping and discarding each set of rows as it hits the next grouping value, with no need to store all the intermediate rows in RAM and then perform an unindexed sort on them. Of course this method requires that you create the time table and make sure it spans far enough both backwards and forwards, but you can reuse the time table in queries against many different date fields, whereas the "extra column" option would require an extra computed column for each date field you needed to filter or group by this way. Also, the table's small size (unless you need it to span 10,000 years, in which case it grows to around 1.4 GB) means it is likely to stay in RAM.
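For concreteness, a minimal sketch of creating and populating such a time table (the table and column names match the query above; the recursive-CTE population method is just one way to fill it, not something prescribed here):

CREATE TABLE TimeRangeByHours
(
    StartTime DATETIME NOT NULL,
    EndTime   DATETIME NOT NULL,
    CONSTRAINT PK_TimeRangeByHours PRIMARY KEY CLUSTERED (StartTime, EndTime)
);

-- Fill it for, say, 2000-01-01 up to (but not including) 2100-01-01:
WITH Hours AS
(
    SELECT CAST('2000-01-01' AS DATETIME) AS StartTime
    UNION ALL
    SELECT DATEADD(HOUR, 1, StartTime)
    FROM Hours
    WHERE DATEADD(HOUR, 1, StartTime) < '2100-01-01'
)
INSERT INTO TimeRangeByHours (StartTime, EndTime)
SELECT StartTime
     , DATEADD(MILLISECOND, -3, DATEADD(HOUR, 1, StartTime)) AS EndTime  -- xx:59:59.997
FROM Hours
OPTION (MAXRECURSION 0);  -- the range is far deeper than the default recursion limit of 100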

The time table method has an extra difference (that could be quite advantageous) compared to your current situation and the computed-column solution: it can return rows for periods in which there is no data, simply by changing the INNER JOIN in the example query above to a LEFT OUTER one.
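For example (the same query as above with only the join type changed; hours with no data come back with NULL aggregates):

SELECT CONVERT(VARCHAR, tt.StartTime, 1)+' '+ CAST(DATEPART(Hh, tt.StartTime) AS VARCHAR) AS TimeStampHour
     , MIN(md.[timestamp]) AS TimeStamp   -- NULL for empty hours
     , AVG(md.MyField) AS AvgField        -- NULL for empty hours
FROM TimeRangeByHours tt
LEFT OUTER JOIN MyData md ON md.[timestamp] BETWEEN tt.StartTime AND tt.EndTime
WHERE tt.StartTime > '4/10/2011'
GROUP BY tt.StartTime
ORDER BY tt.StartTime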

Some people suggest not having a physical time table at all, but instead always producing one from a table-returning function. This means the time table's contents are never stored on (or read from) disk, and if the function is well written you never have to worry about how far the time table needs to span backwards and forwards in time. However, I doubt the CPU cost of producing an in-memory table of some rows on every query is worth the small saving in hassle of creating (and maintaining, should its timespan need to extend beyond your initial version's limits) the physical time table.
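If you did want to try the function route, an inline table-valued function along these lines would do the job (the function name and the trick of generating row numbers from sys.all_objects are my own illustration; it assumes @from <= @to):

CREATE FUNCTION dbo.HoursBetween (@from DATETIME, @to DATETIME)
RETURNS TABLE
AS
RETURN
(
    -- One row per hour from @from (rounded down to the hour) up to @to.
    SELECT DATEADD(HOUR, n.HourOffset, DATEADD(HOUR, DATEDIFF(HOUR, 0, @from), 0)) AS StartTime
         , DATEADD(MILLISECOND, -3,
               DATEADD(HOUR, n.HourOffset + 1, DATEADD(HOUR, DATEDIFF(HOUR, 0, @from), 0))) AS EndTime
    FROM (SELECT TOP (DATEDIFF(HOUR, @from, @to) + 1)
                 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS HourOffset
          FROM sys.all_objects a CROSS JOIN sys.all_objects b) AS n
);

You would then replace TimeRangeByHours tt in the queries above with dbo.HoursBetween('4/10/2011', GETDATE()) tt.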

A side note: you don't need that DISTINCT clause in your original query either. The grouping already ensures these queries return only one row per period under consideration, so the DISTINCT does nothing other than spin the CPU a little more (unless the query planner notices that the DISTINCT is a no-op, in which case it will ignore it and use no extra CPU time).