Sql-server – SELECT every nth value by Date from large table

sql serversql-server-2012

Rewrote this question 2012-06-20 for clarity, check edits for older version

Background and Question

I collect data for a couple of sensors on some devices. This data I want to display for the user as curves in a graph, i.e. I want to plot the measured points and draw lines between them. The user is able to decide for what date range he wants to show values for by zooming and panning in the graph.

The number of measured values begins to add up and at this moment I have a number of values in the range of 500'000'000, and counting. For a single sensor I can at most find about 2'000'000 values as of now, but It will most likely increase. These values are not logged at an even interval (i.e. every second) but instead when a change can be measured (see the difference in the date column below).

Date  SensorValue
----- -----------
10    123
30    118
70    114
85    115
90    116
95    117

Since the display area for this graph is 1000px there's no point fetching all those values. Instead I try to fetch at most 1000 values and keep them evenly distributed inside the date range the user has zoomed in upon. Exactly which values inside the date range that is returned doesn't matter as long as they are evenly distributed in terms of the date (this is important to remember since they aren't logged at an even interval).

I store these values in a quite simple table as can be seen below, with only one (clustered) index. Notice that the Date is stored as an integer ("unix time") not as a datetime2 field. The table is partitioned by Date if that matters.

CREATE TABLE [dbo].[SensorValues](
  [DeviceId] [int] NOT NULL,
  [SensorId] [int] NOT NULL,
  [SensorValue] [int] NOT NULL,
  [Date] [int] NOT NULL,
CONSTRAINT [PK_SensorValues] PRIMARY KEY CLUSTERED 
(
  [DeviceId] ASC,
  [SensorId] ASC,
  [Date] DESC
);

I'm searching for one of two things, either an improved query that will return these values faster, or a different approach on how to extract those values. If anything is unclear, keep asking me and I will try to answer.

My current approach

To achieve this I calculate the difference in time for the date range I'm fetching data for and divides this by the number of values I want to fetch at most. I store this in the variable I call @resolution below.

When I query the database it calculates and groups the matching rows by the date divided by @resolution and picks out one value per group. I do know that this is not a fool-proof algorithm. In some cases it returns more values than requested, but it doesn't matter that much for me since it at least keeps the number of values down to a manageable level.

For a sensor with small amount of data (tens of thousand of values), this is fairly quick but when it comes to a sensor with much data (millions of rows) it slows down and takes a couple of seconds (at least) to query the database. How should one go about querying for this information in a better way?

SELECT sv.SensorValue, sv.Date
  FROM SensorValues sv
  INNER JOIN (SELECT MAX(svInner.Date) AS svInner.Date
                FROM SensorValues svInner
                WHERE svInner.SensorId = @sensorid
                  AND svInner.DeviceId = @deviceid
                  AND svInner.Date BETWEEN @startdate AND @stopdate
                  AND svInner.Sensorvalue != -32767
                  AND svInner.SensorValue != -32768
                  GROUP BY svInner.Date / @resolution) j ON sv.Date = j.Date
  WHERE sv.SensorId = @sensorid
    AND sv.DeviceId = @deviceid
    AND sv.SensorValue != -32767
    AND sv.SensorValue != -32768
  ORDER BY sv.Date DESC

Query Plan for above query

Example: If I want to select 3 values between time 0 and 95 from the table example above I do the following.

I calculate the @resolution to (95-0)/3=31 and then for each row between time 0 and 95, calculates Date/@resolution (integer division everywhere):

Date  SensorValue Date/@resolution
----- ----------- ----------------
10    123         0
30    118         0
70    114         2
85    115         2
90    116         2
95    117         3

I then select one value from each group by using MAX(Date).

Date  SensorValue Date/@resolution
----- ----------- ----------------
30    118         0
90    116         2
95    117         3

Best Answer

Technically to display n-th value you use the ranking functions like RANK, DENSE_RANK or ROW_NUMBER. Which one exactly depends, but what you describe matches best ROW_NUMBER:

with cte as (
 select row_number() over (order by Date) as rn, *
 from table)
where rn % 1000 = 0;

But you are saying that you want o do this for performance reasons, avoid going through millions of rows. Such a query had already done the damage, it read from disk all the rows and had to sort them so the performance price was already payed.

A better alternative to selecting a random sample of data is to use the TABLESAMPLE clause, see Limiting Result Sets by Using TABLESAMPLE:

select * 
from table tablesample (100 rows);

TABLESAMPLE will be way more efficient as it actually avoids reading all the data, it only samples some of the pages in the table and returns all rows in the sampled pages.

But consider that if you use TABLESAMPLE your WHERE clause are applied after the sampling. So it may be that the sample does not contain any row for the device/sensor you're interested in. This is specially true for sensor with small data.