Sql-server – How to optimize this query over this logging table

database-designdenormalizationperformancesql serversql-server-2012

I am trying to optimize a table layout for a table that logs events.

The log table contains three relevant columns: Timestamp, ItemId, LocationId
Each row means that at a given time, a certain item has been seen at a certain location.

2017-01-01 10:00    Item A has been seen at location 1
2017-01-01 10:01    Item A has been seen at location 1
2017-01-01 11:00    Item B has been seen at location 1
2017-01-01 11:01    Item B has been seen at location 2
2017-01-01 11:02    Item A has been seen at location 2
2017-01-01 11:03    Item B has been seen at location 1

There are about 100 different locations, 20.000 new items per day, a million events per day, and 14 days of logs.

Now I need to run queries on this data, such as:

  • Which items were at location '1' at time '2017-01-01 11:00'
    (= which items have been seen at location 1 before time '2017-01-01 11:00', and were not seen elsewhere after being seen at 1 but before '2017-01-01 11:00'

To get this data I can execute

SELECT DISTINCT  ItemId     
FROM events e1 
WHERE LocationId = 1
  AND e1.TimeStamp < '2017-01-01 11:00'
  AND NOT EXISTS (SELECT 1 FROM events e2
                  WHERE e2.LocationId <> e1.LocationId
                    AND e2.ItemId = e1.ItemId
                    AND e2.TimeStamp >= e1.TimeStamp
                    AND e2.TimeStamp <'2017-01-01 11:00')

enter image description here

Currently, this query takes about 15 seconds, when there is zero load on the database. The goal is to execute this query in less that 100ms, with heavy load. I don't think this is possible with the current design.

I have an index on item and location, and a clustered index on timestamp

Is there a table layout that would allow me to perform this query more efficiently?

Or is there a query that would work with the existing table?

Best Answer

You could try a different query:

SELECT ItemID
  FROM (SELECT ItemID
              ,ROW_NUMBER() OVER (PARTITION BY ItemID ORDER BY TimeStamp DESC) rn
              ,LocationId
          FROM events
         WHERE TimeStamp < '2017-01-01 11:00'
       ) e1
  WHERE LocationId = 1
    AND rn = 1
;

No promises that this will do any better (it could actually be worse); it's just a different approach.

Also - if it makes sense, you may want to put a lower bound on the possible TimeStamp values; if you can ignore everything more than 12 hours prior to the time you're looking for, that could eliminate a large number of rows.