I am trying to optimize the layout of a table that logs events.
The table has three relevant columns: Timestamp, ItemId, and LocationId.
Each row means that, at a given time, a certain item was seen at a certain location:
2017-01-01 10:00 Item A has been seen at location 1
2017-01-01 10:01 Item A has been seen at location 1
2017-01-01 11:00 Item B has been seen at location 1
2017-01-01 11:01 Item B has been seen at location 2
2017-01-01 11:02 Item A has been seen at location 2
2017-01-01 11:03 Item B has been seen at location 1
There are about 100 different locations, 20,000 new items per day, a million events per day, and 14 days of logs.
Now I need to run queries on this data, such as:
- Which items were at location '1' at time '2017-01-01 11:00'?
(That is: which items were seen at location 1 before '2017-01-01 11:00', and were not seen at any other location between that sighting and '2017-01-01 11:00'.)
To get this data I can execute:
SELECT DISTINCT e1.ItemId
FROM events e1
WHERE e1.LocationId = 1
  AND e1.TimeStamp < '2017-01-01 11:00'
  AND NOT EXISTS (SELECT 1
                  FROM events e2
                  WHERE e2.LocationId <> e1.LocationId
                    AND e2.ItemId = e1.ItemId
                    AND e2.TimeStamp >= e1.TimeStamp
                    AND e2.TimeStamp < '2017-01-01 11:00')
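To make the query's semantics concrete, here is a minimal reproduction against the six sample rows above, using an in-memory SQLite database (the question's DBMS is unspecified; this is purely illustrative):

```python
import sqlite3

# In-memory copy of the sample events from the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (Timestamp TEXT, ItemId TEXT, LocationId INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-01-01 10:00", "A", 1),
        ("2017-01-01 10:01", "A", 1),
        ("2017-01-01 11:00", "B", 1),
        ("2017-01-01 11:01", "B", 2),
        ("2017-01-01 11:02", "A", 2),
        ("2017-01-01 11:03", "B", 1),
    ],
)

# The question's query: items seen at location 1 before the cutoff,
# with no sighting elsewhere between that event and the cutoff.
rows = con.execute("""
    SELECT DISTINCT e1.ItemId
    FROM events e1
    WHERE e1.LocationId = 1
      AND e1.Timestamp < '2017-01-01 11:00'
      AND NOT EXISTS (SELECT 1
                      FROM events e2
                      WHERE e2.LocationId <> e1.LocationId
                        AND e2.ItemId = e1.ItemId
                        AND e2.Timestamp >= e1.Timestamp
                        AND e2.Timestamp < '2017-01-01 11:00')
""").fetchall()
print(rows)  # [('A',)] -- only item A was at location 1 as of 11:00
```

Item B is excluded because its first sighting at location 1 is exactly at 11:00, not strictly before it.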
Currently, this query takes about 15 seconds when there is zero load on the database. The goal is to execute it in less than 100 ms, under heavy load. I don't think this is possible with the current design.
I have an index on item and location, and a clustered index on timestamp
Is there a table layout that would allow me to perform this query more efficiently?
Or is there a query that would work with the existing table?
Best Answer
You could try a different query.
No promises that it will do any better (it could actually be worse); it's just a different approach.
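One such alternative — sketched here as an assumption, not necessarily the query the answerer had in mind — picks each item's most recent sighting before the cutoff with a window function (SQLite 3.25+ in this sketch) and keeps the item only if that sighting was at location 1:

```python
import sqlite3

# Sample data from the question, in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (Timestamp TEXT, ItemId TEXT, LocationId INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-01-01 10:00", "A", 1),
        ("2017-01-01 10:01", "A", 1),
        ("2017-01-01 11:00", "B", 1),
        ("2017-01-01 11:01", "B", 2),
        ("2017-01-01 11:02", "A", 2),
        ("2017-01-01 11:03", "B", 1),
    ],
)

# Rank each item's sightings before the cutoff, newest first; an item
# "is at" location 1 iff its newest sighting (rn = 1) was at location 1.
rows = con.execute("""
    SELECT ItemId
    FROM (SELECT ItemId, LocationId,
                 ROW_NUMBER() OVER (PARTITION BY ItemId
                                    ORDER BY Timestamp DESC) AS rn
          FROM events
          WHERE Timestamp < '2017-01-01 11:00') last_seen
    WHERE rn = 1 AND LocationId = 1
""").fetchall()
print(rows)  # [('A',)]
```

This scans the pre-cutoff rows once instead of running a correlated subquery per candidate row; whether that wins in practice depends on the DBMS and the indexes available.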
Also - if it makes sense, you may want to put a lower bound on the possible TimeStamp values; if you can ignore everything more than 12 hours prior to the time you're looking for, that could eliminate a large number of rows.
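As a sketch of that idea, the original query with a hypothetical 12-hour lower bound (the bound value here is illustrative) only ever touches the tail of the log:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (Timestamp TEXT, ItemId TEXT, LocationId INTEGER)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-01-01 10:00", "A", 1),
        ("2017-01-01 10:01", "A", 1),
        ("2017-01-01 11:00", "B", 1),
        ("2017-01-01 11:01", "B", 2),
        ("2017-01-01 11:02", "A", 2),
        ("2017-01-01 11:03", "B", 1),
    ],
)

# Hypothetical 12-hour window ending at the time of interest.
cutoff = "2017-01-01 11:00"
lower = "2016-12-31 23:00"

rows = con.execute("""
    SELECT DISTINCT e1.ItemId
    FROM events e1
    WHERE e1.LocationId = 1
      AND e1.Timestamp >= :lower        -- the added lower bound
      AND e1.Timestamp < :cutoff
      AND NOT EXISTS (SELECT 1
                      FROM events e2
                      WHERE e2.LocationId <> e1.LocationId
                        AND e2.ItemId = e1.ItemId
                        AND e2.Timestamp >= e1.Timestamp
                        AND e2.Timestamp < :cutoff)
""", {"lower": lower, "cutoff": cutoff}).fetchall()
print(rows)  # [('A',)]
```

The trade-off is that an item last seen at location 1 before the window opens would be missed, which is exactly the "if you can ignore everything more than 12 hours prior" caveat in the answer.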