Sql-server – Table design for efficiently retrieving “most recent row” per FK’d item whose PK is a GUID

clustered-index, sql-server

Database is SQLAzure

I have a recent requirement to store incremental status updates against an Order item, whose primary key is a GUID (stored as an nvarchar(36) :/ ). An OrderStatusLog entry might be written at any time; at minimum it would have the OrderId, ChangeDateTime, and the new Status.

There may be hundreds of status updates per order (but realistically, probably only tens) and they're all kept for audit reasons, so that at any point the questions "what were the status changes over time?" and "what was the order status at time X?" can be answered.

The current status (i.e. the most recent row) will also be queried, potentially several hundred times (but again, most likely tens). Generally, a recent Order is much more likely to be queried than an old one, which is where I think the OrderId being a GUID might be a nuisance: there is no innate order to them (nor can I change them to be sequential), so status change records end up scattered all over the storage regardless of when they were created.

I'm not so interested in the date the change occurred for the frequent question of "what is the most recent status?", and auto numbering doesn't have to start at 1 for every order, so the Revision column could just be a single auto-incrementing int and it could answer "what is the most recent row for OrderId X?". For the infrequent question of "what were all the changes, in order?" an auto-increment answers it too. For "what was the status on date X?" the date is useful, but this will seldom be asked.

Does the OrderId being a GUID present a problem no matter what? Is it still worth creating a table that has a clustering index of OrderId (guid) and e.g. Revision (an auto incrementing int) as its key columns (in that order) – is it the ideal arrangement for a historical log of status changes such as this? Or should I not be trying to influence storage order, and instead be saying "well, the likely number of change records will be small, and retrieving and sorting them all every time they're needed is likely fairly inconsequential" and just nonclustering the OrderId and including the date(or revision) and status?

Note: I can't put the current status on the Order table, alas (not my restriction).

Edit: some extra info on the tables:

Order                        --stores orders
Id NVARCHAR(36) PRIMARY KEY, --guid stored as string (yes.. :/ )
... other fields

OrderStatusLog               --tracks all the changes an order went through. An INSERT-only (and SELECT) table
OrderId UNIQUEIDENTIFIER,    --references Order(Id) but not an FK officially because of the type mismatch
ChangeDate DATETIME,         --the time of the change in status
Status                       --the status as of time ChangeDate

I'll frequently want to (100s of times a day):

SELECT TOP 1 Status FROM OrderStatusLog WHERE OrderId = @id ORDER BY ChangeDate DESC

Infrequently (10s a day):

SELECT TOP 1 Status FROM OrderStatusLog WHERE OrderId = @id AND ChangeDate < @d ORDER BY ChangeDate DESC

Seldom (1 a day):

SELECT * FROM OrderStatusLog WHERE OrderId = @id ORDER BY ChangeDate DESC

99% of the time these queries would be operating on rows inserted within the past 24 hours.

I'm trying to decide how to index this log table given that it has a GUID. My understanding of indexing in SQLS is that a clustered index influences the order of storage of rows on disk, and using GUIDs causes a bit of a nuisance with reorganizing because a GUID may naturally want to sit in the middle of existing data, meaning that data pages end up splitting. I'd like to know the resource implications of clustering vs nonclustering: whether one indexing strategy makes inserts more costly (because, on average, they cause more data to be shuffled around disk) but makes selecting the data faster or cheaper, versus another strategy where the insert might be cheap but the cost of selecting is much higher.

At this time the use case is just for those shown; no other uses are planned

This log table doesn't have, or seemingly need, a primary key. Although it is infeasible that two updates would happen in the same millisecond (or whatever), as a more generic solution I'd like to know how to handle situations where they could (i.e. I'm thinking an auto-incrementing int column could help satisfy ORDER BY x DESC for "most recent", and then the "could be simultaneous" concern is moot).
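
The tie-breaking idea can be tried out with a small sketch. It uses Python's sqlite3 module as a stand-in for SQL Server, so it only illustrates the ordering logic, not SQL Server storage behaviour. The Revision column (playing the role of an INT IDENTITY column), the sample GUID and the status values are all assumptions made for the illustration.

```python
import sqlite3

# SQLite stands in for SQL Server here; Revision plays the role of an
# INT IDENTITY column (a hypothetical addition to the question's table).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE OrderStatusLog (
        Revision   INTEGER PRIMARY KEY AUTOINCREMENT,
        OrderId    TEXT NOT NULL,
        ChangeDate TEXT NOT NULL,
        Status     TEXT NOT NULL
    )
""")

order_id = "6f9619ff-8b86-d011-b42d-00c04fc964ff"  # made-up GUID
same_instant = "2024-01-01T12:00:00.000"
conn.executemany(
    "INSERT INTO OrderStatusLog (OrderId, ChangeDate, Status) VALUES (?, ?, ?)",
    [
        (order_id, "2024-01-01T11:59:59.000", "Placed"),
        (order_id, same_instant, "Paid"),
        (order_id, same_instant, "Shipped"),  # same timestamp as "Paid"
    ],
)


def latest_status(conn, order_id):
    """Return the most recently inserted status for an order, or None."""
    # T-SQL equivalent: SELECT TOP 1 Status ... ORDER BY Revision DESC
    row = conn.execute(
        "SELECT Status FROM OrderStatusLog WHERE OrderId = ? "
        "ORDER BY Revision DESC LIMIT 1",
        (order_id,),
    ).fetchone()
    return row[0] if row else None


print(latest_status(conn, order_id))
```

Ordering by Revision picks the later of the two rows that share a timestamp, which ordering by ChangeDate alone could not do reliably.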

As I've often advised others, I've looked at doing some benching of this. I created 3 tables of identical columns, made:

one a clustered on OrderId+ChangeDate+Status,
one a nonclustered unique on OrderId+ChangeDate include Status, and
one a nonclustered nonunique OrderId include ChangeDate+Status

I filled each table with the same set of 100,000 records, of approx 10,000 distinct guids and ~10 status changes over a random time frame. Table load performance was roughly the same – about 33 seconds on my local SQL Express from a C# app.

I ran 100,000 iterations of the frequent query against each table, hitting every Guid in each table 10 times with runtimes of 23 seconds (clustered), 22 seconds (nonclustered OI-CD include S) and 28 seconds (nonclustered OI include CD-S).

Index and table storage stats:

After swapping the clustered unique index on all table cols out for a clustered nonunique on OrderId+ChangeDate, and swapping the nonclustered unique on OI+CD out to be unique (and truncating all tables, backing up db, then re-loading it)

And plans:

Best Answer

A slightly different (but probably materially the same) suggestion from Akina's would be to index on (OrderId, ChangeDateTime) INCLUDE (Status), since you only ever select Status and never filter on it directly. This would service queries of the kind:

SELECT TOP 1 status
FROM sourcetable
WHERE orderid = @GUID
-- AND ChangeDateTime <= @ChangeDateTime
ORDER BY ChangeDateTime DESC
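
As a rough, runnable illustration of that index shape, the sketch below again uses Python's sqlite3 module rather than SQL Server. SQLite has no INCLUDE clause, so Status is appended as a trailing key column to make the index covering. The index name, sample GUID and sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE OrderStatusLog (
        OrderId        TEXT NOT NULL,
        ChangeDateTime TEXT NOT NULL,
        Status         TEXT NOT NULL
    )
""")
# SQLite stand-in for the T-SQL form:
#   CREATE INDEX IX_OrderStatusLog
#       ON OrderStatusLog (OrderId, ChangeDateTime) INCLUDE (Status)
conn.execute("""
    CREATE INDEX IX_OrderStatusLog
        ON OrderStatusLog (OrderId, ChangeDateTime, Status)
""")

ORDER_ID = "6f9619ff-8b86-d011-b42d-00c04fc964ff"  # made-up GUID
conn.executemany(
    "INSERT INTO OrderStatusLog VALUES (?, ?, ?)",
    [
        (ORDER_ID, "2024-01-01T09:00:00", "Placed"),
        (ORDER_ID, "2024-01-02T09:00:00", "Paid"),
        (ORDER_ID, "2024-01-03T09:00:00", "Shipped"),
    ],
)

# LIMIT 1 is SQLite's counterpart of TOP 1.
QUERY = (
    "SELECT Status FROM OrderStatusLog "
    "WHERE OrderId = ? AND ChangeDateTime <= ? "
    "ORDER BY ChangeDateTime DESC LIMIT 1"
)


def status_as_of(conn, order_id, as_of):
    """Return the status in force at the given ISO-8601 timestamp, or None."""
    row = conn.execute(QUERY, (order_id, as_of)).fetchone()
    return row[0] if row else None


# The last column of each EXPLAIN QUERY PLAN row is the plan description.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN " + QUERY, (ORDER_ID, "9999-12-31T00:00:00")
).fetchall()
plan = " ".join(str(r[-1]) for r in plan_rows)
print(plan)
print(status_as_of(conn, ORDER_ID, "2024-01-02T12:00:00"))
```

The printed plan should mention the index by name, showing the lookup being answered from the index. On SQL Server the actual execution plan would be the thing to check.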

GUIDs are obviously not the most ideal column to index on because of their data size, and more so because of their randomness resulting in higher fragmentation, but if it's all you have to work with then it's better to index it than not. You'll still see improved query performance with the index on it.

There's probably going to be little difference between adding an INT IDENTITY column to use in your index and using the existing DATETIME field, as they take up similar storage anyway (4 bytes versus 8 bytes per row).

There's also not much inherent difference between CLUSTERED and NONCLUSTERED indexing in your case, but if you have the luxury, and your examples are the most common type of querying that will be done, then go with the CLUSTERED index as that will provide you greater flexibility in the future with minimizing performance hits should you end up needing to SELECT other fields.

The exact index recommendation in that case would then be CREATE UNIQUE CLUSTERED INDEX IX_IndexName ON TableName (OrderId, ChangeDateTime). You can use the UNIQUE clause here only if no two status changes for the same order can ever share a ChangeDateTime (the GUID on its own does not guarantee that); where it holds, it can slightly help performance of the index as well.
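
To see what a unique key on (OrderId, ChangeDateTime) actually enforces, here is a minimal sketch using Python's sqlite3 module. A WITHOUT ROWID table, which stores rows in primary-key order, serves as a rough analogue of a unique clustered index. It is not SQL Server, and the helper name try_insert and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Rough SQLite analogue of the T-SQL form:
#   CREATE UNIQUE CLUSTERED INDEX IX_IndexName
#       ON TableName (OrderId, ChangeDateTime)
conn.execute("""
    CREATE TABLE OrderStatusLog (
        OrderId        TEXT NOT NULL,
        ChangeDateTime TEXT NOT NULL,
        Status         TEXT NOT NULL,
        PRIMARY KEY (OrderId, ChangeDateTime)
    ) WITHOUT ROWID
""")


def try_insert(conn, order_id, changed_at, status):
    """Insert one log row; return False if the unique key rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO OrderStatusLog VALUES (?, ?, ?)",
                (order_id, changed_at, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


order_a = "6f9619ff-8b86-d011-b42d-00c04fc964ff"  # made-up GUIDs
order_b = "0e984725-c51c-4bf4-9960-e1c80e27aba0"

first = try_insert(conn, order_a, "2024-01-01T12:00:00.000", "Paid")
later = try_insert(conn, order_a, "2024-01-01T12:00:01.000", "Shipped")
clash = try_insert(conn, order_a, "2024-01-01T12:00:00.000", "Cancelled")
other = try_insert(conn, order_b, "2024-01-01T12:00:00.000", "Paid")
print(first, later, clash, other)
```

The third insert is rejected because it repeats both the order and the timestamp, so a unique key on this pair is only safe if such collisions cannot happen (or if a tie-breaking column is added to the key).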

Related Solutions

Sql-server – Update and select concurrently running on same set of rows but different indexes are referred

What you are looking at is more of a blocking scenario. Transactions usually happen quickly and you would not notice a difference when an update and select are issued at the same time.

As per your scenario, the update will happen first. Then, depending on your concurrency level, the select will be blocked until the end of the update transaction. In a good-case scenario, this happens at the microsecond level, and the end user does not notice it.

One way a deadlock can occur: the update transaction, while blocking the select transaction, requests a resource that is held by the select transaction.

Sql-server – Storage order vs Result order

Let me try to explain why you should not do that, why you should never assume that an SQL-product will return a result set in a specific order, unless you specify so, whatever indices - clustered or non-clustered, B-trees or R-Trees or k-d-trees or fractal-trees or whatever other exotic indices a DBMS is using.

Your original query tells the DBMS to search the SensorValues table, find rows that match the 3 conditions, order those rows by Date descending, keep only the first row from those and - finally - select and return only the SensorValue column.

SELECT TOP 1 SensorValue
  FROM SensorValues
  WHERE SensorId = 53
    AND DeviceId = 3819
    AND Date < 1339225010
  ORDER BY Date DESC ;

These are very specific orders you have given to the DBMS and the result will most probably be the same every time you run the query (there is a chance it might not be, if more than one row matches the conditions and has the same max Date but a different SensorValue, but let's assume for the rest of the conversation that no such rows exist in your table).

Does the DBMS have to do this, to run this query, the exact way I describe it above? No, of course not and you know that. It may not read the table but read from an index. Or it may use two indexes if it thinks it's better (faster). Or three. Or it may use a cached result (not SQL Server but other DBMS cache query results). Or it may use parallel execution one time and not the next time it runs. Or ... (add any other feature that affects execution and execution plans).

What is guaranteed though is that it will return the exact same result, every time you run it - as long as no rows are inserted, deleted or updated.

Now let's see what your suggestion says:

SELECT TOP 1 SensorValue
  FROM SensorValues
  WHERE SensorId = 53
    AND DeviceId = 3819
    AND Date < 1339225010 ;

This query tells the DBMS to search the SensorValues table, find rows that match the 3 conditions, ~~order those rows by Date descending,~~ not care about the order, keep only one row and - finally - select and return only the SensorValue column.

So, it basically says the same as the first one, except that it says you want only one result that matches the conditions and you don't care which one.

Now, can we assume that it will give always the same result because of the clustered index?
- If it does use this clustered index every time, yes.

But will it always use it?
- Not necessarily.

Why not?
- Because it can. The query optimizer is free to choose a path of execution every time it runs a statement. Whatever path it sees fit at that time for that statement.

But isn't using the clustered index the best/fastest way to get results?
- No, not always. It might be the first time you run the query. The second time, it may use a cached result (if the DBMS has such a feature, not SQL Server*). The 1000th time the result may have been removed from the cache and another result may exist there. Say, you had executed this query just before:

SELECT TOP 1 SensorValue
  FROM SensorValues
  WHERE SensorId = 53
    AND DeviceId = 3819
    AND Date < 1339225010
  ORDER BY Date ASC ;         --- Notice the `ASC` here

and the cached result (from the above query) is another, different one that still matches your conditions but is not the first in your (wanted) ordering. And you have told the DBMS not to care about the order.

OK, so only cache can affect this?
- No, many other things, too.

other indexes were considered, at that time by the DBMS as better for this query.
a developer changed or completely removed this clustered index you had.
you or some other developer added another index that the optimizer decided it's more efficient to use than the CI.
you updated to a new version and the new optimizer has a minor bug or a change in how it ranks and chooses execution plans.
statistics were updated.
parallel execution was chosen instead.

* SQL Server does not cache query results, but the Enterprise Edition does have an Advanced Scanning feature which is kind of similar in that you may get different results because of concurrent queries. Not sure exactly when this kicks in though. (Thanks to @Martin Smith for the tip.)

I hope you are convinced that you should never rely on an SQL query returning results in a specific order, unless you specify so. And never use TOP (n) without ORDER BY, unless of course you just want n rows in the result and you don't care which ones are returned.
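
The difference between the ordered and unordered forms can be sketched with Python's sqlite3 module (a stand-in for SQL Server, with LIMIT 1 in place of TOP 1). The three sample rows and their SensorValue numbers are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SensorValues ("
    "SensorId INTEGER, DeviceId INTEGER, Date INTEGER, SensorValue REAL)"
)
# Made-up readings that all satisfy the WHERE clause below.
conn.executemany(
    "INSERT INTO SensorValues VALUES (53, 3819, ?, ?)",
    [(1339225000, 1.0), (1339225005, 2.0), (1339225009, 3.0)],
)

WHERE = "WHERE SensorId = 53 AND DeviceId = 3819 AND Date < 1339225010"


def first_value(conn, order_by=""):
    """Return one SensorValue, optionally under an explicit ORDER BY."""
    sql = f"SELECT SensorValue FROM SensorValues {WHERE} {order_by} LIMIT 1"
    return conn.execute(sql).fetchone()[0]


latest = first_value(conn, "ORDER BY Date DESC")   # well defined
earliest = first_value(conn, "ORDER BY Date ASC")  # well defined
unordered = first_value(conn)  # any matching row is a valid answer

print(latest, earliest, unordered)
```

Only the two ordered calls have a single correct answer. The unordered call is free to return any of the three matching values, and which one it returns can change with indexes, statistics or engine version.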
