Sql-server – Table design for efficiently retrieving “most recent row” per FK’d item whose PK is a GUID

clustered-indexsql server

Database is SQLAzure

I've a recent requirement to store incremental updates in status against an Order item, whose primary key is a GUID (stored as an nvarchar(36) :/ ). An OrderStatusLog entry might happen at any time, it would have minimally the OrderId, ChangeDateTime, and the new Status.

There may be hundreds of status updates per order (but realistically, probably only tens) and they're all kept for audit reasons so that at any point the questions "what were the status changes over time?" and "what was the order status at time X?" can be answerd.

It will also be queried, potentially several hundred times (but again, most likely tens), what the current (i.e the most recent row) status is. Generally the chances that a recent Order will be queried are much higher than an old order, which is where I think the Order Id being a GUID might be a nuisance, because there isn't an innate order to them (nor can I change them to be sequential) so status change records end up scattered all over the storage regardless of when they were created

I'm not so interested to know the date the change occurred for the frequent question of "what is the most recent status?" and auto numbering doesn't have to start with 1 for every different order, so the Revision column could just be a single auto inc int and it could answer the "what is the most recent row for OrderId X?". For the infrequent question of "what were all the changes in order?" an autoinc answers it too, and for or "what was the status on date X?" the date is useful, but this will be seldom asked

Does the OrderId being a GUID present a problem no matter what? Is it still worth creating a table that has a clustering index of OrderId (guid) and e.g. Revision (an auto incrementing int) as its key columns (in that order) – is it the ideal arrangement for a historical log of status changes such as this? Or should I not be trying to influence storage order, and instead be saying "well, the likely number of change records will be small, and retrieving and sorting them all every time they're needed is likely fairly inconsequential" and just nonclustering the OrderId and including the date(or revision) and status?

Note; I can't put the current status on the Order table alas (not my restriction)

Edit: some extra info on the tables:

Order                        --stores orders
Id NVARCHAR(36) PRIMARY KEY, --guid stored as string (yes.. :/ )
... other fields

OrderStatusLog               --tracks all the changes an order went through. An INSERT-only (and SELECT) table
OrderId UNIQUEIDENTIFIER,    --references Oder(id) but not an FK officially because of the type mismatch
ChangeDate DATETIME,         --the time of the change in status
Status                       --the status as of time ChangeDate

I'll frequently want to (100s of times a day):

SELECT TOP 1 Status FROM OrderStatusLog WHERE OrderId = @id ORDER BY ChangeDate DESC

Infrequently (10s a day):

SELECT TOP 1 Status FROM OrderStatusLog WHERE OrderId = @id AND ChangeDate < @d ORDER BY ChangeDate DESC

Seldom (1 a day):

SELECT * FROM OrderStatusLog WHERE OrderId = @id ORDER BY ChangeDate DESC

99% of the time these queries would be operating on rows inserted within the past 24 hours.


I'm trying to decide how to index this log table given that it has a guid. My understanding of indexing in SQLS is that a clustered index influences the order of storage of rows on disk, and using GUIDs causes a bit of a nuisance with reoganizing because a GUID may naturally want to sit in the middle of existing data, meaning that data pages end up splitting. I'd like to know the implications of resource use of clustering vs nonclustering; if one indexing strategy means that inserts are more costly as (on average) they cause more data to be shuffled around disk but it makes selecting the data faster/lower cost, versus another strategy wher the insert might be cheap but the cost of selecting be much higher.

At this time the use case is just for those shown; no other uses are planned

This log table doesn't have, or seemingly need, a primary key and though infeasible that two updates would happen in the same millisecond (or whatever) as a more generic solution I'd like to know how to control for situations where updates could come in the same millisecond (i.e. I'm thinking an autoinc int column could help satisfy ORDER BY x DESC for "most recent" and then the "could be simultaneous" is moot)


As I've often advised others, I've looked at doing some benching of this. I created 3 tables of identical columns, made:

  • one a clustered on OrderId+ChangeDate+Status,
  • one a nonclustered unique on OrderId+ChangeDate include Status, and
  • one a nonclustered nonunique OrderId include ChangeDate+Status

I filled each table with the same set of 100,000 records, of approx 10,000 distinct guids and ~10 status changes over a random time frame. Table load performance was roughly the same – about 33 seconds on my local SQL Express from a C# app.

I ran 100,000 iterations of the frequent query against each table, hitting every Guid in each table 10 times with runtimes of 23 seconds (clustered), 22 seconds (nonclustered OI-CD include S) and 28 seconds (nonclustered OI include CD-S).

Index and table storage stats:

enter image description here

enter image description here

After swapping the clustered unique index on all table cols out for a clustered nonunique on OrderId+ChangeDate, and swapping the nonclustered unique on OI+CD out to be unique (and truncating all tables, backing up db, then re-loading it)

enter image description here

And plans:

enter image description here

enter image description here

enter image description here

Best Answer

Slightly different (but probably materially the same) suggestion from Akina would be to index on (OrderId, ChangeDateTime) INCLUDE (Status) since you're only selecting Status never filtering on it directly. This would service queries of the kind:

SELECT TOP 1 status
FROM sourcetable
WHERE orderid = @GUID
-- AND ChangeDateTime <= @ChangeDateTime
ORDER BY ChangeDateTime DESC

GUIDs are obviously not the most ideal column to index on because of their data size, and more so because of their randomness resulting in higher fragmentation, but if it's all you have to work with then it's better to index it than not. You'll still see improved query performance with the index on it.

There's probably going to be no difference by adding an INY IDENTITY column to use in your index vs using the existing DATETIME field, as they both take up the same storage anyway.

There's also not much inherent difference between CLUSTERED and NONCLUSTERED indexing in your case, but if you have the luxury, and your examples are the most common type of querying that will be done, then go with the CLUSTERED index as that will provide you greater flexibility in the future with minimizing performance hits should you end up needing to SELECT other fields.

The exact index recommendation in that case would then be CREATE UNIQUE CLUSTERED INDEX IX_IndexName ON TableName (OrderId, ChangeDateTime). You're able to leverage the UNIQUE clause here since your GUID guarantees it, which can slightly help performance of the index as well.