Indexing for an ascending date value

indexperformance

Let's say you have a table of invoices

CREATE TABLE Invoice
(  InvoiceID int IDENTITY(1,1) PRIMARY KEY,
   InvoiceDate DateTime CONSTRAINT [df_invoicedate] DEFAULT GETDATE(),
   ....
)

The invoice table has an InvoiceDate which is always the current date/time when the record is inserted.

Let's say you index this date

CREATE INDEX idx_InvoiceDate ON Invoice(InvoiceDate)

Note this is a non-clustered index (held in a separate file).

Let's say we know the RDBMS uses B+tree format for index files.

Bearing in mind that all inserts will hit the same page of the index file (because they are always at the end of the sort order), and that inserts into a single location are worst-case for B+tree insert, what special considerations should be observed?

Will this index end up lopsided? Should it be reindexed regularly? Should I use a low, normal, or high fill factor?

Should we try to find something else to put into the index prior to the date, e.g. an invoice type, to get the tree to balance out more evenly?

Will there be issues with hotspots or index page contention, because all inserts hits the same part of the index? What can I do to mitigate that?

I understand that for an ascending surrogate key, a reverse key index might be better. But this is a date and I may want to perform scans of a date range. Is there anything I can do that will alleviate any of the issues that reverse key indices are meant to address?

Or is all of this a total non-issue, and indices of this kind are fine and I should stop worrying about it?

Best Answer

Will this index end up lopsided?
No. The way the BTree algorithm works keeps it balanced. (One interpretation of the "B" in BTree is for "balanced".) As leaf pages fill they cause parent pages to split, with half the rows going to each new parent. Further rows cause further splits, which cascade up the tree until the root has to split, at which point a new root page is created. At every stage all paths from root to leaf have the same length and hence "balanced".

Should it be reindexed regularly?
Yes. Each DBMS has its own idiosynchracies but likely regular re-build of indexes will be recommended. If for no other reason, the statistics will get out of date eventually which will result in less efficient query plans.

Should I use a low, normal, or high fill factor?
As all writes will be on the right-most pages leaving free space anywhere else will be a waste. There will never be any writes to use up that free space. Use a fill factor of 100%.

Should we try to find something else to put into the index
Indexes aren't (generally speaking) created for entertainment purposes. They're there to improve query performance or support constraint enforcement. If your workload would benefit from a <something> + date index then create that. If not, don't. Indexes require extra work during writes and consume space and maintenance time. Create the ones you need and no more. And no less. Keep in mind filtered indexes and included columns. Bear in mind that optimiser math may mean your indexes aren't used anyway.

Will there be issues with hotspots or index page contention .. What can I do to mitigate that?
Well, yes, in theory. First test and prove you'll have a problem meeting your expectied transaction-per-second count. If you can't, are you sure it's this index that's the limiting factor? If it is, in-memory tables and snapshot isolation may implement concurrency mechanisms different to the "base" system's, depending on DBMS. There may be alternative storage engines whose characteristics differ. Try those to see if the pain receeds. You can always post a follow-up question here with quantatative observations!

I understand that for an ascending surrogate key..
Surrogate keys are made-up values that replace human-intelligible natural keys. They're used for reasons releated to DBMS implementation peculiarities and are not inherent to the relational model. Your date is not a surrogate key - it is business data.

a reverse key index might be better.
Hmmm .. OK, so instead of writing all today's rows to key value 20161021 you'd write them to .. 12016102? How does that help? Hashing won't help either: to use the hashed values for lookup each date would have to produce the same hash so there's still a hotspot.

Or is all of this a total non-issue, and indices of this kind are fine and I should stop worrying about it?
Probably. The vast majority of indexes on most systems are like this. By-and-large they work just fine and the DBMS is able to process significant workloads on modest hardware. Indexing is one of those things you can (and should) tweak ad nauseam after go-live. Concentrate on delivering a normalised, feature-rich, debugged system. Tune it afterwards.