OK, I am making a lot of assumptions (INT instead of VARCHAR(50) being one of them) with this answer, so feel free to correct me if needed. The problem with Option B is that it introduces a new join to relate Users to Alerts without any real added benefit. If you are joining on UserID, it is best to index UserID so your joins can use index seeks.
For Option A, UserID will be the clustering key (the index key of the clustered index) on the Users table, and a nonclustered index key on the Alerts table. Since a uniqueidentifier is 16 bytes, this will cost 16 bytes per Alert.
For Option B, UserID will still be the clustering key on the Users table. UserID will probably be the clustering key on UserMap too, to make joining more efficient. UserKey (assuming it is an INT) would then be the nonclustered index key on the Alerts table. This will cost 4 bytes per Alert, plus 20 bytes (16-byte UserID + 4-byte UserKey) per UserMap row.
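To make that concrete, here's a rough sketch of the two layouts. Everything beyond the Users/Alerts/UserMap/UserID/UserKey names from the question - the AlertID surrogate key, the constraints, Option B's table name - is an assumption on my part:

-- Option A: Alerts carries the 16-byte uniqueidentifier directly.
CREATE TABLE Users
(
    UserID uniqueidentifier NOT NULL,
    CONSTRAINT PK_Users PRIMARY KEY CLUSTERED (UserID)
);

CREATE TABLE Alerts
(
    AlertID int IDENTITY(1,1) NOT NULL,        -- assumed surrogate key
    UserID  uniqueidentifier NOT NULL,         -- 16 bytes per Alert
    CONSTRAINT PK_Alerts PRIMARY KEY CLUSTERED (AlertID),
    CONSTRAINT FK_Alerts_Users FOREIGN KEY (UserID)
        REFERENCES Users (UserID)
);
CREATE NONCLUSTERED INDEX IX_Alerts_UserID ON Alerts (UserID);

-- Option B: UserMap translates the uniqueidentifier to an INT,
-- and Alerts (shown here as Alerts_OptionB) stores only the INT.
CREATE TABLE UserMap
(
    UserID  uniqueidentifier NOT NULL,         -- 16 bytes
    UserKey int IDENTITY(1,1) NOT NULL,        -- + 4 bytes = 20 per row
    CONSTRAINT PK_UserMap PRIMARY KEY CLUSTERED (UserID),
    CONSTRAINT UQ_UserMap_UserKey UNIQUE (UserKey)
);

CREATE TABLE Alerts_OptionB
(
    AlertID int IDENTITY(1,1) NOT NULL,
    UserKey int NOT NULL,                      -- 4 bytes per Alert
    CONSTRAINT PK_Alerts_OptionB PRIMARY KEY CLUSTERED (AlertID),
    CONSTRAINT FK_Alerts_OptionB_UserMap FOREIGN KEY (UserKey)
        REFERENCES UserMap (UserKey)
);
CREATE NONCLUSTERED INDEX IX_Alerts_OptionB_UserKey
    ON Alerts_OptionB (UserKey);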
Looking at the big picture: one relationship under Option A costs 16 bytes of storage and involves 1 join operation, whereas one relationship under Option B costs 24 bytes of storage (4 + 20) and involves 2 join operations.
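At query time, the extra hop shows up like this (using the sketch tables above):

DECLARE @UserID uniqueidentifier = NEWID();  -- some user's ID

-- Option A: one join, a straight seek on IX_Alerts_UserID.
SELECT a.AlertID
FROM Users AS u
JOIN Alerts AS a ON a.UserID = u.UserID
WHERE u.UserID = @UserID;

-- Option B: two joins, routed through UserMap.
SELECT a.AlertID
FROM Users AS u
JOIN UserMap AS m        ON m.UserID = u.UserID
JOIN Alerts_OptionB AS a ON a.UserKey = m.UserKey
WHERE u.UserID = @UserID;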
Furthermore, there are 340,282,366,920,938,463,463,374,607,431,768,211,456 (2^128) possible uniqueidentifiers and only 4,294,967,296 (2^32) INTs. Implementing a uniqueidentifier-to-INT map for this type of relationship could cause unexpected results when you start reusing INTs.
The only reason for creating this type of map table is if you plan on creating a Many-to-Many relationship between Users and Alerts.
Taking all of this into consideration, I would recommend Option A.
I hope this helps,
Matt
Holy cow, you've got a lot of questions in here. Let's break this down.
Q: Will SQL "move" the existing rows to maintain the clustering, or will it let the table become "fragmented"?
Think of a database as a collection of pages - literal pieces of paper laid out on your desk. Think of those pages as a dictionary for now. If you wanted to add more words to the dictionary, you could add them in place if the pages had empty space.
When you first start out with an empty dictionary, this is relatively easy. But think about a mature dictionary with thousands of paper pages in it, all full.
When you want to add more words to that mature dictionary, odds are there isn't going to be any space left on the page. SQL Server will "tear" the page - this is called a page split - taking a brand new page at the end of the dictionary and moving some of the words over onto it. The good news is that immediately after that action, there's now a half-empty page at the end of your dictionary and another in the middle, both with space to add words.
If you happen to be adding them in that order, that is. (This is why the way you load data becomes increasingly important.)
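If you want to see how torn up a table has already become, sys.dm_db_index_physical_stats will show you - dbo.YourTable below is just a placeholder name:

-- Check how fragmented the indexes on a table have become.
SELECT  i.name                           AS index_name,
        ps.avg_fragmentation_in_percent, -- out-of-order pages
        ps.avg_page_space_used_percent,  -- how full each page is
        ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.YourTable'), NULL, NULL, 'SAMPLED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;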
Q: Could this cause a big performance hit if the import is done one row at a time?
Forget the index for a second - adding data one row at a time is just plain inefficient regardless of the indexing structure. SQL Server is a set-based system - whenever you can work in sets, you probably should.
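For example, here's a rough sketch of the two approaches, using assumed names (dbo.Readings as the target, dbo.ReadingsStaging as a staging source):

-- Row-by-row ("RBAR"): each statement pays its own logging, locking,
-- and index-maintenance overhead. Avoid this for bulk loads.
DECLARE @i int = 1;
WHILE @i <= 100000
BEGIN
    INSERT INTO dbo.Readings (ReadingDate, Payload)
    VALUES (SYSDATETIME(), 'row ' + CAST(@i AS varchar(10)));
    SET @i += 1;
END;

-- Set-based: one statement, one plan, the whole set at once.
INSERT INTO dbo.Readings (ReadingDate, Payload)
SELECT ReadingDate, Payload
FROM dbo.ReadingsStaging;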
Q: What happens when I query the data?
You didn't ask this, but I'm asking it for you, hahaha.
Think back to the aftermath of our inserts. Now we've got a dictionary that's mostly ordered, but at a few points you'll have to jump to the back to read from a few other pages. If those pages are all cached in memory (RAM, the buffer pool), then the overhead just isn't going to be that large. Most memory access is random anyway - it's not like SQL Server stores your dictionary in memory in order.
On the other hand, if you need to fetch the data from conventional magnetic hard drives (spinning rust), then you can end up getting a bit of a performance benefit if that data is stored in order. The real design goal here, though, is to get the data from RAM instead of getting it from drives. The difference between defragmented data on disk versus fragmented data on disk is nowhere near as significant as the difference between getting it from disk versus getting it from RAM.
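If you're curious how much of a table is actually being served from RAM, the buffer pool will tell you - this is the usual sys.dm_os_buffer_descriptors pattern, run in the context of your database:

-- Roughly how many megabytes of each object are cached in the buffer pool.
SELECT OBJECT_NAME(p.object_id) AS object_name,
       COUNT(*) * 8 / 1024      AS cached_mb   -- 8KB pages to MB
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
  ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
  ON p.hobt_id = au.container_id
 AND au.type IN (1, 3)          -- in-row and row-overflow data
WHERE bd.database_id = DB_ID()
GROUP BY p.object_id
ORDER BY cached_mb DESC;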
Q: Should I rather just not bother with the ordering of the rows and just add an identity column as the primary key and an index on the Date column to help with my queries?
Bingo: this is the difference between logical database design and physical database design. Programmers have to worry a lot about logical database design up front, but as long as your database is under, say, 100GB in size, you can fix the physical design in post, so to speak. Put an identity field on there for starters, cluster on it, and then after being live for a few months, revisit the index design to maximize performance.
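A minimal sketch of that starting point - dbo.Readings and its columns are names I'm assuming here, the pattern is what matters:

-- Cluster on a narrow, ever-increasing identity key so inserts
-- always land at the end of the table (no mid-table page splits).
CREATE TABLE dbo.Readings
(
    ReadingID   int IDENTITY(1,1) NOT NULL,
    ReadingDate datetime2(0)      NOT NULL,
    Payload     varchar(100)      NULL,
    CONSTRAINT PK_Readings PRIMARY KEY CLUSTERED (ReadingID)
);

-- A nonclustered index on the date column supports range queries;
-- revisit it once you've seen how users actually query the data.
CREATE NONCLUSTERED INDEX IX_Readings_ReadingDate
    ON dbo.Readings (ReadingDate);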
Now, having said that, once you're experienced with this type of decision-making, you'll be better equipped to guesstimate indexes right from the start. Even so, I usually don't put much thought into index design initially. Users never seem to query the data the way I would have expected.
Best Answer
I'm not sure why you need a recursive CTE. If you just need to split the data into two tables based on a date range, you can do something like the following:
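(A sketch - dbo.SourceData, dbo.Archive, dbo.Recent, the column list, and the cutoff date are all assumptions; swap in your own names.)

DECLARE @Cutoff date = '2013-01-01';   -- assumed boundary date

-- Rows before the cutoff go to the first table...
INSERT INTO dbo.Archive (ID, RecordDate, Payload)
SELECT ID, RecordDate, Payload
FROM dbo.SourceData
WHERE RecordDate < @Cutoff;

-- ...and rows on or after the cutoff go to the second table.
INSERT INTO dbo.Recent (ID, RecordDate, Payload)
SELECT ID, RecordDate, Payload
FROM dbo.SourceData
WHERE RecordDate >= @Cutoff;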