SQL Server – What happens during "big" inserts on a table with a clustered composite key

My SQL knowledge is limited, so the terms I'll be using are most likely not the right ones.

I have a table that will store test results for multiple locations.

The tests will be recorded in different databases at different locations (no network connection) and the "master" location will regularly "import" the test results from the other locations.

I plan to have a clustered composite primary key on the LocationId (int) and Date (datetime) columns, in that order. The reasoning is that it should keep all the results for a location together, and I will almost never query by date range alone, but rather by location and date range.

The row size will be 80 to 100 bytes and the number of test results should not exceed a few million. A typical "import" will insert 50 to 100 thousand results from another location.
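To make this concrete, the table I have in mind would look roughly like this (the names and extra columns are placeholders, not my real schema):

```sql
-- Rough sketch of the planned table; names and columns are placeholders
CREATE TABLE dbo.TestResults
(
    LocationId int            NOT NULL,  -- which location recorded the test
    Date       datetime       NOT NULL,  -- when the test was run
    Result     decimal(10, 4) NULL,      -- plus other columns, ~80-100 bytes per row
    CONSTRAINT PK_TestResults
        PRIMARY KEY CLUSTERED (LocationId, Date)
);
```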

What will happen during imports? Will SQL "move" the existing rows to maintain the clustering, or will it let the table become "fragmented"? Could this cause a big performance hit if the import is done one row at a time? Should I rather not bother with the ordering of the rows and just add an identity column as the primary key and an index on the Date column to help with my queries?

Best Answer

Holy cow, you've got a lot of questions in here. Let's break this down.

Q: Will SQL "move" the existing rows to maintain the clustering, or will it let the table become "fragmented"?

Think of a database as a collection of pages - literal pieces of paper laid out on your desk. Think of the table as a dictionary for now. If you wanted to add more words to the dictionary, you could add them in place as long as the pages had empty space.

When you first start out with an empty dictionary, this is relatively easy. But think about a mature dictionary with thousands of paper pages in it, all full.

When you want to add more words to that mature dictionary, odds are there isn't going to be any space left on the page. SQL Server will "split" the page - it will take a brand new page somewhere else (at the end of the dictionary, in this analogy) and move some of the words over onto that new page. The good news is that immediately after that split, there are now two half-empty pages - one in the middle of your dictionary and one tacked onto the end - both with space to add more words.

That only helps if the words you add next happen to belong on those particular pages, though. (This is why the way you load data becomes increasingly important.)
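If you're curious how torn-up the book is after a load, you can peek at the clustered index's fragmentation - a rough sketch, assuming the table from the question is named dbo.TestResults:

```sql
-- How out-of-order and how full are the clustered index's pages?
-- (assumes the current database and a table named dbo.TestResults)
SELECT ips.index_type_desc,
       ips.avg_fragmentation_in_percent,   -- pages out of logical order
       ips.avg_page_space_used_in_percent, -- how full each page is
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.TestResults'), NULL, NULL, 'SAMPLED') AS ips;
```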

Could this cause a big performance hit if the import is done one row at a time?

Forget the index for a second - adding data one row at a time is just plain inefficient regardless of the indexing structure. SQL Server is a set-based system - whenever you can work in sets, you probably should.
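So instead of looping over the import file and firing one INSERT per row, stage the file into a table first (BULK INSERT, bcp, SSIS, whatever you like) and load it in one statement. A sketch, where dbo.StagingResults is a hypothetical staging table:

```sql
-- One set-based insert instead of 50,000-100,000 single-row inserts.
-- dbo.StagingResults is a hypothetical table the import file was loaded into first.
INSERT INTO dbo.TestResults (LocationId, Date, Result)
SELECT s.LocationId, s.Date, s.Result
FROM dbo.StagingResults AS s
ORDER BY s.LocationId, s.Date;  -- feeding rows in key order tends to cut down on page splits
```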

What happens when I query the data?

You didn't ask this, but I'm asking it for you, hahaha.

Think back to the aftermath of our inserts. Now we've got a dictionary that's mostly ordered, but at a few points you'll have to jump to the back to read from a few other pages. If those pages are all cached in memory (RAM, the buffer pool, etc.), then the overhead just isn't going to be that large. Most memory access is random anyway - it's not like SQL Server stores your dictionary in memory in order.

On the other hand, if you need to fetch the data from conventional magnetic hard drives (spinning rust), then you can end up getting a bit of a performance benefit if that data is stored in order. The real design goal here, though, is to get the data from RAM instead of getting it from drives. The difference between defragmented data on disk versus fragmented data on disk is nowhere near as significant as the difference between getting it from disk versus getting it from RAM.
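For what it's worth, the query the asker cares about looks something like this (the values are made up). With the composite clustered key, all of these rows live on adjacent pages; with an identity key plus a Date index, SQL Server typically has to do more jumping around to piece the result together:

```sql
-- The typical query from the question: one location, one date range.
-- (LocationId and dates below are made-up example values)
SELECT *
FROM dbo.TestResults
WHERE LocationId = 42
  AND Date >= '2024-01-01'
  AND Date <  '2024-02-01';
```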

Should I rather not bother with the ordering of the rows and just add an identity column as the primary key and an index on the Date column to help with my queries?

Bingo: this is the difference between logical database design and physical database design. Programmers have to worry a lot about logical database design initially, but as long as your database is under, say, 100GB in size, you can fix the physical design in post, so to speak. Put an identity field on there for starters, cluster on it, and then after being live for a few months, revisit the index design to maximize performance.
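If you go that route, the starting point might look something like this - again, the names are placeholders, and the nonclustered index is exactly the part you'd expect to revisit once you see real query patterns:

```sql
-- Alternative design: identity clustered key, plus an index on Date as a first guess.
-- Names are placeholders; revisit the nonclustered index after a few months of real use.
CREATE TABLE dbo.TestResults
(
    TestResultId int IDENTITY(1, 1) NOT NULL,
    LocationId   int            NOT NULL,
    Date         datetime       NOT NULL,
    Result       decimal(10, 4) NULL,
    CONSTRAINT PK_TestResults PRIMARY KEY CLUSTERED (TestResultId)
);

CREATE NONCLUSTERED INDEX IX_TestResults_Date
    ON dbo.TestResults (Date);
```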

Now, having said that, once you're experienced with this type of decision-making, you'll be better equipped to guesstimate indexes right from the start. Even so, I usually don't put much thought into index design initially. Users never seem to query the data the way I would have expected.