SQL Server – What happens during "big" inserts on a table with a clustered composite key

My SQL knowledge is limited, so the terms I'll be using are most likely not the right ones.

I have a table that will store test results for multiple locations.

The tests will be recorded in different databases at different locations (no network connection) and the "master" location will regularly "import" the test results from the other locations.

I plan to have a clustered composite primary key on the LocationId (int) and Date (datetime) columns, in that order. The reasoning is that it should keep all the results for a location together, and I will almost never query by date range alone, but rather by location and date range.

The row size will be 80 to 100 bytes and the number of test results should not exceed a few million. A typical "import" will insert 50 to 100 thousand results from another location.
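To make this concrete, the table I have in mind would look roughly like this (the names and extra columns are placeholders, not my real schema):

```sql
-- Rough sketch of the planned table; names and columns are placeholders
CREATE TABLE dbo.TestResults
(
    LocationId int            NOT NULL,  -- which location recorded the test
    Date       datetime       NOT NULL,  -- when the test was run
    Result     decimal(10, 4) NULL,      -- plus other columns, ~80-100 bytes per row
    CONSTRAINT PK_TestResults
        PRIMARY KEY CLUSTERED (LocationId, Date)
);
```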

What will happen during imports? Will SQL "move" the existing rows to maintain the clustering, or will it let the table become "fragmented"? Could this cause a big performance hit if the import is done one row at a time? Should I rather not bother with the ordering of the rows and just add an identity column as the primary key and an index on the Date column to help with my queries?

Best Answer

Holy cow, you've got a lot of questions in here. Let's break this down.

Q: Will SQL "move" the existing rows to maintain the clustering, or will it let the table become "fragmented"?

Think of a database as a collection of pages - literal pieces of paper laid out on your desk. Think of the table as a dictionary for now. If you wanted to add more words to the dictionary, you could add them in place as long as the pages had empty space.

When you first start out with an empty dictionary, this is relatively easy. But think about a mature dictionary with thousands of paper pages in it, all full.

When you want to add more words to that mature dictionary, odds are there isn't going to be any space left on the page. SQL Server will "split" the page - it will take a brand new page somewhere else (at the end of the dictionary, in this analogy) and move some of the words over onto that new page. The good news is that immediately after that split, there are now two half-empty pages - one in the middle of your dictionary and one tacked onto the end - both with space to add more words.

That only helps if the words you add next happen to belong on those particular pages, though. (This is why the way you load data becomes increasingly important.)
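If you're curious how torn-up the book is after a load, you can peek at the clustered index's fragmentation - a rough sketch, assuming the table from the question is named dbo.TestResults:

```sql
-- How out-of-order and how full are the clustered index's pages?
-- (assumes the current database and a table named dbo.TestResults)
SELECT ips.index_type_desc,
       ips.avg_fragmentation_in_percent,   -- pages out of logical order
       ips.avg_page_space_used_in_percent, -- how full each page is
       ips.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.TestResults'), NULL, NULL, 'SAMPLED') AS ips;
```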

Could this cause a big performance hit if the import is done one row at a time?

Forget the index for a second - adding data one row at a time is just plain inefficient regardless of the indexing structure. SQL Server is a set-based system - whenever you can work in sets, you probably should.
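So instead of looping over the import file and firing one INSERT per row, stage the file into a table first (BULK INSERT, bcp, SSIS, whatever you like) and load it in one statement. A sketch, where dbo.StagingResults is a hypothetical staging table:

```sql
-- One set-based insert instead of 50,000-100,000 single-row inserts.
-- dbo.StagingResults is a hypothetical table the import file was loaded into first.
INSERT INTO dbo.TestResults (LocationId, Date, Result)
SELECT s.LocationId, s.Date, s.Result
FROM dbo.StagingResults AS s
ORDER BY s.LocationId, s.Date;  -- feeding rows in key order tends to cut down on page splits
```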

What happens when I query the data?

You didn't ask this, but I'm asking it for you, hahaha.

Think back to the aftermath of our inserts. Now we've got a dictionary that's mostly ordered, but at a few points you'll have to jump to the back to read from a few other pages. If those pages are all cached in memory (RAM, the buffer pool, etc.), then the overhead just isn't going to be that large. Most memory access is random anyway - it's not like SQL Server stores your dictionary in memory in order.

On the other hand, if you need to fetch the data from conventional magnetic hard drives (spinning rust), then you can end up getting a bit of a performance benefit if that data is stored in order. The real design goal here, though, is to get the data from RAM instead of getting it from drives. The difference between defragmented data on disk versus fragmented data on disk is nowhere near as significant as the difference between getting it from disk versus getting it from RAM.
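For what it's worth, the query the asker cares about looks something like this (the values are made up). With the composite clustered key, all of these rows live on adjacent pages; with an identity key plus a Date index, SQL Server typically has to do more jumping around to piece the result together:

```sql
-- The typical query from the question: one location, one date range.
-- (LocationId and dates below are made-up example values)
SELECT *
FROM dbo.TestResults
WHERE LocationId = 42
  AND Date >= '2024-01-01'
  AND Date <  '2024-02-01';
```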

Should I rather not bother with the ordering of the rows and just add an identity column as the primary key and an index on the Date column to help with my queries?

Bingo: this is the difference between logical database design and physical database design. Programmers have to worry a lot about logical database design initially, but as long as your database is under, say, 100GB in size, you can fix the physical design in post, so to speak. Put an identity field on there for starters, cluster on it, and then after being live for a few months, revisit the index design to maximize performance.
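If you go that route, the starting point might look something like this - again, the names are placeholders, and the nonclustered index is exactly the part you'd expect to revisit once you see real query patterns:

```sql
-- Alternative design: identity clustered key, plus an index on Date as a first guess.
-- Names are placeholders; revisit the nonclustered index after a few months of real use.
CREATE TABLE dbo.TestResults
(
    TestResultId int IDENTITY(1, 1) NOT NULL,
    LocationId   int            NOT NULL,
    Date         datetime       NOT NULL,
    Result       decimal(10, 4) NULL,
    CONSTRAINT PK_TestResults PRIMARY KEY CLUSTERED (TestResultId)
);

CREATE NONCLUSTERED INDEX IX_TestResults_Date
    ON dbo.TestResults (Date);
```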

Now, having said that, once you're experienced with this type of decision-making, you'll be better equipped to guesstimate indexes right from the start. Even so, I usually don't put much thought into index design initially. Users never seem to query the data the way I would have expected.