This isn't fragmentation.
Fragmentation is generated, of course, but deletes simply create "islands" of remaining pages, which is less evil than the fragmentation caused by GUID/clustered-key INSERTs. If your PK is an IDENTITY, then CreationDate should roughly track it, so you're actually deleting chunks of contiguous rows anyway.
- Do you have an index on CreationDate?
- Do you have delete cascades?
- Is the TOP 1000 in a single transaction?
For point 3, doing the whole loop inside a single transaction is pointless: is that what's happening?
At some point, a statistics update may be needed if you delete enough rows but I don't think it's that.
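If the loop is currently wrapped in one big transaction, a batched pattern like the following keeps each batch in its own short transaction so the log can truncate between batches. This is a sketch; `dbo.Stuff`, `CreationDate` and the one-year cutoff are placeholder names, not taken from your schema:

```sql
-- Delete in batches of 1000; each DELETE runs in its own implicit
-- transaction, so locks and log growth stay bounded per batch.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM dbo.Stuff
    WHERE CreationDate < DATEADD(YEAR, -1, GETUTCDATE());
    SET @rows = @@ROWCOUNT;  -- 0 when nothing left to delete
END;
```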
Other options:
- Why not use TRUNCATE TABLE, wrapped in a stored procedure with EXECUTE AS OWNER?
- Use SYNONYMs for poor man's partitioning.
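The TRUNCATE idea works because TRUNCATE TABLE requires ALTER permission, which you usually don't want to grant to application logins; EXECUTE AS OWNER lets the procedure run with the owner's rights. A minimal sketch, with `dbo.Stuff` and `SomeAppRole` as placeholder names:

```sql
-- Wrap TRUNCATE so low-privilege callers can run it without ALTER rights.
CREATE PROCEDURE dbo.TruncateStuff
WITH EXECUTE AS OWNER
AS
BEGIN
    TRUNCATE TABLE dbo.Stuff;
END;
GO
-- Grant only EXECUTE on the procedure, not ALTER on the table.
GRANT EXECUTE ON dbo.TruncateStuff TO SomeAppRole;
```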
The way this is designed, you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys: they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()), which leads to index fragmentation, which in turn means wasted space, slower insert performance and slower query performance through more I/O.
On the other hand, if your normalized tables are linked by such a GUID, then every join depends on it and you will have to bite the bullet and use it as the primary key with a clustered index anyway. Just make sure you can define PAD_INDEX = ON and FILLFACTOR = 50 on that clustered index to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.
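As a sketch of what that looks like (table, column and constraint names are placeholders): current SQL Server versions accept the index options directly on the PRIMARY KEY constraint, and the same options are repeated at rebuild time.

```sql
-- GUID primary key with a padded clustered index to delay fragmentation.
ALTER TABLE dbo.Products
ADD CONSTRAINT PK_Products PRIMARY KEY CLUSTERED (id)
WITH (PAD_INDEX = ON, FILLFACTOR = 50);

-- Periodic maintenance: rebuild restores the free space in each page.
ALTER INDEX PK_Products ON dbo.Products REBUILD
WITH (PAD_INDEX = ON, FILLFACTOR = 50);
```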
Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book where each entry is given a random or running id, and the book is sorted by that id plus the name. Have fun searching for a given name in that. A usable index must start with the column that is used in the WHERE or JOIN clause.
So, with the clustered indexes created so far you cover queries of the type
SELECT p.productname, s.name as StoreName
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
The query runs through the products, can efficiently look up the store ids, and has immediate access to the store name, since the clustered index on the store id carries all of the row's columns.
Now you want to do this:
SELECT p.productname, s.name as StoreName
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'
For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).
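A sketch of that index, assuming the same placeholder table and column names as the queries above:

```sql
-- Supports the LIKE 'A%' range scan on productname; including storeid
-- makes the index covering for the join, avoiding a key lookup.
CREATE NONCLUSTERED INDEX IX_Products_productname
ON dbo.Products (productname)
INCLUDE (storeid);
```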
OK, what about the reverse case?
SELECT p.productname, s.name as StoreName
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'
For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.
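The two additional indexes described above would look roughly like this (index names are placeholders):

```sql
-- Locate the store row by name.
CREATE NONCLUSTERED INDEX IX_Store_name
ON dbo.Store (name);

-- Find all products belonging to a given store; the clustered key (the
-- product id) is carried in this index automatically.
CREATE NONCLUSTERED INDEX IX_Products_storeid
ON dbo.Products (storeid);
```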
I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).
The row-columns are completely useless in this scenario; just drop them to save space.
Using client-generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring about the correct INSERT order and without having to read database-generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (16 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index, which is never good.
Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.
Best Answer
Hopefully with that size you are on Enterprise edition? If not, good luck :) (just kidding; see the bottom of this answer)
If you are on Enterprise, I would suggest looking into partitioning. This has saved my butt a time or two when dealing with this same kind of scenario. Create your partitions on the date field; you will have to determine how big you want those partitions to be. I have done one partition per day, and I've also done one partition per month. It all depends on how much history you need (remember there is a fixed limit on the number of partitions per table).
If you are limiting the partition to a day AND your queries are only for a specific day, the optimizer should be able to use partition elimination to pick just the partitions it needs for your query. Also, when the time comes to purge old data after several years, the sliding-window pattern (partition switching) makes it really easy/quick to purge data (rather than DELETE statements). The same goes for archiving that old data to another table completely.
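A minimal sketch of a monthly setup, assuming a table called `dbo.Stuff` with a `dt` date column (all names and boundary dates here are placeholders; note that since SQL Server 2016 SP1, partitioning is available in every edition, not just Enterprise):

```sql
-- Monthly boundaries; add new boundaries over time with SPLIT RANGE.
CREATE PARTITION FUNCTION pfStuffByMonth (datetime2)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME psStuffByMonth
AS PARTITION pfStuffByMonth ALL TO ([PRIMARY]);

-- The partitioning column must be part of the clustered index key.
CREATE TABLE dbo.Stuff
(
    id      bigint IDENTITY NOT NULL,
    dt      datetime2 NOT NULL,
    payload varchar(100) NULL,
    CONSTRAINT PK_Stuff PRIMARY KEY CLUSTERED (id, dt)
) ON psStuffByMonth (dt);

-- Purging a month is then a metadata-only operation, e.g.:
-- ALTER TABLE dbo.Stuff SWITCH PARTITION 2 TO dbo.Stuff_Purge;
```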
Non-Enterprise ideas:
Speaking of archiving to another table, that would help as well and not require Enterprise. If this table is only used for reads for anything older than a month or a year, then you can:
1) Create a second table named Stuff_Archive
2) Move everything older than a month or a year (your preference)
3) Rename your current Stuff table to Stuff_Current
4) Create a view named Stuff that unions Stuff_Current and Stuff_Archive. This way any application that uses Stuff will still be able to read from both, and queries that only need recent data can go straight to Stuff_Current.
One thing that may help even further is to add a constraint that you change each time you move records to Stuff_Archive (I have not tested this myself, but I plan on trying it on a DB I currently need to maintain like this). This would be a CHECK constraint on the date field, so the optimizer knows: "even though I'm unioning these two tables, I only need to look at table A for dates between X and Y, and table B for dates between Y and Z." Column statistics alone don't give that guarantee, but a trusted CHECK constraint does; this is the classic "partitioned view" technique, and it lets the optimizer eliminate the table whose range cannot match.
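A sketch of that partitioned-view arrangement (table names follow the steps above; the boundary date is a placeholder you would move each time you archive):

```sql
-- Trusted CHECK constraints on the date column; WITH CHECK makes SQL
-- Server validate existing rows so the constraints are trusted.
ALTER TABLE dbo.Stuff_Archive WITH CHECK
ADD CONSTRAINT CK_Stuff_Archive_dt CHECK (dt < '2024-01-01');

ALTER TABLE dbo.Stuff_Current WITH CHECK
ADD CONSTRAINT CK_Stuff_Current_dt CHECK (dt >= '2024-01-01');
GO
-- The view unions both tables; with trusted constraints the optimizer
-- can skip whichever table a query's date range cannot touch.
CREATE VIEW dbo.Stuff
AS
SELECT * FROM dbo.Stuff_Current
UNION ALL
SELECT * FROM dbo.Stuff_Archive;
```

When you move rows to the archive, drop and recreate both constraints with the new boundary date.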
And the other non-Enterprise idea would be to create a nonclustered index on the date field, or even better (if you can fit it into a maintenance window) a filtered index where dt >= some date in the past and dt <= some date in the future. That way any queries you write will be able to use that filtered index, and it will only be as big as the data within that date range (rather than all the history in your ever-growing table). In other words: if you create a filtered index for a week, you are only dealing with 500k * 7 = 3.5M records rather than 5 billion. You just have to rebuild that index often enough to keep the window described in the filtered WHERE criteria within the timeframe of your queries. So if you can rebuild it daily and you only need yesterday's data, great. But if you can only rebuild it on the weekend, you may need to make sure stats are kept up to date throughout the week.
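A sketch of such a filtered index, with placeholder names and dates; note the filter predicate must use constant literals, which is why the index has to be recreated on a schedule to slide the window forward:

```sql
-- Rolling one-week window; only rows inside the range are indexed,
-- so the index stays small regardless of total table size.
CREATE NONCLUSTERED INDEX IX_Stuff_dt_recent
ON dbo.Stuff (dt)
WHERE dt >= '2024-01-01' AND dt < '2024-01-08';
```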