This isn't fragmentation.
Fragmentation is generated, of course, but deletes simply create "islands" of remaining pages, which is less evil than the fragmentation caused by GUID/clustered-key INSERTs. If your PK is an IDENTITY, then CreationDate should roughly track it, so you're actually deleting chunks of contiguous rows anyway.
- Do you have an index on CreationDate?
- Do you have delete cascades?
- Is the TOP 1000 in a single transaction?
For point 3, doing the whole loop inside a single transaction is pointless: is that what's happening?
At some point, a statistics update may be needed if you delete enough rows but I don't think it's that.
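If the loop is currently wrapped in one big transaction, a batched pattern like the following keeps each batch in its own short transaction so the log can truncate between batches. This is a sketch; `dbo.Stuff`, `CreationDate` and the one-year cutoff are placeholder names, not taken from your schema:

```sql
-- Delete in batches of 1000; each DELETE runs in its own implicit
-- transaction, so locks and log growth stay bounded per batch.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM dbo.Stuff
    WHERE CreationDate < DATEADD(YEAR, -1, GETUTCDATE());
    SET @rows = @@ROWCOUNT;  -- 0 when nothing left to delete
END;
```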
Other options:
- Why not use TRUNCATE TABLE, wrapped in a stored procedure with EXECUTE AS OWNER?
- Use SYNONYMs for poor man's partitioning.
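The TRUNCATE idea works because TRUNCATE TABLE requires ALTER permission, which you usually don't want to grant to application logins; EXECUTE AS OWNER lets the procedure run with the owner's rights. A minimal sketch, with `dbo.Stuff` and `SomeAppRole` as placeholder names:

```sql
-- Wrap TRUNCATE so low-privilege callers can run it without ALTER rights.
CREATE PROCEDURE dbo.TruncateStuff
WITH EXECUTE AS OWNER
AS
BEGIN
    TRUNCATE TABLE dbo.Stuff;
END;
GO
-- Grant only EXECUTE on the procedure, not ALTER on the table.
GRANT EXECUTE ON dbo.TruncateStuff TO SomeAppRole;
```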
The way this is designed, you only have suboptimal choices. Random GUIDs are not well suited as clustered index keys: they are neither small (which affects the size of all secondary indexes) nor sequential (unless you can use NEWSEQUENTIALID()), which leads to index fragmentation, which in turn means wasted space, slower insert performance and slower query performance through more I/O.
On the other hand, if your normalized tables are linked by such a GUID, then every join depends on it and you will have to bite the bullet and use it as the primary key with a clustered index anyway. Just make sure you can define PAD_INDEX = ON and FILLFACTOR = 50 on that clustered index to slow down the fragmentation somewhat. Still, expect to do regular, expensive index REBUILDs to reduce the inevitable fragmentation.
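As a sketch of what that looks like (table, column and constraint names are placeholders): current SQL Server versions accept the index options directly on the PRIMARY KEY constraint, and the same options are repeated at rebuild time.

```sql
-- GUID primary key with a padded clustered index to delay fragmentation.
ALTER TABLE dbo.Products
ADD CONSTRAINT PK_Products PRIMARY KEY CLUSTERED (id)
WITH (PAD_INDEX = ON, FILLFACTOR = 50);

-- Periodic maintenance: rebuild restores the free space in each page.
ALTER INDEX PK_Products ON dbo.Products REBUILD
WITH (PAD_INDEX = ON, FILLFACTOR = 50);
```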
Your secondary indexes must not start with the id, because that renders them useless! Imagine a telephone book where each entry is given a random or running id, and the book is sorted by that id plus the name. Have fun searching for a given name in that. A usable index must start with the column that is used in the WHERE or JOIN clause.
So, with the clustered indexes created so far you cover queries of the type
SELECT p.productname, s.name as StoreName
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
The query runs through the products, can efficiently look up the store ids, and has immediate access to the store name, since the clustered index on the store id carries all of the row's columns.
Now you want to do this:
SELECT p.productname, s.name as StoreName
FROM Products p
INNER JOIN Store s ON p.storeid = s.id
WHERE p.productname LIKE 'A%'
For this you need a nonclustered index with just productname as the key column (and optionally storeid as included column, if you do frequent range searches on productname).
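A sketch of that index, assuming the same placeholder table and column names as the queries above:

```sql
-- Supports the LIKE 'A%' range scan on productname; including storeid
-- makes the index covering for the join, avoiding a key lookup.
CREATE NONCLUSTERED INDEX IX_Products_productname
ON dbo.Products (productname)
INCLUDE (storeid);
```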
OK, what about the reverse case?
SELECT p.productname, s.name as StoreName
FROM Store s
INNER JOIN Products p ON p.storeid = s.id
WHERE s.name = 'My little cornershop'
For this, you need two additional indexes: One nonclustered on store with the name column and one nonclustered on products with storeid as the column. SQL Server can efficiently find the store record (expecting only one record), then through the second index can find all product entries for this store (still only a few compared to all entries in product), then for each of these products go through the clustered index (the clustered index key is automatically part of each nonclustered index) to get to the productname column.
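The two additional indexes described above would look roughly like this (index names are placeholders):

```sql
-- Locate the store row by name.
CREATE NONCLUSTERED INDEX IX_Store_name
ON dbo.Store (name);

-- Find all products belonging to a given store; the clustered key (the
-- product id) is carried in this index automatically.
CREATE NONCLUSTERED INDEX IX_Products_storeid
ON dbo.Products (storeid);
```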
I hope you see the pattern here. Create a nonclustered index for each column that gets queried with a high selectivity (meaning that only a small subset of all the rows will be selected).
The row-columns are completely useless in this scenario; just drop them to save space.
Using client-generated GUIDs is attractive from the client point of view. You can create coherent datasets (such as a new customer including his first order) and push them to the database without caring about the correct INSERT order and without having to read database-generated ids afterwards to update your object model. But you pay a nontrivial performance price for this when it comes to getting the data back from the database, as I hopefully made clear above. The large primary key (16 bytes) gets added to each nonclustered index and blows up its size, and you get a heavily fragmented clustered index, which is never good.
Using IDENTITY values for primary keys has disadvantages at INSERT time, but pays off every time after that.
Best Answer
Hopefully with that size you are on Enterprise edition? If not, good luck :) (just kidding; see the bottom of this answer)
If you are on Enterprise, I would suggest looking into partitioning. This has saved my butt a time or two when dealing with this same kind of scenario. Create your partitions on the date field; you will have to determine how big you want those partitions to be. I have done one partition per day, and I've also done one partition per month. It all depends on how much history you need (remember there is a fixed limit on the number of partitions per table).
If you are limiting the partition to a day AND your queries are only for a specific day, the optimizer should be able to use partition elimination to pick just the partitions it needs for your query. Also, when the time comes to purge old data after several years, the sliding-window pattern (partition switching) makes it really easy/quick to purge data (rather than DELETE statements). The same goes for archiving that old data to another table completely.
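A minimal sketch of a monthly setup, assuming a table called `dbo.Stuff` with a `dt` date column (all names and boundary dates here are placeholders; note that since SQL Server 2016 SP1, partitioning is available in every edition, not just Enterprise):

```sql
-- Monthly boundaries; add new boundaries over time with SPLIT RANGE.
CREATE PARTITION FUNCTION pfStuffByMonth (datetime2)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME psStuffByMonth
AS PARTITION pfStuffByMonth ALL TO ([PRIMARY]);

-- The partitioning column must be part of the clustered index key.
CREATE TABLE dbo.Stuff
(
    id      bigint IDENTITY NOT NULL,
    dt      datetime2 NOT NULL,
    payload varchar(100) NULL,
    CONSTRAINT PK_Stuff PRIMARY KEY CLUSTERED (id, dt)
) ON psStuffByMonth (dt);

-- Purging a month is then a metadata-only operation, e.g.:
-- ALTER TABLE dbo.Stuff SWITCH PARTITION 2 TO dbo.Stuff_Purge;
```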
Non-Enterprise ideas:
Speaking of archiving to another table, that would help as well and not require Enterprise. If this table is only used for reads for anything older than a month or a year, then you can:
1) Create a second table named Stuff_Archive
2) Move everything older than a month or a year (your preference)
3) Rename your current Stuff table to Stuff_Current
4) Create a view named Stuff that unions Stuff_Current and Stuff_Archive. This way any application that uses Stuff will still be able to read from both, and queries that only need recent data can go straight to Stuff_Current.
One thing that may help even further is to add a constraint that you change each time you move records to Stuff_Archive (I have not tested this myself, but I plan on trying it on a DB I currently need to maintain like this). This would be a CHECK constraint on the date field, so the optimizer knows: "even though I'm unioning these two tables, I only need to look at table A for dates between X and Y, and table B for dates between Y and Z." Column statistics alone don't give that guarantee, but a trusted CHECK constraint does; this is the classic "partitioned view" technique, and it lets the optimizer eliminate the table whose range cannot match.
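A sketch of that partitioned-view arrangement (table names follow the steps above; the boundary date is a placeholder you would move each time you archive):

```sql
-- Trusted CHECK constraints on the date column; WITH CHECK makes SQL
-- Server validate existing rows so the constraints are trusted.
ALTER TABLE dbo.Stuff_Archive WITH CHECK
ADD CONSTRAINT CK_Stuff_Archive_dt CHECK (dt < '2024-01-01');

ALTER TABLE dbo.Stuff_Current WITH CHECK
ADD CONSTRAINT CK_Stuff_Current_dt CHECK (dt >= '2024-01-01');
GO
-- The view unions both tables; with trusted constraints the optimizer
-- can skip whichever table a query's date range cannot touch.
CREATE VIEW dbo.Stuff
AS
SELECT * FROM dbo.Stuff_Current
UNION ALL
SELECT * FROM dbo.Stuff_Archive;
```

When you move rows to the archive, drop and recreate both constraints with the new boundary date.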
And the other non-Enterprise idea would be to create a nonclustered index on the date field, or even better (if you can fit it into a maintenance window) a filtered index where dt >= some date in the past and dt <= some date in the future. That way any queries you write will be able to use that filtered index, and it will only be as big as the data within that date range (rather than all the history in your ever-growing table). In other words: if you create a filtered index for a week, you are only dealing with 500k * 7 = 3.5M records rather than 5 billion. You just have to rebuild that index often enough to keep the window described in the filtered WHERE criteria within the timeframe of your queries. So if you can rebuild it daily and you only need yesterday's data, great. But if you can only rebuild it on the weekend, you may need to make sure stats are kept up to date throughout the week.
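A sketch of such a filtered index, with placeholder names and dates; note the filter predicate must use constant literals, which is why the index has to be recreated on a schedule to slide the window forward:

```sql
-- Rolling one-week window; only rows inside the range are indexed,
-- so the index stays small regardless of total table size.
CREATE NONCLUSTERED INDEX IX_Stuff_dt_recent
ON dbo.Stuff (dt)
WHERE dt >= '2024-01-01' AND dt < '2024-01-08';
```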