SQL Server – A Scalable Storage Mechanism for Large TTL Data Collections

Tags: blob, riak, sql-server, storage

We currently have a legacy web service that stores each XML request/response in SQL Server. The data only needs to persist for three days, after which it is considered expired. SQL Server copes poorly with deleting rows at this rate, since every delete is written to the transaction log. The database currently grows at 6-10 GB per day, and this is going to increase. Only around 1% of the stored responses are ever recalled, so this is a very write-heavy application. Each request/response XML document can be up to 14 KB in size, which at 6-10 GB per day works out to roughly 450,000-750,000 documents a day (more if the average document is smaller than the 14 KB maximum).

What storage mechanism would you choose for up to 50-100 GB of data per day?

I understand the current design is not sustainable, and I am really looking for a tactical fix, since we cannot easily change how all our clients query and re-query the data. We could look into a database with native TTL support (Riak, Postgres, etc.), or perhaps a file/blob solution such as S3 or Azure blob storage is a better fit. My concern with a cloud blob store is lookup performance if we had to scan across multiple buckets (since buckets have capacity limits), especially compared to the current single-table lookup in SQL Server.

I'm open to ideas and suggestions.

Best Answer

I have created a very simple demo of how partition switching might work for you:

USE tempdb
GO

SET NOCOUNT ON
GO

IF OBJECT_ID('dbo.largeTable') IS NOT NULL DROP TABLE dbo.largeTable
IF OBJECT_ID('dbo.largeTable1') IS NOT NULL DROP TABLE dbo.largeTable1
IF EXISTS ( SELECT * FROM sys.partition_schemes WHERE name = 'ps_date' ) DROP PARTITION SCHEME ps_date
IF EXISTS ( SELECT * FROM sys.partition_functions WHERE name = 'pf_date' ) DROP PARTITION FUNCTION pf_date
GO

CREATE PARTITION FUNCTION pf_date (DATE) AS RANGE RIGHT FOR VALUES ( '1 Jan 2013', '1 Feb 2013', '1 Mar 2013', '1 Apr 2013', '1 May 2013', '1 Jun 2013', '1 Jul 2013', '1 Aug 2013', '1 Sep 2013', '1 Oct 2013', '1 Nov 2013', '1 Dec 2013' );
GO

-- !!TODO don't use ALL TO [PRIMARY] in production; create individual files and filegroups instead (see the sketch below)
CREATE PARTITION SCHEME ps_date AS PARTITION pf_date ALL TO ( [PRIMARY] )
GO
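
-- A sketch of what that TODO might look like against a real user database.
-- The database name MyDb and the file path are hypothetical; you would create
-- one filegroup per month and map them in calendar order (RANGE RIGHT with
-- 12 boundaries produces 13 partitions, so 13 entries are needed):
--
--   ALTER DATABASE MyDb ADD FILEGROUP fg_2013_01;
--   ALTER DATABASE MyDb ADD FILE
--       ( NAME = f_2013_01, FILENAME = 'D:\Data\f_2013_01.ndf', SIZE = 1GB )
--       TO FILEGROUP fg_2013_01;
--   -- ...repeat for each month, then:
--   -- CREATE PARTITION SCHEME ps_date AS PARTITION pf_date
--   --     TO ( [PRIMARY], fg_2013_01, fg_2013_02, /* ... */ fg_2013_12 );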

IF OBJECT_ID('dbo.largeTable') IS NULL
CREATE TABLE dbo.largeTable 
    ( 
    rowId INT IDENTITY, 
    someData UNIQUEIDENTIFIER DEFAULT NEWID(), 
    dateAdded DATE DEFAULT GETDATE(), 
    addedBy VARCHAR(30) DEFAULT SUSER_NAME(), 
    ts ROWVERSION,

    CONSTRAINT pk PRIMARY KEY(dateAdded, rowId) 
    ) ON [ps_date](dateAdded)
GO
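
-- Sanity check: with RANGE RIGHT each boundary value is the FIRST date of its
-- partition, so a mid-August date should land in partition 9 ([1 Aug, 1 Sep))
SELECT $PARTITION.pf_date( '20130815' ) AS augustPartition   -- expect 9
GO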


CREATE TABLE dbo.largeTable1
    ( 
    rowId INT IDENTITY, 
    someData UNIQUEIDENTIFIER DEFAULT NEWID(), 
    dateAdded DATE DEFAULT GETDATE(), 
    addedBy VARCHAR(30) DEFAULT SUSER_NAME(), 
    ts ROWVERSION,

    CONSTRAINT pk2 PRIMARY KEY(dateAdded, rowId) 
    ) ON [PRIMARY]
GO


-- Create some dummy data (GO 5 executes the insert batch five times, giving five rows dated today)
INSERT INTO dbo.largeTable DEFAULT VALUES
GO 5

-- Multiply the data a bit
INSERT INTO dbo.largeTable ( someData, dateAdded, addedBy ) 
SELECT someData, DATEADD( month, -2, dateAdded ), addedBy
FROM dbo.largeTable
UNION ALL
SELECT someData, DATEADD( month, -1, dateAdded ), addedBy
FROM dbo.largeTable 
UNION ALL
SELECT someData, DATEADD( month, 1, dateAdded ), addedBy
FROM dbo.largeTable
GO


-- Have a look at the data
SELECT 'before' s, $PARTITION.pf_date( dateAdded ) p, dateAdded, COUNT(*) AS records
FROM dbo.largeTable
GROUP BY dateAdded
GO

-- Switch out the oldest partition holding data and truncate it. Partition 9
-- covers [1 Aug 2013, 1 Sep 2013) under RANGE RIGHT; the hard-coded 9 assumes
-- the demo is run around October 2013, so GETDATE() minus two months is August
ALTER TABLE dbo.largeTable SWITCH PARTITION 9 TO dbo.largeTable1
GO
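
-- The switch is a metadata-only operation; the rows now live in the staging
-- table and can be truncated cheaply
SELECT COUNT(*) AS switchedRows FROM dbo.largeTable1
GO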

TRUNCATE TABLE dbo.largeTable1
GO

SELECT 'after' s, $PARTITION.pf_date( dateAdded ) p, dateAdded, COUNT(*) AS records
FROM dbo.largeTable
GROUP BY dateAdded
GO

-- Merge away the boundary point as it is no longer required; the emptied
-- partition is folded into its neighbour
ALTER PARTITION FUNCTION pf_date() MERGE RANGE ( '1 Sep 2013' );
GO
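
For the 3-day TTL in your question you would partition by day rather than by month, and run a scheduled job that slides the window: split in a new boundary ahead of today, switch out the expired day, truncate it, and merge its boundary away. A rough sketch, with hypothetical names (a daily version of pf_date/ps_date, dbo.requestLog as the partitioned table, and dbo.requestLogStage as an identically structured empty table on the same filegroup):

DECLARE @newBoundary DATE = DATEADD( day, 2, CAST( GETDATE() AS DATE ) );
DECLARE @expiredDay  DATE = DATEADD( day, -3, CAST( GETDATE() AS DATE ) );

-- 1. Create tomorrow's partition ahead of time, so SPLIT never touches rows
ALTER PARTITION SCHEME ps_date NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_date() SPLIT RANGE ( @newBoundary );

-- 2. Switch the expired day out to the staging table (metadata only)
DECLARE @p INT = $PARTITION.pf_date( @expiredDay );
ALTER TABLE dbo.requestLog SWITCH PARTITION @p TO dbo.requestLogStage;

-- 3. Empty the staging table; only the deallocations are logged
TRUNCATE TABLE dbo.requestLogStage;

-- 4. Remove the now-empty partition's boundary
ALTER PARTITION FUNCTION pf_date() MERGE RANGE ( @expiredDay );
GO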

TRUNCATE TABLE can be a minimally logged operation under certain conditions: only the page and extent deallocations are recorded, not the individual rows. Please consult the Data Loading Performance Guide for a fuller treatment of the topic; it also has a section on "Deleting All Rows from a Partition or Table".
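
If you want to see the logging difference for yourself, one rough way (a sketch using the sys.dm_db_log_space_usage DMV, available from SQL Server 2012; reload dbo.largeTable1 first, since the demo above leaves it empty) is to snapshot log usage around each operation:

SELECT used_log_space_in_bytes AS before_delete FROM sys.dm_db_log_space_usage;
DELETE FROM dbo.largeTable1;    -- every row is individually logged
SELECT used_log_space_in_bytes AS after_delete  FROM sys.dm_db_log_space_usage;
GO
-- Repeat with TRUNCATE TABLE dbo.largeTable1 in place of the DELETE and
-- compare: only the page/extent deallocations hit the log.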

Good luck!