SQL Server – How to manage 3.1 billion rows of data

database-design sql-server

I am currently tasked with implementing a storage schema for a relatively large amount of data. The data will primarily be accessed to determine a current data point value, but I am also required to track the past six months of history for data trending/analytics.

A recent requirement was added to track the min / max / sum value for the past hour.

NOTE: Ideally, I would like to consider a MongoDB option, but I need to demonstrate that I have exhausted the SQL-Server options first.

The Data

The following table represents the primary data source (queried most frequently). The table will have approximately five million rows. The data changes will predominantly be UPDATE statements with very occasional INSERT statements after the initial data load. I have opted to cluster the data by dataPointId as you will always be selecting all values for a given data point.

-- Simplified table
CREATE TABLE [dbo].[DataPointValue](
    [dataPointId]  [int] NOT NULL,
    [valueId]      [int] NOT NULL,
    [timestamp]    [datetime] NOT NULL,
    [minimum]      [decimal](18, 0) NOT NULL,
    [hourMinimum]  [decimal](18, 0) NOT NULL,
    [current]      [decimal](18, 0) NOT NULL,
    [currentTrend] [decimal](18, 0) NOT NULL,
    [hourMaximum]  [decimal](18, 0) NOT NULL,
    [maximum]      [decimal](18, 0) NOT NULL,

    CONSTRAINT [PK_MeterDataPointValue] PRIMARY KEY CLUSTERED ([dataPointId], [valueId])
)

The second table is notably larger at approximately 3.1 billion rows (representing the past six months of data). Data older than six months will be purged; otherwise the workload is strictly INSERT statements (~200 rows / sec, 720,000 rows / hour, ~17 million rows / day).

-- Simplified table
CREATE TABLE [dbo].[DataPointValueHistory](
    [dataPointId] [int]            NOT NULL,
    [valueId]     [int]            NOT NULL,
    [timestamp]   [datetime]       NOT NULL,
    [value]       [decimal](18, 0) NOT NULL,
    [delta]       [decimal](18, 0) NOT NULL,

    CONSTRAINT [PK_MeterDataPointHistory] PRIMARY KEY CLUSTERED ([dataPointId], [valueId], [timestamp])
)

The expectation is that this table will double in size as the number of tracked data point values increases to 400 rows / sec (so reaching ~ 10 billion isn't out of the question).

The Question(s) (yes, I am asking more than one… they are all closely related).

I am currently using a SQL Server 2008 R2 Standard Edition database. I am likely going to make the case for upgrading to Enterprise Edition if I can obtain the desired performance level with table partitioning (or for MongoDB if I cannot hit the required performance levels with SQL Server). I would like your input on the following:


1) Given that I need to compute the min, max, and sum for the past hour (as in now minus 60 minutes), what is the best approach for tracking recent data:

  • Hold the recent data in the data service's memory and write out the computed min/max/average with each data UPDATE.

  • Query the recent history from the history table (impacts next question?) during each UPDATE statement. The query would access the latest data for a data point value and should only need to scan the last million records or so (see the sketch after this list).

  • Store the recent history in the DataPointValue row itself to avoid the history table lookup? Perhaps stored as a delimited string and processed inside the UPDATE proc?

  • Some other option I haven't considered?
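For reference, here is a rough sketch of what the second option might look like, assuming the rolling-hour aggregates are derived from DataPointValueHistory inside the update procedure (the parameter names and values are purely illustrative):

-- Hypothetical sketch: compute the rolling-hour aggregates for one data point
-- value from the history table. Parameter names and values are illustrative.
DECLARE @dataPointId int = 42,
        @valueId     int = 7;

SELECT MIN(h.[value]) AS hourMinimum,
       MAX(h.[value]) AS hourMaximum,
       SUM(h.[value]) AS hourSum
FROM   [dbo].[DataPointValueHistory] AS h
WHERE  h.[dataPointId] = @dataPointId
  AND  h.[valueId]     = @valueId
  AND  h.[timestamp]  >= DATEADD(MINUTE, -60, GETDATE());

With the current clustering on (dataPointId, valueId, timestamp) this should be a narrow range seek rather than a scan; with a timestamp-leading clustered key it would likely need a supporting nonclustered index.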


2) For DataPointValueHistory, queries against the table will always be by dataPointId and one or more valueIds. The data queried will typically be for the last day, week, or month, but may cover the full six months in some cases.

I am currently generating a sample data set to experiment with whether it makes more sense to cluster by dataPointId/valueId/timeStamp or by timeStamp/dataPointId/valueId. If anyone has experience dealing with a table of this size and is willing to offer their insight, it would be appreciated. I am leaning towards the latter option to avoid index fragmentation, but query performance is critical.

  • Cluster DataPointValueHistory by dataPointId -> valueId -> timeStamp

  • Cluster DataPointValueHistory by timeStamp -> dataPointId -> valueId (see the DDL sketch after this list)
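As a point of reference, a minimal sketch of the timestamp-leading variant, assuming the same columns (the table and constraint names here are hypothetical, just to distinguish it from the original):

-- Hypothetical DDL for the timestamp-leading clustering option. Inserts arrive
-- in roughly timestamp order, so the clustered index stays append-only and
-- fragmentation is minimal; the trade-off is that one data point's history is
-- scattered across the table, so per-data-point range queries need more
-- random I/O or a supporting nonclustered index.
CREATE TABLE [dbo].[DataPointValueHistory_TsFirst](
    [dataPointId] [int]            NOT NULL,
    [valueId]     [int]            NOT NULL,
    [timestamp]   [datetime]       NOT NULL,
    [value]       [decimal](18, 0) NOT NULL,
    [delta]       [decimal](18, 0) NOT NULL,

    CONSTRAINT [PK_MeterDataPointHistory_TsFirst]
        PRIMARY KEY CLUSTERED ([timestamp], [dataPointId], [valueId])
)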


3) Finally, as mentioned above, I think it will make sense to partition the DataPointValueHistory table. Any suggestions on how to best partition the history data would be greatly appreciated.

  • If clustered by timestamp first, I am thinking that the data should be partitioned by week (27 partitions total). The oldest partition would be purged after week 27 (see the sliding-window sketch after this list).

  • If clustered by dataPointId first, I am thinking that the data should be partitioned by some modulus of the id?
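For the weekly option, my current understanding is that the sliding window would look roughly like the sketch below (the function, scheme, boundary dates, and filegroup choices are illustrative, and on 2008 R2 this requires Enterprise Edition):

-- Minimal sketch of a weekly sliding window, assuming the table is clustered
-- and partitioned on [timestamp]. All names, dates, and filegroups below are
-- illustrative.
CREATE PARTITION FUNCTION [pfHistoryWeekly] (datetime)
AS RANGE RIGHT FOR VALUES ('2013-01-07', '2013-01-14', '2013-01-21');  -- one boundary per week

CREATE PARTITION SCHEME [psHistoryWeekly]
AS PARTITION [pfHistoryWeekly] ALL TO ([PRIMARY]);

-- Weekly maintenance: switch the oldest partition out to an identically
-- structured staging table, truncate it, then merge the trailing boundary and
-- split a new one at the leading edge (after setting NEXT USED on the scheme).
-- ALTER TABLE [dbo].[DataPointValueHistory] SWITCH PARTITION 1 TO [dbo].[DataPointValueHistory_Purge];
-- ALTER PARTITION FUNCTION [pfHistoryWeekly]() MERGE RANGE ('2013-01-07');
-- ALTER PARTITION SCHEME [psHistoryWeekly] NEXT USED [PRIMARY];
-- ALTER PARTITION FUNCTION [pfHistoryWeekly]() SPLIT RANGE ('2013-01-28');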

As I have very limited experience with table partitioning, your expertise would be appreciated.

Best Answer

I found this analysis very useful when I was researching how to build an analytics solution that would have billions of rows in one table.

http://leiliweb.wordpress.com/2012/12/11/partitioned-table-and-index-strategies-using-sql-server-2008/