SQL Server – Strategy to deal with multi-billion-row simulation results and partitioning

partitioning, sql-server-2012

I am trying to come up with a good way to deal with the results of statistical simulations from a DBA point of view. We generate about 500 million rows per day, most of which are "garbage" (i.e. results that are reviewed and discarded as not what we are looking for), while some need to be preserved. Dealing with that volume without partitioning is hard.

Data is currently mostly in a 3-table hierarchy (trade – order – update), with a trade having multiple orders, each of which gets multiple updates. There is a 4th table (parameter) that contains the parameters for every simulation; this one is small and unproblematic, though.

Right now we write the data to 3 staging tables and analyze it there – a temporary solution.

I would like some people to review this idea:

  • Partition the staging tables into x "buckets". A simulation is assigned a bucket (smallint) and then writes into that bucket. This allows fast deletion of a whole simulation. As we only run about 100 simulations per week, a set of 1,000–20,000 partitions on the tables is enough to keep the data as long as we need it (initial review). See the sketch after this list.

  • When the data is OK, we move it from staging (via a stored procedure) into the final data warehouse tables. Again we need to partition them, and we will use a similar bucket approach. As multiple simulations will land in identical buckets (updating the data), this is a relatively small number of buckets.
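
To make this concrete, roughly what I have in mind for one of the staging tables (all object and column names below are just placeholders, the warehouse table FactUpdate is not shown, and the boundary list is shortened – in reality I would script out 1,000+ boundary values up front):

    -- Pre-generated partition function / scheme on the smallint bucket id.
    CREATE PARTITION FUNCTION pf_SimBucket (smallint)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);      -- extend to 1000+ pre-generated boundaries

    CREATE PARTITION SCHEME ps_SimBucket
    AS PARTITION pf_SimBucket ALL TO ([PRIMARY]);   -- or spread across filegroups

    -- One of the three staging tables; trade and order follow the same pattern.
    CREATE TABLE dbo.StagingUpdate
    (
        BucketId   smallint       NOT NULL,         -- bucket assigned to the simulation run
        OrderId    bigint         NOT NULL,
        UpdateTime datetime2(3)   NOT NULL,
        Payload    varbinary(200) NULL,
        CONSTRAINT PK_StagingUpdate
            PRIMARY KEY CLUSTERED (BucketId, OrderId, UpdateTime)
    ) ON ps_SimBucket (BucketId);

    -- Empty twin table on the same scheme, used only as a switch target.
    CREATE TABLE dbo.StagingUpdate_Switch
    (
        BucketId   smallint       NOT NULL,
        OrderId    bigint         NOT NULL,
        UpdateTime datetime2(3)   NOT NULL,
        Payload    varbinary(200) NULL,
        CONSTRAINT PK_StagingUpdate_Switch
            PRIMARY KEY CLUSTERED (BucketId, OrderId, UpdateTime)
    ) ON ps_SimBucket (BucketId);

    DECLARE @Bucket smallint = 3;                   -- bucket of the reviewed simulation
    DECLARE @PartitionNumber int;
    SELECT @PartitionNumber = $PARTITION.pf_SimBucket(@Bucket);

    -- If the simulation is kept: promote its rows into the (similarly partitioned)
    -- warehouse table with a bucket-restricted INSERT ... SELECT.
    INSERT INTO dbo.FactUpdate (BucketId, OrderId, UpdateTime, Payload)
    SELECT BucketId, OrderId, UpdateTime, Payload
    FROM   dbo.StagingUpdate
    WHERE  BucketId = @Bucket;

    -- Either way, emptying the bucket afterwards is a metadata-only operation.
    ALTER TABLE dbo.StagingUpdate
        SWITCH PARTITION @PartitionNumber
        TO dbo.StagingUpdate_Switch PARTITION @PartitionNumber;

    TRUNCATE TABLE dbo.StagingUpdate_Switch;

The SWITCH / TRUNCATE pair does not depend on the number of rows in the bucket, which is what makes discarding a "garbage" simulation cheap.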

Has anyone done that?

The idea behind the bucket approach is that I can pre-generate the buckets and do not have to modify the partition function. Sadly SQL Server, contrary to Oracle, has no automatic partitioning, otherwise I could just use a simple ID field. I am really trying to avoid modifying the partitioning scheme dynamically. This way I can have a simple smallint "bucket id" and a prepared partition scheme, and can basically assign every simulation / run a bucket id that is easy to join on. Any negatives?
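
The parameter table would then just record which bucket each run writes into, and every query filters on that bucket id so only one partition is touched (again, table and column names here are just placeholders):

    -- Assign the next free pre-generated bucket to a new simulation run.
    UPDATE dbo.SimulationRun
    SET    BucketId = 42                -- chosen from the pool of pre-generated buckets
    WHERE  RunId = 1234;

    -- Reviewing a run: look up its bucket ...
    DECLARE @Bucket smallint;
    SELECT @Bucket = BucketId
    FROM   dbo.SimulationRun
    WHERE  RunId = 1234;

    -- ... then filter on the partitioning column so only that bucket is read.
    SELECT o.OrderId, u.UpdateTime, u.Payload
    FROM   dbo.StagingOrder  AS o
    JOIN   dbo.StagingUpdate AS u ON u.BucketId = o.BucketId
                                 AND u.OrderId  = o.OrderId
    WHERE  o.BucketId = @Bucket
      AND  u.BucketId = @Bucket;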

Best Answer

We were doing something not unlike this with clickstream data. We unplugged SQL Server and plugged in Vertica (a columnar, analytics-oriented, ANSI-SQL-compliant RDBMS)... we've never looked back. Multi-minute queries dropped to millisecond queries, and data loads dropped from hours to seconds. If/when you start outgrowing it, add more nodes and rebalance online. Very nicely done product.

The Community Edition is free for 3 nodes and 1 TB of data (which gets compressed a LOT more than you would expect, so 1 TB is quite a lot of data), and the commercial version (unlimited nodes) is about half the cost of SQL Server EE for TB-size data (it's licensed by data size). You might want to give it a look. ;-)

Btw, I have no affiliation with Vertica or HP; I'm just a very pleased customer/data architect.

Cheers, Dave Sisk