SQL Server – Strategy to deal with multi-billion-row simulation results and partitioning

partitioning, sql-server-2012

I am trying to come up with a good way to deal with the results of statistical simulations from a DBA point of view. We generate about 500 million rows per day, most of which are "garbage" (i.e. results that are reviewed and discarded as not what we are looking for), while some need to be preserved. Dealing with that volume without partitioning is hard.

Data is currently mostly in a 3-table hierarchy (trade – order – update), with a trade having multiple orders, each of which gets multiple updates. There is a 4th table (parameter) that contains the parameters for every simulation; this one is small and unproblematic, though.

Right now we write the data to 3 staging tables and analyze it there – a temporary solution.

I would like some people to review this idea:

  • Partition the staging tables into x "buckets". A simulation is assigned a bucket (smallint) and then writes into that bucket. This allows fast deletion of a whole simulation. As we only run about 100 simulations per week, a set of 1,000–20,000 partitions on the tables is enough to keep the data as long as we need it (initial review). See the sketch after this list.

  • When the data is OK, we move it from staging (via a stored procedure) into the final data warehouse tables. Again we need to partition them, and we will use a similar bucket approach. As multiple simulations will land in identical buckets (updating the data), this is a relatively small number of buckets.
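
To make this concrete, roughly what I have in mind for one of the staging tables (all object and column names below are just placeholders, the warehouse table FactUpdate is not shown, and the boundary list is shortened – in reality I would script out 1,000+ boundary values up front):

    -- Pre-generated partition function / scheme on the smallint bucket id.
    CREATE PARTITION FUNCTION pf_SimBucket (smallint)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);      -- extend to 1000+ pre-generated boundaries

    CREATE PARTITION SCHEME ps_SimBucket
    AS PARTITION pf_SimBucket ALL TO ([PRIMARY]);   -- or spread across filegroups

    -- One of the three staging tables; trade and order follow the same pattern.
    CREATE TABLE dbo.StagingUpdate
    (
        BucketId   smallint       NOT NULL,         -- bucket assigned to the simulation run
        OrderId    bigint         NOT NULL,
        UpdateTime datetime2(3)   NOT NULL,
        Payload    varbinary(200) NULL,
        CONSTRAINT PK_StagingUpdate
            PRIMARY KEY CLUSTERED (BucketId, OrderId, UpdateTime)
    ) ON ps_SimBucket (BucketId);

    -- Empty twin table on the same scheme, used only as a switch target.
    CREATE TABLE dbo.StagingUpdate_Switch
    (
        BucketId   smallint       NOT NULL,
        OrderId    bigint         NOT NULL,
        UpdateTime datetime2(3)   NOT NULL,
        Payload    varbinary(200) NULL,
        CONSTRAINT PK_StagingUpdate_Switch
            PRIMARY KEY CLUSTERED (BucketId, OrderId, UpdateTime)
    ) ON ps_SimBucket (BucketId);

    DECLARE @Bucket smallint = 3;                   -- bucket of the reviewed simulation
    DECLARE @PartitionNumber int;
    SELECT @PartitionNumber = $PARTITION.pf_SimBucket(@Bucket);

    -- If the simulation is kept: promote its rows into the (similarly partitioned)
    -- warehouse table with a bucket-restricted INSERT ... SELECT.
    INSERT INTO dbo.FactUpdate (BucketId, OrderId, UpdateTime, Payload)
    SELECT BucketId, OrderId, UpdateTime, Payload
    FROM   dbo.StagingUpdate
    WHERE  BucketId = @Bucket;

    -- Either way, emptying the bucket afterwards is a metadata-only operation.
    ALTER TABLE dbo.StagingUpdate
        SWITCH PARTITION @PartitionNumber
        TO dbo.StagingUpdate_Switch PARTITION @PartitionNumber;

    TRUNCATE TABLE dbo.StagingUpdate_Switch;

The SWITCH / TRUNCATE pair does not depend on the number of rows in the bucket, which is what makes discarding a "garbage" simulation cheap.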

Has anyone done that?

The idea behind the bucket approach is that I can pre-generate the buckets and do not have to modify the partition function. Sadly SQL Server, contrary to Oracle, has no automatic partitioning, otherwise I could just use a simple ID field. I am really trying to avoid modifying the partitioning scheme dynamically. This way I can have a simple smallint "bucket id" and a prepared partition scheme, and can basically assign every simulation / run a bucket id that is easy to join on. Any negatives?
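
The parameter table would then just record which bucket each run writes into, and every query filters on that bucket id so only one partition is touched (again, table and column names here are just placeholders):

    -- Assign the next free pre-generated bucket to a new simulation run.
    UPDATE dbo.SimulationRun
    SET    BucketId = 42                -- chosen from the pool of pre-generated buckets
    WHERE  RunId = 1234;

    -- Reviewing a run: look up its bucket ...
    DECLARE @Bucket smallint;
    SELECT @Bucket = BucketId
    FROM   dbo.SimulationRun
    WHERE  RunId = 1234;

    -- ... then filter on the partitioning column so only that bucket is read.
    SELECT o.OrderId, u.UpdateTime, u.Payload
    FROM   dbo.StagingOrder  AS o
    JOIN   dbo.StagingUpdate AS u ON u.BucketId = o.BucketId
                                 AND u.OrderId  = o.OrderId
    WHERE  o.BucketId = @Bucket
      AND  u.BucketId = @Bucket;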

Best Answer

We were doing something not unlike this with clickstream data. We unplugged SQL Server and plugged in Vertica (a columnar, analytics-oriented, ANSI-SQL-compliant RDBMS)... we've never looked back. Multi-minute queries dropped to millisecond queries, and data loads dropped from hours to seconds. If/when you start outgrowing it, add more nodes and rebalance online. Very nicely done product.

The Community Edition is free for 3 nodes and 1 TB of data (which gets compressed a LOT more than you would expect, so 1 TB is quite a lot of data), and the commercial version (unlimited nodes) is about half the cost of SQL Server EE for TB-size data (it's licensed by data size). You might want to give it a look. ;-)

Btw, I have no affiliation with Vertica or HP; I'm just a very pleased customer/data architect.

Cheers, Dave Sisk