Sql-server – SQL Server 2008 Partitioning Evaluation

partitioning, performance, sql-server

I'm trying to do a performance comparison of a table with partitioning and the same table without partitioning.

So we have two tables, Subscriber and SubscriberPartitioned.

The table structure is:

SubscriberId | Name | Email | Telephone | UserId (partition column)

Data: I'm using SQL Data Generator to fill the tables with 10 million rows each. The UserId values range from 1 to 200.

Partitioning: I have split the SubscriberPartitioned table into 2 partitions on UserId, which ranges from 1 to 200, so each partition holds approximately 5 million rows.

Performance measure: I'm using SQL Server Profiler to measure the query times.

Why does SQL Server differ so much in query times?

For example the query
```
select * from Subscriber where SubscriberId = 1000
```
… takes about 40 seconds the first time, but less than a second if I re-execute it. If I then try another Id, it also takes less than a second. Only the first run is slow.

So, is there any optimization or caching being done automatically in the background by SQL Server?

I was expecting the same query to take the same time when repeated…
Which queries would be best to run, and how should I measure them, in order to test the partitioned table and see what the partitioning strategy gains over the normal table?

EDIT:

After the answers to this post I ran some cold-start tests and found that the non-partitioned table performed faster than the partitioned one.

This query in particular caught my attention:

```
select * from [table] where IdUser = 100
```

Why doesn't the partitioned table improve performance here? The query filters on the partition column, so shouldn't the engine only have to scan half of the records with the partitioned approach?

EDIT 2:

I reran the tests after assigning the second partition to a different filegroup, and the query mentioned above improved by about 50%.

Best Answer

What do you expect your real life volume of data to be?

For 10 million rows, I wouldn't bother with partitioning. The overhead far outweighs the benefits: partitioning isn't a silver bullet to cure performance issues.

To answer,

Point 1: on the first run, the data needs to be loaded into memory (the "buffer pool"), where it stays cached until it is evicted based on memory pressure and usage. Personally, I'd test with the cache filled, because you'd expect your app to need that data very often, especially if you think partitioning is the solution to some problem.
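
To compare cold-cache and warm-cache runs on purpose, something like the following T-SQL sketch could be used on a test instance. The table name `Subscriber` is an assumption based on the question, so adjust it to your schema. `DBCC DROPCLEANBUFFERS` empties the buffer pool for the whole server, so this is not something to run in production.

```sql
-- Test instance only: these commands affect the whole server.
CHECKPOINT;              -- flush dirty pages to disk so they become clean
DBCC DROPCLEANBUFFERS;   -- drop clean pages from the buffer pool (cold cache)

SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report CPU time and elapsed time

-- First run: pages have to come from disk, so expect physical reads.
SELECT * FROM Subscriber WHERE SubscriberId = 1000;

-- Second run: the same pages should now be served from the buffer pool.
SELECT * FROM Subscriber WHERE SubscriberId = 1000;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Comparing the "physical reads" and "elapsed time" figures of the two runs in the Messages output shows how much of the first-run cost is disk I/O.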

For point 2, what queries do you expect to run in production? The test queries should be representative of that production load. At a minimum, they should cover different realistic filter combinations, both with and without the partition key.
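
As a rough sketch of such a test set, the queries below filter with the partition key, without it, and with both. The partition function name `pfUserId` is hypothetical (substitute the function your table was actually partitioned with), and the table and column names are taken from the question.

```sql
-- How the rows are distributed across partitions.
-- pfUserId is a hypothetical name for the table's existing partition function.
SELECT $PARTITION.pfUserId(UserId) AS partition_number,
       COUNT(*)                    AS row_count
FROM SubscriberPartitioned
GROUP BY $PARTITION.pfUserId(UserId)
ORDER BY partition_number;

-- Filter on the partition key: only one partition should need to be touched.
SELECT * FROM SubscriberPartitioned WHERE UserId = 100;

-- Filter without the partition key: every partition has to be searched.
SELECT * FROM SubscriberPartitioned WHERE SubscriberId = 1000;

-- Filter on both: one partition, then a search inside it.
SELECT * FROM SubscriberPartitioned WHERE UserId = 100 AND SubscriberId = 1000;
```

Running the same statements against the non-partitioned table, and checking the actual execution plans for the partitions accessed, gives a like-for-like comparison.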

Edit, some reading, after comments below:

Related Solutions

Sql-server – SQL Server 2008 – Partitioning and Clustered Indexes

A partitioned table is really more like a collection of individual tables stitched together. So in your example of clustering by IncidentKey and partitioning by IncidentDate, say that the partitioning function splits the table into two partitions so that 1/1/2010 is in partition 1 and 7/1/2010 is in partition 2. The data will be laid out on disk as:

Partition 1:
IncidentKey    Date
ABC123        1/1/2010
ABC123        1/1/2011
XYZ999        1/1/2010

Partition 2:
IncidentKey    Date
ABC123        7/1/2010
XYZ999        7/1/2010

At a low level there really are two distinct rowsets. It is the query processor that gives the illusion of a single table by creating plans that seek, scan and update all rowsets together, as one.

Any row in any non-clustered index will have the clustered index key to which it corresponds, say ABC123,7/1/2010. Since the clustered index key always contains the partitioning key column, the engine will always know in which partition (rowset) of the clustered index to search for this value (in this case, in partition 2).

Now whenever you're dealing with partitioning you must consider if your NC indexes will be aligned (NC index is partitioned exactly the same as the clustered index) or non-aligned (NC index is non-partitioned, or partitioned differently from clustered index). Non-aligned indexes are more flexible, but they have some drawbacks:

non-aligned indexes require large amounts of memory for certain query plans
non-aligned indexes prevent efficient partition switch operations
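
The difference between the two kinds of index comes down to the `ON` clause of `CREATE INDEX`. Here is a minimal T-SQL sketch, assuming a two-partition layout like the one described above. All object names, the `Status` column and the use of the PRIMARY filegroup are made up for illustration.

```sql
-- Hypothetical objects, for illustration only.
CREATE PARTITION FUNCTION pfIncidentDate (date)
    AS RANGE RIGHT FOR VALUES ('20100701');  -- partition 2 starts at 2010-07-01

CREATE PARTITION SCHEME psIncidentDate
    AS PARTITION pfIncidentDate ALL TO ([PRIMARY]);

CREATE TABLE dbo.Incident (
    IncidentKey  char(6)     NOT NULL,
    IncidentDate date        NOT NULL,
    Status       varchar(20) NOT NULL
) ON psIncidentDate (IncidentDate);

-- Clustered index, partitioned by IncidentDate.
CREATE CLUSTERED INDEX CIX_Incident
    ON dbo.Incident (IncidentKey, IncidentDate)
    ON psIncidentDate (IncidentDate);

-- Aligned: built on the same partition scheme as the clustered index.
CREATE NONCLUSTERED INDEX IX_Incident_Status_Aligned
    ON dbo.Incident (Status)
    ON psIncidentDate (IncidentDate);

-- Non-aligned: stored as a single unpartitioned structure on a filegroup.
CREATE NONCLUSTERED INDEX IX_Incident_Status_NonAligned
    ON dbo.Incident (Status)
    ON [PRIMARY];
```

If the `ON` clause is omitted, an index created on a partitioned table is placed on the table's partition scheme, so it ends up aligned by default.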

Using aligned indexes solves these issues, but brings its own set of problems, because this physical, storage design, option ripples into the data model:

aligned indexes mean unique constraints can no longer be created/enforced unless they include the partitioning column
all foreign keys referencing the partitioned table must include the partitioning key in the relation (since the partitioning key is, due to alignment, in every index), and this in turn requires that all tables referencing the partitioned table contain the partitioning key column value. Think Orders->OrderDetails: if Orders has OrderID but is partitioned by OrderDate, then OrderDetails must contain not only OrderID but also OrderDate, in order to properly declare the foreign key constraint.
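
The Orders->OrderDetails case could look roughly like this in T-SQL. It is a sketch under assumed names: the partition function, scheme, boundary values and the `LineNumber` column are all invented for the example.

```sql
-- Hypothetical objects, for illustration only.
CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('20100101', '20110101');

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- The unique (primary key) index is partitioned, so it must include OrderDate.
CREATE TABLE dbo.Orders (
    OrderID   int  NOT NULL,
    OrderDate date NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
        ON psOrderDate (OrderDate)
);

-- OrderDetails has to carry OrderDate as well, purely so the foreign key
-- can reference the full (OrderID, OrderDate) key of Orders.
CREATE TABLE dbo.OrderDetails (
    OrderID    int  NOT NULL,
    OrderDate  date NOT NULL,
    LineNumber int  NOT NULL,
    CONSTRAINT PK_OrderDetails PRIMARY KEY (OrderID, OrderDate, LineNumber),
    CONSTRAINT FK_OrderDetails_Orders
        FOREIGN KEY (OrderID, OrderDate)
        REFERENCES dbo.Orders (OrderID, OrderDate)
);
```

Note that OrderID alone is no longer guaranteed unique by the database here: the same OrderID could appear with two different dates.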

I have seldom seen these effects called out at the beginning of a project that deploys partitioning, but they exist and have serious consequences.

If you think aligned indexes are a rare or extreme case, then consider this: in many cases the cornerstone of ETL and partitioning solutions is the fast switch in of staging tables. Switch in operations require aligned indexes.
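
A switch-in might look like the sketch below. It assumes a partitioned table `dbo.Incident` and a staging table `dbo.IncidentStaging` (both hypothetical names) that already exist with identical columns and indexes on the same filegroup, that partition 2 covers all dates from 2010-07-01 onwards, and that partition 2 is currently empty.

```sql
-- Hypothetical objects: dbo.Incident (partitioned) and dbo.IncidentStaging.

-- The staging table needs a trusted CHECK constraint proving that every row
-- falls inside the boundaries of the target partition.
ALTER TABLE dbo.IncidentStaging WITH CHECK
    ADD CONSTRAINT CK_IncidentStaging_Range
    CHECK (IncidentDate IS NOT NULL AND IncidentDate >= '20100701');

-- Metadata-only operation: the staging rows become partition 2 of the
-- target table without being physically copied.
ALTER TABLE dbo.IncidentStaging
    SWITCH TO dbo.Incident PARTITION 2;
```

If the target table has a non-aligned index, the `SWITCH` statement is rejected, which is why this pattern depends on alignment.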

Oh, one more thing: my whole argument about foreign keys and the ripple effect of adding the partitioning column value to other tables applies equally to joins.

Mysql – Partitioning large MySQL table

First, you should consider solving the problem in another way.

Upgrade to MySQL 5.6, where OPTIMIZE TABLE works without blocking (for an InnoDB table), as it is supported by InnoDB Online DDL.
If you can't upgrade, try using Percona Toolkit's pt-online-schema-change, which can perform the table rebuild without blocking.
```
$ pt-online-schema-change h=localhost,D=mydatabase,t=mytable --execute \
    --alter="ENGINE=InnoDB"
```

If you're set on using partitioning, then yes, you must make id the partition key in the table you show. You can convert the table to partitioning with ALTER TABLE. If you need the conversion operation to be non-blocking, use pt-online-schema-change.

There's no way to partition to fixed-size partitions. You have to partition by values. But is it really that important to hit a specific size per partition?

Re your comment about partition size:

When using RANGE partitioning, what I do is set up a schedule to ALTER TABLE and split the last partition from time to time. If you have a regular rate of growth, this is easy. If your growth is irregular, you might instead set up a periodic check that examines the number of rows per partition (query INFORMATION_SCHEMA.PARTITIONS) and emails you when the last regular partition is getting full.

For example, let's set up a table partitioned by range on id.

```
CREATE TABLE `mytable` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `transactionid` int(11) NOT NULL,
  `parent` int(11) NOT NULL,
  `headers` longtext,
  `creator` int(11) NOT NULL,
  `created` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`),
  KEY `transactionid` (`transactionid`,`parent`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (id)
(PARTITION p0 VALUES LESS THAN (0) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (1000) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (3000) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
```

As the MAX(id) approaches 3000, it's getting close to filling up p3 and spilling over into p4. So it's time to reorganize. It's good to do this before any data spills over into p4, because the reorg will affect only the last, empty partition and will therefore be very quick.

```
ALTER TABLE mytable REORGANIZE PARTITION p4 INTO 
(PARTITION p4 VALUES LESS THAN (4000), PARTITION p5 VALUES LESS THAN MAXVALUE);
```

Even if you miss a day and you get some data into the old p4, chances are it's not much data. But if you neglect this for a month or two, and p4 fills up with a lot of data, then the REORGANIZE will take longer.
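
The periodic check on rows per partition mentioned earlier could be based on a query like this one. The schema name `mydatabase` is an assumption carried over from the pt-online-schema-change example, and `TABLE_ROWS` is only an estimate for InnoDB tables.

```sql
-- Row estimate and upper bound for each partition of mytable.
SELECT PARTITION_NAME,
       PARTITION_ORDINAL_POSITION,
       PARTITION_DESCRIPTION,  -- upper bound of a RANGE partition, or MAXVALUE
       TABLE_ROWS              -- approximate row count for InnoDB
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = 'mydatabase'
  AND TABLE_NAME   = 'mytable'
ORDER BY PARTITION_ORDINAL_POSITION;
```

A scheduled job could run this and raise an alert as soon as the MAXVALUE partition reports any rows, or when the partition just before it nears its upper bound.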
