Not sure that this is simpler than a cursor solution, but here is what I would do:
Before we begin we need a few tables in a SQL Fiddle:
MS SQL Server 2012 Schema Setup
CREATE TABLE dbo.tbl1(c1 INT, c2 INT, c3 INT);
CREATE TABLE dbo.tbl2(c4 INT, c5 INT, c6 INT);
CREATE TABLE dbo.tbl3(c7 INT, c8 INT, c9 INT);
First we need the list of all user tables. For that we can use the sys.tables
catalog view. The OBJECT_SCHEMA_NAME()
function gets us the schema name, the OBJECT_NAME()
function the table name. The QUOTENAME()
function quotes the names correctly, in case some of those contain special characters, key words or spaces. (In this example the use of OBJECT_NAME()
is not strictly necessary as sys.tables
has a name column, but I left it in as you can use this pattern with any catalog view that has an object_id
column.)
Query 1:
SELECT
QUOTENAME(OBJECT_SCHEMA_NAME(T.object_id))+'.'+
QUOTENAME(OBJECT_NAME(T.object_id)) AS quoted_table_name,
T.object_id
FROM sys.tables AS T;
Results:
| QUOTED_TABLE_NAME | OBJECT_ID |
|-------------------|-----------|
| [dbo].[tbl1] | 245575913 |
| [dbo].[tbl2] | 261575970 |
| [dbo].[tbl3] | 277576027 |
The next step is to get the list of column names, again quoted. We can use the sys.columns
catalog view for that.
Query 2:
SELECT C.name,C.column_id
FROM sys.columns AS C
WHERE C.object_id = OBJECT_ID('dbo.tbl1');
Results:
| NAME | COLUMN_ID |
|------|-----------|
| c1 | 1 |
| c2 | 2 |
| c3 | 3 |
The next hurdle is to get those columns into a comma-separated list. There is no built-in string concatenation aggregate function, so we have to use a trick:
Query 3:
SELECT STUFF((
SELECT ','+QUOTENAME(name)
FROM sys.columns AS C
WHERE C.object_id = OBJECT_ID('dbo.tbl1')
ORDER BY C.column_id
FOR XML PATH(''),TYPE
).value('.','NVARCHAR(MAX)'),1,1,'') AS column_list;
Results:
| COLUMN_LIST    |
|----------------|
| [c1],[c2],[c3] |
With that, all the pieces are in place and we just have to put them together:
Query 4:
SELECT 'SELECT ' +
CL.column_list +
' FROM ' +
QUOTENAME(OBJECT_SCHEMA_NAME(T.object_id)) + '.' +
QUOTENAME(OBJECT_NAME(T.object_id)) +
';' AS select_statement
FROM sys.tables AS T
CROSS APPLY (
SELECT STUFF((
SELECT ','+QUOTENAME(name)
FROM sys.columns AS C
WHERE C.object_id = T.object_id
ORDER BY C.column_id
FOR XML PATH(''),TYPE
).value('.','NVARCHAR(MAX)'),1,1,'') AS column_list
)CL;
Results:
| SELECT_STATEMENT |
|------------------------------------------|
| SELECT [c1],[c2],[c3] FROM [dbo].[tbl1]; |
| SELECT [c4],[c5],[c6] FROM [dbo].[tbl2]; |
| SELECT [c7],[c8],[c9] FROM [dbo].[tbl3]; |
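On SQL Server 2017 and later, the STRING_AGG aggregate makes the FOR XML PATH trick unnecessary. A sketch of Query 4 rewritten with it (same example tables as above):

```sql
-- SQL Server 2017+ variant: STRING_AGG replaces the FOR XML PATH trick
SELECT 'SELECT ' +
       STRING_AGG(QUOTENAME(C.name), ',')
           WITHIN GROUP (ORDER BY C.column_id) +
       ' FROM ' +
       QUOTENAME(OBJECT_SCHEMA_NAME(T.object_id)) + '.' +
       QUOTENAME(OBJECT_NAME(T.object_id)) +
       ';' AS select_statement
FROM sys.tables AS T
JOIN sys.columns AS C
  ON C.object_id = T.object_id
GROUP BY T.object_id;
```

The WITHIN GROUP clause preserves the original column order, which the ORDER BY inside the FOR XML PATH subquery handled in the older version.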
This example works on SQL Server 2005 and later, assuming you are on the latest service pack.
There is no clean solution to achieve this with SQL Server 2000. In that case you need to go back to your cursor solution.
What might cause the big discrepancy in disk space?
Typically this is caused by rows that have been deleted. Heaps do not reclaim space for data pages emptied by DELETE
operations unless a table lock is taken at the time of the deletion. Even then, other factors like the possibility of row-overflow data, or an enabled row-versioning isolation level can prevent space being reclaimed.
Another factor that usually has a much smaller impact on space usage is that rows with variable-length columns that expand may not fit on the original heap page, in which case the row is forwarded to another page. Forwarded rows are more usually associated with scan performance issues than pure space usage, however.
Overall, tables that experience significant deletions (without the space being reused quickly by new rows) and/or variable column updates that do not fit in place are often better configured with a clustered index. Clustered tables usually deallocate empty pages very quickly.
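If converting to a clustered table is an option, or if you just want to reclaim the space once, both approaches can be sketched as follows (table, column, and index names here are illustrative, not from the question):

```sql
-- Option 1 (SQL Server 2008+): rebuild the heap in place,
-- which releases empty pages and removes forwarded records
ALTER TABLE dbo.YourHeap REBUILD;

-- Option 2: convert the heap to a clustered table so that
-- empty pages are deallocated promptly in the future
CREATE CLUSTERED INDEX CIX_YourHeap ON dbo.YourHeap (id);
```

The rebuild is a one-time fix; a heap that keeps experiencing large deletes will accumulate empty pages again.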
You can use the sys.dm_db_index_physical_stats
DMV to see physical details for a heap or index:
SELECT
DDIPS.index_id,
DDIPS.partition_number,
DDIPS.index_type_desc,
DDIPS.alloc_unit_type_desc,
DDIPS.avg_fragmentation_in_percent,
DDIPS.fragment_count,
DDIPS.avg_fragment_size_in_pages,
DDIPS.page_count,
DDIPS.avg_page_space_used_in_percent,
DDIPS.record_count,
DDIPS.avg_record_size_in_bytes,
DDIPS.forwarded_record_count
FROM sys.dm_db_index_physical_stats
(
DB_ID(),
OBJECT_ID(N'dbo.t1', N'U'),
0,
NULL,
'DETAILED'
) AS DDIPS;
The following (trivial) example shows a heap DELETE
not releasing any empty pages:
SET NOCOUNT ON;
CREATE TABLE t1 (c1 char(8000) DEFAULT 'a');
GO
INSERT INTO t1 DEFAULT VALUES;
GO 1000
-- 1000 pages allocated
SELECT
DDIPS.index_id,
DDIPS.partition_number,
DDIPS.page_count
FROM sys.dm_db_index_physical_stats
(
DB_ID(),
OBJECT_ID(N'dbo.t1', N'U'),
0,
NULL,
'DETAILED'
) AS DDIPS;
-- Delete all the data from the heap
DELETE FROM t1;
-- Still 1000 pages allocated
-- Even though the table is empty
SELECT
DDIPS.index_id,
DDIPS.partition_number,
DDIPS.page_count
FROM sys.dm_db_index_physical_stats
(
DB_ID(),
OBJECT_ID(N'dbo.t1', N'U'),
0,
NULL,
'DETAILED'
) AS DDIPS;
GO
DROP TABLE dbo.t1;
If you try the example again, but with a table lock (DELETE FROM t1 WITH (TABLOCK)), the deletion frees all the empty pages (assuming the database does not have the READ_COMMITTED_SNAPSHOT option set to ON, among other conditions).
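For reference, the table-locked variant of the delete from the example above is:

```sql
-- Taking a table lock allows the heap DELETE to deallocate its empty pages
DELETE FROM t1 WITH (TABLOCK);
```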
Best Answer
Partitioning does not sound like a good fit for your scenario. Partitioning isn't something to be taken lightly: it requires a lot of planning, and you'd have to rewrite 5 billion rows to convert the data. Instead, you should consider a UNION ALL view. Depending on your requirements, you could even make it a partitioned view. A view gives you an easy way to query your data without any data movement, and it's by far the simplest solution.
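As a sketch (the table and column names are assumptions, not from the question), a UNION ALL view over per-year tables could look like this:

```sql
-- Hypothetical per-year tables combined into one view; with CHECK
-- constraints on SaleDate in each base table, this can qualify
-- as a partitioned view
CREATE VIEW dbo.SalesAll
AS
SELECT SaleDate, Amount FROM dbo.Sales2022
UNION ALL
SELECT SaleDate, Amount FROM dbo.Sales2023
UNION ALL
SELECT SaleDate, Amount FROM dbo.Sales2024;
```

Queries against dbo.SalesAll then see all the data, while the underlying tables stay exactly where they are.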
In general, partitioning doesn't improve query performance. However, with columnstore tables it can be helpful if it gives you better rowgroup elimination on the partitioning column. That can be accomplished without partitioning, depending on how you build your columnstore tables. If you don't modify the old data, it should be relatively straightforward to build your columnstore in a way that maintains order.
Columnstore tables do support partitioning, if you do have a legitimate need for it at some point. It's easy to experiment with it. Simply try to create an empty, partitioned columnstore table.
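A minimal sketch of such an experiment, using SQL Server 2016+ syntax (the partition boundaries, names, and columns are illustrative):

```sql
-- Illustrative partition function and scheme on a date column
CREATE PARTITION FUNCTION pfYear (date)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME psYear
AS PARTITION pfYear ALL TO ([PRIMARY]);

-- Empty partitioned table with a clustered columnstore index
CREATE TABLE dbo.FactSales
(
    SaleDate date  NOT NULL,
    Amount   money NOT NULL,
    INDEX ccsi_FactSales CLUSTERED COLUMNSTORE
)
ON psYear (SaleDate);
```

If this creates without error, partitioned columnstore is available on your edition and you can test loading a small sample into it.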