Sql-server – How to partition a very large table with limited db space

partitioningsql serversql-server-2008

I have a very big table. I want to partition it, but I cannot.

Database size: 1TB, Free space 200GB

Table:

Size: 165 columns (row lengh 4216 KB, no LOBs), 500 million rows, 600GB of data.
Possible partitioning: a day per partition
Number of rows per day/parition: 2 million

In order to partition it, I need to create a clustered index. But creating parition requires free space of the same size as the table, and I do not have extra 600GB.

Are there any options how I partition this table?

EDIT 1:

I have tried copying data into separate table.

However when I try to DELETE (or INSERT) 1 day of data into another table, I get an error, that tranasaction log is full and my transaction is being rolled back. My transaction log is approx 20 GB and I cannot make it any bigger.

Best Answer

You'll need to create a new table with the same schema, but as a partitioned object. Optionally you can compress the table to save even more space. As you are only putting on average one row per page I'm not sure how much space savings you will see. I'd recommend putting a few thousand rows into the new table then compressing to see if the space savings is worth the CPU overhead.

As for how to move this much data without eating up all your drive space and without bloating your transaction log, that'll need to be done in a loop moving small amounts of data per run. You'll want to do some data analysis to see how large of a window you can process, but I'll assume that based on the data volumes you need to move the rows one minute at a time.

DECLARE @processFrom as datetime
SELECT @processFrom = min(YourDateColumn)
FROM YourTable

DECLARE TABLE @Rows (...)

WHILE EXISTS (SELECT * FROM YourTable)
BEGIN
     DELETE TOP (10000) FROM YourTable
     OUTPUT DELETED.* INTO @Rows
     WHERE YourDateColumn = @processFrom


     INSERT INTO NewTable
     (...)
     SELECT ...
     FROM @Rows

     DELETE FROM @Rows

     IF @@ROWCOUNT = 0
              SET @processFrom = dateadd(dd, 1, @processFrom)
END

When everything is done and you've verified that all the data is in the new table drop the old table and rename the new table so that it has the old tables name. This way nothing breaks. You'll want to script out the permissions on the old table (if there are any) so that you can apply them to the new table.

If there are tables with foreign keys to this table you'll need to drop them before this will work.

Related Solutions

Sql-server – How to manage a large SQL Server database: transaction log, reduce size, alter table

Points, in no particular order:

You do need the transaction log for a whole bunch of reasons I'm not going to go into. That said, if you're running out of transaction log space while you're in simple recovery mode, that probably means you have long-running transactions. Make your transactions as short as possible (gather all required user input before you start a transaction). (When SQL Server is trying to free log space, it can only free up to the start of the oldest current transaction.)
Making NVARCHAR(MAX) columns nullable will not necessarily save you much space (off the top of my head it'll be 2 bytes per row, per column).
Deleting/archiving old data will help you - with a caveat that it has to pass through the transaction log first.
Shrinking your database will not help you, as a matter of fact it will kill performance (both while it's running and afterward, since DBCC SHRINKDATABASE uses a naive algorithm that almost perfectly fragments your indexes) and won't regain you much if any space.

Sql-server – Partitioning / indexing an extremely large table

My approach is to disable all the nonclustered indexes [...] then build a clustered index for the table using this partition scheme.

Creating a clustered index on a heap automatically rebuilds all nonclustered indexes (even disabled ones). The nonclustered indexes are rebuilt but not partitioned. Assuming the desired end state is a partitioned clustered table with aligned indexes, rebuilding the nonclustered indexes to be non-aligned is entirely wasted effort.

What worries me slightly is that tempdb is now pushing nearly 1 TB and steadily climbing, despite the current table being around half that size. The MS docs I've read suggest the tempdb space usage should be about the size of the final table/clustered index.

The question of sort space is very complex. To understand all the details (including the effect of parallelism) you would need to carefully read an entire series of posts by the SQL Server Query Processing Team. Converting a heap to a partitioned clustered table with parallelism enabled is probably pretty close to worst case.

At it's most basic (neglecting most of the important information in the QP Team's posts), you are asking SQL Server to run a query like:

SELECT *
FROM DailyTable
ORDER BY
    $partition.monthly_on_primary(LoadDate),
    LoadDate,
    SeqNumber;

This query will not execute quickly, regardless of where you choose to write the sort runs that do not fit in memory to. Add to that the work of actually building a complete new copy of the entire data set in separate rowsets, and the work involved in rebuilding the nonclustered indexes pointlessly...

Advice

There are many considerations in getting this change to work efficiently. The important ones are to avoid sorting at all where possible, and to use parallel minimally-logged bulk load wherever possible.

The details of that depend on details not contained in the question, and a full solution is beyond an answer here. Nevertheless, the outline of an approach that has worked well for me personally in the past is:

Extract the existing data using bcp to one file per final partition
Drop the existing table and create the new one
Load the new table using parallel minimally-logged bulk load

The per-partition data extract needs be ordered on (LoadDate, SeqNumber). Ideally, you would avoid a sorting operation. If you have an existing nonclustered index on (LoadDate, SeqNumber) you can extract data in the right order without sorting if you construct the query correctly.

Once per-partition data has been extracted to separate files (this can be done in parallel if your hardware is up to it), the source table can then be dropped, freeing space. A new partitioned heap or clustered table is then created, and bulk loaded with the pre-sorted data, possibly also in parallel.

Done right, the entire process requires no more than 1x data size and achieves the fastest possible data transfer rates in both directions, with the least amount of log use.

Best Answer

Related Solutions

Sql-server – How to manage a large SQL Server database: transaction log, reduce size, alter table

Sql-server – Partitioning / indexing an extremely large table

Advice

Related Question