SQL Server – Fastest Way to Move Data to New Database

Tags: export, migration, sql-server

This is my first question and post so let me know if I can improve. I'll err on the side of providing too much information.

I'm a fairly new DBA at my company, less than two years with the DBA team, and have been tasked with coming up with a way to move data from an existing database for one specific entity to a new database which does not yet exist. The database is on an instance running SQL Server 2012 SP1 Standard edition.

We currently have a custom PowerShell cmdlet called purge-entitydata. This cmdlet deletes from all 160 user tables in the current database based on the entity value.

Example

purge-entitydata -entity <Entity>

However, before we delete all of the entity's data in the current database, we need to move only that entity's data to a new temporary database called [Entity_Backup] in case a downstream process fails.

Here are some caveats:

  1. The 160 user tables may change several times a year. Some may be deleted or new ones may be added. I was thinking of going the SSIS route but didn't want to have to update the package with a new data flow task every time a table is added or removed.

  2. Just the data and the tables themselves need to be moved. I won't need to move views, stored procedures, or indexes from the current database to temporary database.

  3. The current database and user tables contain data for over a thousand entities. I just need to move the data for one entity over to the temporary database.

  4. If the downstream process succeeds, then the temporary database will get dropped.

  5. The database is large, over 200 GB. Also, this server is the publisher for a pull merge replication subscription and is also the publisher for two transaction replication subscriptions. I'm thinking that taking a backup would slow replication too much.

I was thinking of using sp_MSforeachtable 'SELECT * INTO [Entity_Backup].? FROM ? WHERE Entity = $Entity' and Invoke-Sqlcmd but am worried this would take too long since it does not run in parallel. An entity has approximately three to five million rows.

What would be the most efficient/fastest way to move data from the 160 user tables to a new database, which can be done from inside the PowerShell cmdlet? Does anyone have any suggestions which may be faster than sp_MSforeachtable?

Best Answer

You say that you're worried that your proposed solution won't work because it won't run in parallel. You already have a PowerShell cmdlet which deletes from all 160 tables, right? Deleting rows can often take longer than inserting them, especially if you have indexes on the table that you're deleting from. If that cmdlet doesn't run in parallel then why do you need the insert commands to run in parallel? I recommend starting with a relatively simple solution, doing performance testing, and only making it more complex if necessary.

I'm not familiar with the restrictions of what can be run in PowerShell. However, it appears to be possible to execute a stored procedure from PowerShell, so I will provide a T-SQL solution. You should be able to easily translate that into what you need, but I may refer to the code as a "stored procedure".

If your database is about 200 GB and you have about 1,000 entities then on average you only need to move around 200 MB of data per entity. That's definitely doable if you have suitable indexes on the tables (such as one on the Entity column). I'm going to assume that you already did that analysis and you have the indexes that you need to speed up the SELECT statements.
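If any tables turn out to be missing such an index, something along these lines could help (just a sketch; the table and column names are placeholders for your schema):

CREATE NONCLUSTERED INDEX IX_Entity_Table_1_Entity
    ON dbo.[Entity_Table_1] (Entity);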

1) Create the database for the inserts

I recommend that you create your temporary database with a recovery model of simple. That will reduce data written to the transaction log for minimally logged queries. I also see no advantage in using a different recovery model because the data will be so short-lived. I also recommend that you create a database with the right settings and just keep it around. Grow the data and log files to appropriate sizes to prevent auto-growth events. You can just drop all of the tables in the database as they are no longer needed. Do you really need to go through the work each time of creating and dropping the database?
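A rough sketch of that scratch database, using the [Entity_Backup] name from your question (the file paths and sizes are placeholders you would tune to your environment):

CREATE DATABASE [Entity_Backup]
ON PRIMARY
    (NAME = N'Entity_Backup_data',
     FILENAME = N'D:\Data\Entity_Backup.mdf',
     SIZE = 4GB, FILEGROWTH = 1GB)
LOG ON
    (NAME = N'Entity_Backup_log',
     FILENAME = N'L:\Log\Entity_Backup.ldf',
     SIZE = 1GB, FILEGROWTH = 512MB);

ALTER DATABASE [Entity_Backup] SET RECOVERY SIMPLE;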

2) Figure out your DML strategy

I can think of three different ways to insert data into the temporary database. Two of them require changes to your delete cmdlet.

All of them insert into heaps and take advantage of minimal logging. Since you just need the data and nothing else, that will minimize the overhead of doing the inserts. On SQL Server 2012 you cannot do parallel inserts; you can do them if you upgrade to SQL Server 2014 and use SELECT INTO.

The first method uses the OUTPUT clause to delete the rows from the table and insert the deleted rows in a single query. This will likely be the fastest way because you only need to read the underlying data once per table. You can't use SELECT INTO for the insert statement (but you can run one beforehand to create the target table). Your query could look something like this:

-- delete the entity's rows and insert them into the scratch database in one statement
-- (@Entity is the entity value your cmdlet filters on)
INSERT INTO [other_db]..[Entity_Table_1] WITH (TABLOCK)
SELECT t.*
FROM 
(
    DELETE FROM [source_db]..[Entity_Table_1]
    OUTPUT deleted.*
    WHERE Entity = @Entity
) t;

The second method is similar to the first but you process the insert query for a table right before the delete query. The insert query will in theory move the data that you need from the table into the buffer cache. That will reduce the number of trips to disk that your code needs, so it will run faster. This will require changing the code of your delete cmdlet.

The third method is just doing inserts. For this method you won't need to modify the delete cmdlet. If you don't use SELECT INTO you'll want to use the TABLOCK hint to get minimal logging.
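For one table, the insert for the second and third methods could look something like this (again a sketch: @Entity and the Entity column come from your cmdlet's filter, and the empty target heap is assumed to exist already in the scratch database):

INSERT INTO [Entity_Backup]..[Entity_Table_1] WITH (TABLOCK)
SELECT *
FROM [source_db]..[Entity_Table_1]
WHERE Entity = @Entity;

If you would rather create the target table as you go, SELECT * INTO [Entity_Backup]..[Entity_Table_1] FROM ... WHERE Entity = @Entity gives you the heap and minimal logging without the hint.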

3) Figure out the method and order of looping

I don't recommend using sp_MSforeachtable. It's undocumented and could be changed or removed by Microsoft at any time. Instead, you could use a cursor to loop over relevant tables from sys.tables. There should be plenty of examples on how to do that online, but let me know if you need further help.
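As a sketch of what that cursor could look like (assuming all of the tables live in dbo and share the Entity column, that the empty target tables already exist in [Entity_Backup], and that @Entity would really arrive as a procedure parameter rather than the placeholder value below):

DECLARE @Entity int = 123;   -- placeholder; in practice a parameter of the procedure
DECLARE @TableName sysname, @Sql nvarchar(max);

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.tables
    WHERE is_ms_shipped = 0
    ORDER BY name;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @TableName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- build and run one minimally logged insert per table
    SET @Sql = N'INSERT INTO [Entity_Backup].dbo.' + QUOTENAME(@TableName)
             + N' WITH (TABLOCK) SELECT * FROM dbo.' + QUOTENAME(@TableName)
             + N' WHERE Entity = @Entity;';
    EXEC sys.sp_executesql @Sql, N'@Entity int', @Entity = @Entity;

    FETCH NEXT FROM table_cursor INTO @TableName;
END;

CLOSE table_cursor;
DEALLOCATE table_cursor;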

The order that you process the tables may matter. For example, suppose that you need to have separate procedures for the delete statements and the insert statements. If you process the tables in the same order in both procedures then it's possible that the data for the current table will be pushed out of the buffer cache before you process it again. That's because you're querying 159 other tables before you query the same table again. If you instead process the tables in the opposite order to the one you delete from them in, you may reduce physical reads when reading the data.

4) Implement parallelism if necessary

After testing you may find that your code is too slow and that you need to run multiple concurrent insert statements. For one way to do this, first make the stored procedure always loop over the tables in a deterministic order. Add two input parameters to the procedure. The first should be the total number of threads and the second should be the thread for that procedure call. If you called the procedure with 4 total threads and the active thread as 1 then that procedure would process the 1st, 5th, 9th, 13th, ... tables. For N threads you could open N PowerShell windows and change the second parameter for each.
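As a sketch, the table list for one thread could be picked with a modulo over a deterministic row number (the @TotalThreads and @ThreadId names are placeholders for the two new parameters):

DECLARE @TotalThreads int = 4,   -- parameter: total number of threads
        @ThreadId     int = 1;   -- parameter: 1-based thread number for this call

SELECT t.name
FROM
(
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY name) AS rn
    FROM sys.tables
    WHERE is_ms_shipped = 0
) AS t
WHERE (t.rn - 1) % @TotalThreads = @ThreadId - 1
ORDER BY t.rn;

Each PowerShell window would call the procedure with a different @ThreadId, so between them the threads cover all 160 tables exactly once.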

Happy inserting!