SQL Server – Fastest Way to Move Data to New Database

Tags: export, migration, sql-server

This is my first question and post so let me know if I can improve. I'll err on the side of providing too much information.

I'm a fairly new DBA at my company, less than two years with the DBA team, and have been tasked with coming up with a way to move data from an existing database for one specific entity to a new database which does not yet exist. The database is on an instance running SQL Server 2012 SP1 Standard edition.

We currently have a custom PowerShell cmdlet called purge-entitydata. This cmdlet deletes from all 160 user tables in the current database based on the entity value.

Example

purge-entitydata -entity <Entity>

However, before we delete all of the entity's data in the current database, we need to move only that entity's data to a new temporary database called [Entity_Backup] in case a downstream process fails.

Here are some caveats:

  1. The 160 user tables may change several times a year. Some may be deleted or new ones may be added. I was thinking of going the SSIS route but didn't want to have to update the package with a new data flow task every time a table is added or removed.

  2. Just the data and the tables themselves need to be moved. I won't need to move views, stored procedures, or indexes from the current database to temporary database.

  3. The current database and user tables contain data for over a thousand entities. I just need to move the data for one entity over to the temporary database.

  4. If the downstream process succeeds, then the temporary database will get dropped.

  5. The database is large, over 200 GB. Also, this server is the publisher for a pull merge replication subscription and is also the publisher for two transaction replication subscriptions. I'm thinking that taking a backup would slow replication too much.

I was thinking of using sp_MSforeachtable 'SELECT * INTO [Entity_Backup].? FROM ? WHERE Entity = $Entity' and Invoke-Sqlcmd but am worried this would take too long since it does not run in parallel. An entity has approximately three to five million rows.

What would be the most efficient/fastest way to move data from the 160 user tables to a new database, which can be done from inside the PowerShell cmdlet? Does anyone have any suggestions which may be faster than sp_MSforeachtable?

Best Answer

You say that you're worried that your proposed solution won't work because it won't run in parallel. You already have a PowerShell cmdlet which deletes from all 160 tables, right? Deleting rows can often take longer than inserting them, especially if you have indexes on the table that you're deleting from. If that cmdlet doesn't run in parallel then why do you need the insert commands to run in parallel? I recommend starting with a relatively simple solution, doing performance testing, and only making it more complex if necessary.

I'm not familiar with the restrictions of what can be run in PowerShell. However, it appears to be possible to execute a stored procedure from PowerShell, so I will provide a T-SQL solution. You should be able to easily translate that into what you need, but I may refer to the code as a "stored procedure".

If your database is about 200 GB and you have about 1,000 entities then on average you only need to move around 200 MB of data per entity. That's definitely doable if you have suitable indexes on the tables (such as one on the Entity column). I'm going to assume that you already did that analysis and you have the indexes that you need to speed up the SELECT statements.
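If any tables turn out to be missing such an index, something along these lines could help (just a sketch; the table and column names are placeholders for your schema):

CREATE NONCLUSTERED INDEX IX_Entity_Table_1_Entity
    ON dbo.[Entity_Table_1] (Entity);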

1) Create the database for the inserts

I recommend that you create your temporary database with a recovery model of simple. That will reduce data written to the transaction log for minimally logged queries. I also see no advantage in using a different recovery model because the data will be so short-lived. I also recommend that you create a database with the right settings and just keep it around. Grow the data and log files to appropriate sizes to prevent auto-growth events. You can just drop all of the tables in the database as they are no longer needed. Do you really need to go through the work each time of creating and dropping the database?
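A rough sketch of that scratch database, using the [Entity_Backup] name from your question (the file paths and sizes are placeholders you would tune to your environment):

CREATE DATABASE [Entity_Backup]
ON PRIMARY
    (NAME = N'Entity_Backup_data',
     FILENAME = N'D:\Data\Entity_Backup.mdf',
     SIZE = 4GB, FILEGROWTH = 1GB)
LOG ON
    (NAME = N'Entity_Backup_log',
     FILENAME = N'L:\Log\Entity_Backup.ldf',
     SIZE = 1GB, FILEGROWTH = 512MB);

ALTER DATABASE [Entity_Backup] SET RECOVERY SIMPLE;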

2) Figure out your DML strategy

I can think of three different ways to insert data into the temporary database. Two of them require changes to your delete cmdlet.

All of them insert into heaps and take advantage of minimal logging. Since you just need the data and nothing else, that will minimize the overhead of doing the inserts. On SQL Server 2012 you cannot do parallel inserts; you can do them if you upgrade to SQL Server 2014 and use SELECT INTO.

The first method uses the OUTPUT clause to delete the rows from the table and insert the deleted rows in a single query. This will likely be the fastest way because you only need to read the underlying data once per table. You can't use SELECT INTO for the insert statement (but you can run one beforehand to create the target table). Your query could look something like this:

-- delete the entity's rows and insert them into the scratch database in one statement
-- (@Entity is the entity value your cmdlet filters on)
INSERT INTO [other_db]..[Entity_Table_1] WITH (TABLOCK)
SELECT t.*
FROM 
(
    DELETE FROM [source_db]..[Entity_Table_1]
    OUTPUT deleted.*
    WHERE Entity = @Entity
) t;

The second method is similar to the first but you process the insert query for a table right before the delete query. The insert query will in theory move the data that you need from the table into the buffer cache. That will reduce the number of trips to disk that your code needs, so it will run faster. This will require changing the code of your delete cmdlet.

The third method is just doing inserts. For this method you won't need to modify the delete cmdlet. If you don't use SELECT INTO you'll want to use the TABLOCK hint to get minimal logging.
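For one table, the insert for the second and third methods could look something like this (again a sketch: @Entity and the Entity column come from your cmdlet's filter, and the empty target heap is assumed to exist already in the scratch database):

INSERT INTO [Entity_Backup]..[Entity_Table_1] WITH (TABLOCK)
SELECT *
FROM [source_db]..[Entity_Table_1]
WHERE Entity = @Entity;

If you would rather create the target table as you go, SELECT * INTO [Entity_Backup]..[Entity_Table_1] FROM ... WHERE Entity = @Entity gives you the heap and minimal logging without the hint.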

3) Figure out the method and order of looping

I don't recommend using sp_MSforeachtable. It's undocumented and could be changed or removed by Microsoft at any time. Instead, you could use a cursor to loop over relevant tables from sys.tables. There should be plenty of examples on how to do that online, but let me know if you need further help.
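As a sketch of what that cursor could look like (assuming all of the tables live in dbo and share the Entity column, that the empty target tables already exist in [Entity_Backup], and that @Entity would really arrive as a procedure parameter rather than the placeholder value below):

DECLARE @Entity int = 123;   -- placeholder; in practice a parameter of the procedure
DECLARE @TableName sysname, @Sql nvarchar(max);

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.tables
    WHERE is_ms_shipped = 0
    ORDER BY name;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @TableName;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- build and run one minimally logged insert per table
    SET @Sql = N'INSERT INTO [Entity_Backup].dbo.' + QUOTENAME(@TableName)
             + N' WITH (TABLOCK) SELECT * FROM dbo.' + QUOTENAME(@TableName)
             + N' WHERE Entity = @Entity;';
    EXEC sys.sp_executesql @Sql, N'@Entity int', @Entity = @Entity;

    FETCH NEXT FROM table_cursor INTO @TableName;
END;

CLOSE table_cursor;
DEALLOCATE table_cursor;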

The order that you process the tables may matter. For example, suppose that you need to have separate procedures for the delete statements and the insert statements. If you process the tables in the same order in both procedures then it's possible that the data for the current table will be pushed out of the buffer cache before you process it again. That's because you're querying 159 other tables before you query the same table again. If you instead process the tables in the opposite order to the one you delete from them in, you may reduce physical reads when reading the data.

4) Implement parallelism if necessary

After testing you may find that your code is too slow and that you need to run multiple concurrent insert statements. For one way to do this, first make the stored procedure always loop over the tables in a deterministic order. Add two input parameters to the procedure. The first should be the total number of threads and the second should be the thread for that procedure call. If you called the procedure with 4 total threads and the active thread as 1 then that procedure would process the 1st, 5th, 9th, 13th, ... tables. For N threads you could open N PowerShell windows and change the second parameter for each.
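As a sketch, the table list for one thread could be picked with a modulo over a deterministic row number (the @TotalThreads and @ThreadId names are placeholders for the two new parameters):

DECLARE @TotalThreads int = 4,   -- parameter: total number of threads
        @ThreadId     int = 1;   -- parameter: 1-based thread number for this call

SELECT t.name
FROM
(
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY name) AS rn
    FROM sys.tables
    WHERE is_ms_shipped = 0
) AS t
WHERE (t.rn - 1) % @TotalThreads = @ThreadId - 1
ORDER BY t.rn;

Each PowerShell window would call the procedure with a different @ThreadId, so between them the threads cover all 160 tables exactly once.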

Happy inserting!