I have a production database with data I need to copy to another location, either another database on the same server or a database on another server.
The database is huge and I don't need all of its data; for instance, in one of the tables with ~300 million records, I need only ~60 million of them.
The issue is that it might not be possible to restore the entire database on the same or another server because of space restrictions, so we might be forced to copy only what is needed.
The other restriction is that these tables are used by production machines, so they need to be available 100% of the time; I can't block any read or write access to them.
So, what would be the best way to copy partial data from one database to another?
- Run a query that inserts N records at a time until it is done? Something like INSERT INTO … SELECT TOP N … FROM … WITH (NOLOCK) WHERE …
- Use the Import and Export Data wizard? Will this tool block the source tables while running?
Best Answer
This depends a lot on your goals / requirements for the data in the destination. Specifically, you need to weigh the pros and cons of blocking versus data correctness.
Avoid Blocking — READ COMMITTED with SELECT...INTO
For instance, the most naive solution, under the default READ COMMITTED isolation level, would solve your problem with very little blocking on the source server: a SELECT...INTO statement to copy the desired 60 million rows. Note: you'll need to specify the linked server name in the FROM clause if the linked server approach is used.

Shared locks will be taken and released as pages are scanned in the clustered index (or heap, as the case may be), so there will be brief moments of blocking on specific pages, but nothing that should be especially noticeable for your production system. Of course, there might be other points of resource contention (storage throughput, for instance).
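A minimal sketch of that approach; the table, column, and linked server names here are hypothetical stand-ins for your own:

```sql
-- Copy the desired subset into a new table on the destination.
-- SELECT...INTO creates dbo.BigTable_Subset automatically.
SELECT *
INTO dbo.BigTable_Subset
FROM SourceServer.SourceDb.dbo.BigTable   -- four-part name if using a linked server
WHERE SomeFilterColumn >= '2020-01-01';   -- whatever identifies your ~60 million rows
```

This runs under the default READ COMMITTED isolation level unless you explicitly change it.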
The main problem with this approach is that it makes essentially no guarantees about the correctness of the data that arrives at the destination. There's a Great Post™ by Paul White about The Read Committed Isolation Level that you can check out for the details, but the core issue is that this doesn't guarantee a point-in-time view of the data.
In other words, DML statements (INSERTs, UPDATEs, and DELETEs) can successfully execute during the scan, which could result in rows being read more than once, rows being missed entirely, etc.
A potential problem on the destination side is log file growth. If the size (not just the number of rows) of the data being moved is very large, you may need to batch the inserts rather than doing them all in one go.
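One way to batch, sketched with hypothetical names; it assumes BigTable has an ascending integer key Id, and copies in chunks so each transaction (and therefore each burst of log activity) stays small:

```sql
-- Batched copy: names are hypothetical; assumes an ascending integer key Id.
DECLARE @LastId bigint = 0,
        @BatchSize int = 50000,
        @Rows int = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.BigTable_Subset (Id, Col1, Col2)
    SELECT TOP (@BatchSize) Id, Col1, Col2
    FROM dbo.BigTable
    WHERE Id > @LastId
      AND SomeFilterColumn >= '2020-01-01'
    ORDER BY Id;                 -- deterministic order so batches don't overlap

    SET @Rows = @@ROWCOUNT;

    SELECT @LastId = MAX(Id) FROM dbo.BigTable_Subset;
END;
```

If the destination database uses the SIMPLE or BULK_LOGGED recovery model, smaller batches also let log space be reused between batches instead of growing the file.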
Favor Correctness - More Restrictive Isolation Levels
If you want the data that arrives at the destination to be consistent with a specific point in time, you can use the SNAPSHOT or SERIALIZABLE isolation level for the entire insert process (the actual insert could be the same as, or similar to, the one described above).

The SNAPSHOT isolation level is a viable option. It will increase tempdb usage, since row versions are maintained to provide a consistent view of the database, so you need to make sure your system can handle that workload. It will also cause some blocking (more than READ COMMITTED, but less than the SERIALIZABLE approach). How much this affects your production workload is something you'll have to test.
The SERIALIZABLE isolation level will be the most disruptive to your production workload, as it will block all access to the table until the transaction is complete. That doesn't meet your requirements, so it's not really a viable option.

Correctness AND No Blocking - Replication / Backups
If you're able to restore a full database backup somewhere, that's the ideal solution. You get a point-in-time view of the database, without contention, from which you can then copy the needed rows to the destination. Since you mentioned that database size / space is an issue, I imagine this isn't a viable option for you.
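If space did permit, the restore-then-copy approach might look like this (paths, logical file names, and table names are all hypothetical):

```sql
-- Restore a copy of the production backup under a new name:
RESTORE DATABASE SourceDb_Copy
FROM DISK = N'\\backupshare\SourceDb_Full.bak'
WITH MOVE N'SourceDb'     TO N'D:\Data\SourceDb_Copy.mdf',
     MOVE N'SourceDb_log' TO N'D:\Log\SourceDb_Copy.ldf';

-- Copy just the needed rows out of the restored copy; production is untouched.
SELECT *
INTO DestDb.dbo.BigTable_Subset
FROM SourceDb_Copy.dbo.BigTable
WHERE SomeFilterColumn >= '2020-01-01';
```

The MOVE clauses assume the backup's logical file names are SourceDb and SourceDb_log; check yours with RESTORE FILELISTONLY.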
Given the special requirements of your situation, you might want to consider buying a third-party tool that supports "object level restores." This would allow you to restore just the needed table to the destination directly from a backup. I think Quest Litespeed supports this, although it's hard to tell from behind the paywalls on their site.
Transactional replication might also be a viable option if you need the data to be continuously up to date, although version restrictions might prevent you from leveraging it. Additionally, the setup and maintenance overhead might be impractical. I haven't used transactional replication in practice, so I'm just mentioning it as a possibility.
About NOLOCK
Don't use that.