I have a production database with data I need to copy to another location, either another database on the same server or a database on another server.
The database is huge and I don't need all of its data; for instance, in one of the tables with ~300 million records, I need only ~60 million of them.
The issue is that it might not be possible to restore the entire database on the same or another server because of space restrictions, so we might be forced to copy only what is needed.
The other restriction is that these tables are used by production machines, so they need to be available 100% of the time; I can't block any read or write access to them.
So, what would be the best way to copy partial data from one database to another?
- Run a query that inserts N records at a time until it is done? Something like INSERT INTO … SELECT TOP N … FROM … WITH (NOLOCK) WHERE …
- Use the Import and Export Data wizard? Will this tool block the source tables while running?
Best Answer
This depends a lot on your goals / requirements for the data in the destination. Specifically, you need to weigh the pros and cons of blocking versus data correctness.
Avoid Blocking — READ COMMITTED with SELECT...INTO
For instance, the most naive solution, under the default READ COMMITTED isolation level, would solve your problem with very little blocking on the source server: a SELECT...INTO statement to copy the desired 60 million rows. Note: you'll need to specify the linked server name in the FROM clause if the linked server approach is used.

Shared locks will be taken and released as pages are scanned in the clustered index (or heap, as the case may be), so there will be brief moments of blocking on specific pages, but nothing that should be especially noticeable for your production system. Of course, there might be other points of resource contention (storage throughput, for instance).
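A minimal sketch of that approach; the table, column, and linked server names here are hypothetical stand-ins for your own:

```sql
-- Copy the desired subset into a new table on the destination.
-- SELECT...INTO creates dbo.BigTable_Subset automatically.
SELECT *
INTO dbo.BigTable_Subset
FROM SourceServer.SourceDb.dbo.BigTable   -- four-part name if using a linked server
WHERE SomeFilterColumn >= '2020-01-01';   -- whatever identifies your ~60 million rows
```

This runs under the default READ COMMITTED isolation level unless you explicitly change it.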
The main problem with this approach is that it makes essentially no guarantees about the correctness of the data that arrives at the destination. There's a Great Post™ by Paul White about The Read Committed Isolation Level that you can check out for the details, but the core issue is that this doesn't guarantee a point-in-time view of the data.
In other words, DML statements (INSERTs, UPDATEs, and DELETEs) can successfully execute during the scan, which could result in rows being read more than once, rows being missed entirely, etc.
A potential problem on the destination side is log file growth. If the size (not just the number of rows) of the data being moved is very large, you may need to batch the inserts rather than doing them all in one go.
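One way to batch, sketched with hypothetical names; it assumes BigTable has an ascending integer key Id, and copies in chunks so each transaction (and therefore each burst of log activity) stays small:

```sql
-- Batched copy: names are hypothetical; assumes an ascending integer key Id.
DECLARE @LastId bigint = 0,
        @BatchSize int = 50000,
        @Rows int = 1;

WHILE @Rows > 0
BEGIN
    INSERT INTO dbo.BigTable_Subset (Id, Col1, Col2)
    SELECT TOP (@BatchSize) Id, Col1, Col2
    FROM dbo.BigTable
    WHERE Id > @LastId
      AND SomeFilterColumn >= '2020-01-01'
    ORDER BY Id;                 -- deterministic order so batches don't overlap

    SET @Rows = @@ROWCOUNT;

    SELECT @LastId = MAX(Id) FROM dbo.BigTable_Subset;
END;
```

If the destination database uses the SIMPLE or BULK_LOGGED recovery model, smaller batches also let log space be reused between batches instead of growing the file.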
Favor Correctness - More Restrictive Isolation Levels
If you want the data that arrives at the destination to be consistent with a specific point in time, you can use the SNAPSHOT or SERIALIZABLE isolation level for the entire insert process (the actual insert could be the same as, or similar to, the one described above).

The SNAPSHOT isolation level is a viable option. It will increase tempdb usage, since row versions are maintained to provide a consistent view of the database, so you need to make sure your system can handle that workload. It will also cause some blocking (more than READ COMMITTED, but less than the SERIALIZABLE approach). How much this affects your production workload is something you'll have to test.
The SERIALIZABLE isolation level will be the most disruptive to your production workload, as it will block all access to the table until the transaction is complete. That doesn't meet your requirements, so it's not really a viable option.

Correctness AND No Blocking - Replication / Backups
If you're able to restore a full database backup somewhere, that's the ideal solution. You get a point-in-time view of the database, without contention, from which you can then copy the needed rows to the destination. Since you mentioned that database size / space is an issue, I imagine this isn't a viable option for you.
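If space did permit, the restore-then-copy approach might look like this (paths, logical file names, and table names are all hypothetical):

```sql
-- Restore a copy of the production backup under a new name:
RESTORE DATABASE SourceDb_Copy
FROM DISK = N'\\backupshare\SourceDb_Full.bak'
WITH MOVE N'SourceDb'     TO N'D:\Data\SourceDb_Copy.mdf',
     MOVE N'SourceDb_log' TO N'D:\Log\SourceDb_Copy.ldf';

-- Copy just the needed rows out of the restored copy; production is untouched.
SELECT *
INTO DestDb.dbo.BigTable_Subset
FROM SourceDb_Copy.dbo.BigTable
WHERE SomeFilterColumn >= '2020-01-01';
```

The MOVE clauses assume the backup's logical file names are SourceDb and SourceDb_log; check yours with RESTORE FILELISTONLY.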
Given the special requirements of your situation, you might want to consider buying a third-party tool that supports "object level restores." This would allow you to restore just the needed table to the destination directly from a backup. I think Quest Litespeed supports this, although it's hard to tell from behind the paywalls on their site.
Transactional replication might also be a viable option if you need the data to be continuously up to date, although version restrictions might prevent you from leveraging it. Additionally, the setup and maintenance overhead might be impractical. I haven't used transactional replication in practice, so I'm just mentioning it as a possibility.
About NOLOCK
Don't use that.