Sql-server – Does SQL Server Snapshot replication completely copy the data every time or does it issue deltas

replicationsnapshotsql serversql-server-2008-r2

I am looking at snapshot replication between two servers. Here is what I have:

500GB database
~500MB nightly bcp loads
~50MB daily transactions

I was asking other DBAs in the company about which replication type to use. I was told to use Snapshot Replication. However, from what I understand and have been reading, every night after the load the snapshot will completely copy the database over to the distributor and then completely overwrite the other server.

Do snapshots work in deltas or is it a complete copy every time?

Best Answer

From Books On Line:

Snapshot replication distributes data exactly as it appears at a specific moment in time and does not monitor for updates to the data. When synchronization occurs, the entire snapshot is generated and sent to Subscribers.

Using snapshot replication by itself is most appropriate when one or more of the following is true:

Data changes infrequently.

It is acceptable to have copies of data that are out of date with respect to the Publisher for a period of time.

Replicating small volumes of data.

A large volume of changes occurs over a short period of time.

Snapshot replication is most appropriate when data changes are substantial but infrequent. For example, if a sales organization maintains a product price list and the prices are all updated at the same time once or twice each year, replicating the entire snapshot of data after it has changed is recommended. Given certain types of data, more frequent snapshots may also be appropriate. For example, if a relatively small table is updated at the Publisher during the day, but some latency is acceptable, changes can be delivered nightly as a snapshot.

Snapshot replication has a lower continuous overhead on the Publisher than transactional replication, because incremental changes are not tracked. However, if the dataset being replicated is very large, it will require substantial resources to generate and apply the snapshot. Consider the size of the entire data set and the frequency of changes to the data when evaluating whether to utilize snapshot replication.

Therefore, deltas are not generated, the entire database is snapshotted and replicated.

Related Solutions

Sql-server – ETL: extraction strategy for 200 source databases

If you have 200 identical sources then you can parameterise a SSIS package with the data source and kick off nultiple threads. These can be controlled within the package by a foreach loop or from an external source that kicks off the extractors with a parameter.

You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This would require you to have persistent dimensions, but this is fairly straightforward to do with MERGE operations, or a pre-load area and dimension handler if you need slowly-changing dimensions.

You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables. Build a set of views over the staging tables (essentially just set of unions that correspond to each of the source tables) that includes data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data then ETL process can read the whole lot from the view.

This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.

BCP

For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list. The SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it and update the corresponding status.

Implementing the 'next' function uses a sproc running in a serializable transaction that pops the 'next' eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts - a whole batch can be pushed into the table at the start of the run.

Structure the individual extract process so that it tries once or twice again if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if it fails twice, and structure the ETL so it is resilient to individual extraction failures.

Incremental loads

An incremental loader is probably not worth bothering for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application table with a timestamp column or some such, you can pick up stuff that's new. However, you will need to track this locally to record the last timestamp. If there is an insert or update date on the data you may be able to use that instead.

Full Loads

What could possibly go wrong?

200 processes kicking off to do a full load places a load spike on the network and possibly the staging database. This could lead to all sorts of transient issues like timeouts. For small dimension tables it's probably not such a big issue. However for 100GB there are quite a wide variety of issues - WAN saturation, locking (although the right staging architecture will mitigate that), availability of sources. The longer the extract process has to run the bigger influence environmental factors have on the reliability of the process.

There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.

Sql-server – Transactional Replication snapshot locking Publisher db from beginning

To get this to work I scheduled switching off the 'immediate sync' to very low traffic period

EXEC sp_changepublication  
@publication = 'statics',
@property = 'Immediate_sync',
@value = 'false'
GO

I then ran a snapshot which only took a fraction of the time at 6 mins.

I then had to drop the current subscription for that article only

    EXEC sp_dropsubscription
    @publication = 'statics',
    @article = 'dbo.SupplierCorporatePayAccountLink_tbl',
    @subscriber = 'SURVIVOR'

and rerun the snapshot. Then i could add the subscription back into the publication

    EXEC sp_addsubscription
    @publication = 'statics',
    @article = 'dbo.SupplierCorporatePayAccountLink_tbl',
    @subscriber = 'SURVIVOR',
    @destination_db = 'MLT-Replicated',
    @reserved='Internal'

but it would not allow this without the fairly obscure setting

@reserved ='Internal'

or I got an error message saying I needed to subscribe to all articles.... replication then automatically sync'd the new table. I'm a happy man as this has been causing me some long nights!

Best Answer

Related Solutions

Sql-server – ETL: extraction strategy for 200 source databases

Sql-server – Transactional Replication snapshot locking Publisher db from beginning

Related Question