SQL Server – how to mirror an Azure storage account to a non-paired region to enable fast FILE_SNAPSHOT backup/restore and database file access

azure, azure-blob-storage, sql-server, sql-server-2019

Wow, that title was a mouthful.

Scenario

My dev server is restoring a copy of production data daily, anonymizing it, and creating a dev-grade backup for my developers to restore locally. We're running SQL 2019.
Unfortunately, my servers are all in Azure East US, but almost all of my company is based in Spain.

Until now, the server has produced a standard backup that a central server here in Spain downloads, so devs can restore without going to the cloud. Restores on developers' laptops are taking up to 180 minutes to complete.

This is where BACKUP ... WITH FILE_SNAPSHOT comes in. In a few words, it makes backups and restores super fast by storing data directly in Azure storage accounts: the backup takes blob snapshots of the database files rather than copying them. There's more to it, but that's not the point here – read more here
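For reference, a minimal sketch of such a backup. This assumes the database's data and log files already live as blobs in an Azure storage account (a prerequisite for FILE_SNAPSHOT) and that a SAS credential for the container URL exists; the database, account, and container names are placeholders:

```sql
-- Minimal FILE_SNAPSHOT backup sketch; "ProdCopy" and the storage
-- account/container are placeholder names, not the real environment.
-- The backup writes a small .bak and takes snapshots of the file blobs.
BACKUP DATABASE ProdCopy
TO URL = 'https://mystorageeus.blob.core.windows.net/backups/ProdCopy.bak'
WITH FILE_SNAPSHOT;
```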

So far so good, but…

Issues

  • Restoring near the server will be fast, but will force developers to traverse the Atlantic every time they read from what should be a local database.
  • Restoring the snapshot from East US to a storage account in West Europe seems to copy the snapshot over (which is logical), but this means there's no restore performance benefit.
  • My devs are really keen on restoring daily, or weekly at most, to ensure they're on par with production database releases and can troubleshoot data issues that get escalated to the engineering team.
  • Manually copying files from East US to West Europe is not a good option, since it requires daily attention from someone.

What I tried

  • Restoring a file_snapshot backup from our East US storage account to a temporary West Europe storage account
  • Restoring a file_snapshot backup within our East US storage account (performance was awful for large queries due, at least in part, to network latency)
  • Setting the storage account to RA-GRS; when I tested it, it would only pair with West US – this was meant to speed up normal restores, but it's still worth noting.

What I want to accomplish

I want my devs to be able to restore quickly, using the FILE_SNAPSHOT technology, from the nearest storage location. I'll need them to be able to write there using SAS keys. To make those restores fast, I'm assuming the storage account must be mirrored with a copy that sits here in West Europe, but I don't know how to accomplish this.
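For context, the SAS-based access I have in mind would look roughly like this. The credential name must be the container URL itself, and every name and the SAS token below are placeholders:

```sql
-- Placeholder storage account, container, database, and SAS token.
-- SQL Server picks this credential automatically when the backup or
-- restore URL matches the credential name.
CREATE CREDENTIAL [https://mystorageweu.blob.core.windows.net/devbackups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = 'sv=2019-02-02&ss=b&srt=co&sp=rwdl&sig=<redacted>';

-- A dev could then restore, placing the files back into blob storage
-- so the restored database can itself use FILE_SNAPSHOT later:
RESTORE DATABASE DevCopy
FROM URL = 'https://mystorageweu.blob.core.windows.net/devbackups/DevCopy.bak'
WITH MOVE 'DevCopy_Data'
       TO 'https://mystorageweu.blob.core.windows.net/devdata/DevCopy.mdf',
     MOVE 'DevCopy_Log'
       TO 'https://mystorageweu.blob.core.windows.net/devdata/DevCopy.ldf';
```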

My plan

Currently, I'm planning to either store the backup in a West Europe-bound storage container, or try restoring through it after enabling Azure CDN.
I'm doubting both of these options because

  1. a FILE_SNAPSHOT backup takes a snapshot of the files in Azure storage, so those files would still have to be copied across the wire through the Atlantic
  2. Azure CDN is a read-only system, so I'm not sure it will help with the writes involved in the restore process. If it has to read from the source file, I'd be back to square one.

I'm going to test these out (aka do my homework), but I'm reaching out to see if anyone here has other ideas, has already done this successfully (and wants to share their fix), or knows this is not doable and I should file a UserVoice item.

Thanks a bunch

Update 2020-07-07

I have been testing CDNs – it does not work. When trying to restore an existing backup from the CDN URL (it does not matter whether it was taken with FILE_SNAPSHOT or not), I get an error saying the backup is incorrectly formatted. I know the .bak is not corrupted, since I can restore it directly from the original blob storage. My current guess is that some header mismatch exists between the CDN URL and the actual blob storage.
I also tried backing up directly to the CDN URL, but that does not seem to be supported, which makes sense given my current understanding of CDNs as read-only systems.

Update 2020-07-08

Also tested backing up to a storage account in a different region. Backups are fast, but the snapshots are still taken local to the .mdf location – which means restores are painfully slow, and reducing that lag (and the data transfer) is the whole point of this exercise. I can't really modify the .bak's pointers after the backup has been generated, so even copying that data manually across the ocean won't work.
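One way to see where the snapshots actually live is to list the blob snapshots associated with a database's file-snapshot backups; if I'm reading the function right, they hang off the blobs holding the data/log files, not off the .bak. The database name is a placeholder:

```sql
-- Lists blob snapshots created by FILE_SNAPSHOT backups of this
-- database; their URLs point at the data/log file blobs themselves,
-- which is why the .bak's location doesn't matter for restore speed.
SELECT *
FROM sys.fn_db_backup_file_snapshots('DevCopy');
```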

This means I'm officially out of ideas.

Update

I missed out on doing this for a while, but I'd better keep this thread up to date.
@DavidSpillet raised a few points in his reply. Here is a bit more information that might help people help me out – you're all awesome.

  1. We have about 13 (SQL natively compressed) backup files, totaling around 60 GB. The approximate size after restoring them is around 270 GB.
  2. Our databases are split into two systems: System A has 2 databases, System B has 11. Restores run concurrently per system, so at most 2 restores run in parallel per developer.
  3. Developers in Spain restore from the office over a wired LAN, sitting on the same VNet (and building, and floor) as the server.
  4. The server is a 16-core (32 logical CPUs), 96 GB RAM monster that once hosted a shared SQL instance for our developers. Restores there consistently take 15-20 minutes.
  5. The server's high-capacity storage (where the backups are held) uses locally attached spinning disks – but then again, restoring locally is not the bottleneck.
  6. Developers' laptops use high-capacity NVMe SSDs (at least the new ones do; the older models used spinning drives, but I'm not minding those as the new ones are being rolled out).

I also have a few more concerns due to the current world situation, namely

  • Devs are currently working from anywhere – sometimes at the office, most of the time at home. Unfortunately, our client VPN endpoint is in another country, so restoring from the office's central server while remote is a major pain, and I'm trying to account for that, especially considering how the country is looking right now (2020-07-29).
  • Some of the new laptop models arrived with lower capacity storage (256 GB) and there's nothing I can do to fix that.
  • We cannot purchase commercially available software that would do the trick (for example this one) because of budget constraints.

Best Answer

Restores on developers' laptops are taking up to 180 minutes to complete.

Are you sure you are trying to fix the right bottleneck? If several developers restore at the same time over wireless, remember that wireless bandwidth is a shared resource. If it takes 15 minutes to pull the data from the central server to one laptop when the network is otherwise quiet, and ten laptops start the process at the same time, you are looking at 150+ minutes until the last one completes.

downloaded by a central server here in Spain

If they are not using wireless, and the wired network is generally plenty fast, then the IO resources on the file server are the next thing to check.

Depending on how big the database backups are relative to the memory in the server, most of the access to the backup(s) could be hitting disk instead of coming from cache. If your storage array is based on relatively high latency tech (spindle & head based drives) then concurrent restores could have their runtime ballooned by a much greater amount due to IO latency.

There are a number of ways you could try to get around local issues like these, though to pick a suitable one instead of suggesting many we'd need to know a few more details:

  • How many places (laptops/other) are these backups being restored?
  • How many of these restores are concurrent?
  • How large are the database backups?
  • What is the basic spec of the file server they are hosted on (RAM, drive types and basic layout)
  • How are the developer machines connected to the network?
  • What is the bandwidth capacity of the network itself between the laptops' access points and the server?

(I'm assuming that network traffic and IO on that server are low generally, so these DB restores are by far the most significant use of those resources at the time - if that is not the case then you might have a wider resource provisioning problem!)

This may all sound very obvious and you have probably considered all these factors already, in which case my apologies for being patronising, but I've seen a number of cases over the years where people were trying to solve a problem with high tech solutions that won't eventually help because the core issue is something very basic in the local network or storage infrastructure which they've overlooked.

Additional on Local Restore Speed

One thing to test is how long it takes to simply copy the backups to a dev machine at a time when they are experiencing the slow restores. If that is very slow too, then the issue isn't the restore itself, but it might suggest a workaround. Since, as you say, the restore covers multiple databases, you might speed up the process with a little concurrency: instead of restoring each database directly from the server, copy the first backup locally and start it restoring while the second copies over, and so on.

Of course, looking at it the other way, your devs may already be trying concurrent restores, and that might actually be the problem. Even if the dev machines all use good SSDs, restoring 13 databases concurrently will be slower than doing it sequentially due to IO contention. If the dev machines use more traditional drives or slower solid-state tech (like the eMMC drives in many budget laptops), then this is definitely a possible cause. If the storage array in the server is spinning-disk based, and its other workload means the backups are unlikely to be fully held in cache when needed, then 13 concurrent file reads from a single laptop might also be the issue for the same reason – worse still if two devs try the same thing at the same time.