Sql-server – Dead Raid Card SQL Server Running in Memory

raidsql server

Wanted some opinions, on a predicament I’m currently facing. I just got thrown with this mess..

SQL Server database files saving to Raid volume, long story short raid card died so volume is not accessible. SQL service is still running and applications accessing the databases are still responding.

The backups aren’t upto date of course.

I have tried to export the databases out but error occurs because the files do not exist as the volume isn’t present.

However the ram is currently at 47GB.

My question is what does SQL Server store in memory, as my understanding is it stores everything in ram and then dumps to file?

Is the data recoverable by specialist if I do a raw ram dump?

Could I get backup of the database and remount another volume with the same drive letter and file structure and try exporting the database?

Best Answer

long story short raid card died so volume is not accessible.

Definitely not a good thing, but is it possible to buy a replacement? If it's on the motherboard directly, can the motherboard be replaced?

SQL service is still running and applications accessing the databases are still responding.

This will work so long as the data in the files that are now not accessible are not asked for any data. If they are, it'll return an error to the session and write some information in the errorlog. This may currently be happening.

If, at some point, data from this volume was in memory and subsequently changed then you should see errors in the errorlog from checkpoints failing to write to the disk. This leads to a few conclusions but it's hard to say without looking at the database metadata. Generally speaking, if you aren't seeing any failures writing to disk in the eventlog, errorlog, or sessions (if you're looking at that) it's possible to conclude that some of the following may be true:

  1. The database files aren't in use and aren't needed
  2. The database files were in use but no data has changed
  3. The database files aren't in use but will be used at a later point (month end, etc.)
  4. Some data from the database files are in memory but not changed
  5. No data from the database files are in memory
  6. None of the files are log files or they are log files but no VLFs are in use

This list seems to be very contrary but multiples of them can be true at the same time. Again, without seeing the actual metadata it's hard to say.

The backups aren’t upto date of course.

This is actually the root issue (I know you believe it's the RAID Card) because with backups it could have been restored to a different server, DNS updated, Logins created, and everyone go about their normal day. This isn't a put down, merely an example.

My question is what does mssql store in memory, as my understanding is it stores everything in ram and then dumps to file?

You can think of it that way, yes. The very oversimplified version is that it's initially read into memory if it existed on disk, changed in memory, and some point later written back to disk if it's a disk based table.

Is the data recoverable by specialist if I do a raw ram dump?

I'm not sure this would be a suitable way to go to be honest. If the files being offline aren't currently spewing tons of errors into the logs and clients are being returned tons of errors due to the inability to retrieve the data... you may not need (at this point in time) the data from those files. Additionally, swapping out cards should solve the immediate issue.

I believe the question is can you shutdown without losing any data? If so, and you aren't spewing tons of errors as listed above then it's more than likely you won't lose much if anything at all - that's not a guarantee and YMMV but if you look at all the other logs it should give you an idea. Here is the main point, if it's in memory right now and has been changed then the log record exists whether or not it was successfully flushed to disk at some point later which means upon database startup it should redo the transactions.

Ask your clients to disconnect or stop using the services, stop the services or applications, then deal with the database. The longer this continues to be up and running the higher the risk of something much more catastrophic happening.