SQL Server – RAID0 instead of RAID1 or 5, is this crazy?

availability-groups, raid, sql-server

I'm considering using a RAID0 setup for one of our SQL Server clusters. I'll outline the situation and am looking for why this may be a bad idea. Also, if you have use cases, white papers, or other documentation you can point me to on this topic, that would be great.

We have 3 servers in 2 datacenters that are part of an SQL cluster. They are all running SQL Server in an Availability Group. The primary has a replica sitting right next to it and another in the other datacenter. They are running synchronous replication with automatic failover. All drives are enterprise class SSDs. They will be running SQL Server 2017 or 2019.
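For concreteness, the replica configuration I'm describing looks roughly like the sketch below; the server names, AG name, database name, and endpoint URLs are placeholders, not our actual setup.

```sql
-- Rough sketch of the topology described above: two synchronous replicas in the
-- same datacenter plus a third in the remote datacenter. All names are placeholders.
CREATE AVAILABILITY GROUP [AG_Example]
FOR DATABASE [AppDb]
REPLICA ON
    N'SQL-DC1-A' WITH (
        ENDPOINT_URL      = N'TCP://SQL-DC1-A.example.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'SQL-DC1-B' WITH (
        ENDPOINT_URL      = N'TCP://SQL-DC1-B.example.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC),
    N'SQL-DC2-A' WITH (
        ENDPOINT_URL      = N'TCP://SQL-DC2-A.example.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC);
```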

I'm thinking that there would be multiple benefits to running them on RAID0 arrays over other methods, with few, if any, real drawbacks. The only negative I'm currently seeing is the lack of local redundancy on the primary server, so the chance of it failing increases. As pros:

  1. If a drive fails, rather than run in a slowed, degraded state until someone receives a notice and manually acts on it, the server will immediately fail over to a secondary, maintaining full operational capability. This has the added benefit of notifying us of the failover (see the alert sketch after this list), so we can investigate the cause sooner.

  2. It reduces the overall chance of failure per TB of capacity. Since we don't need parity or mirror drives, we reduce the number of drives per array, and with fewer drives there is less total chance of a drive failure.

  3. It is cheaper. Needing fewer drives for our required capacity obviously costs less.
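On the notification point in (1): one way to get told about a failover right away is a SQL Agent alert on the availability group role-change message. A minimal sketch, assuming Database Mail and an operator are already set up; the operator name is a placeholder.

```sql
-- Minimal sketch: alert when an availability group replica changes role (i.e. a failover).
-- Error 1480 is the role-change message; the operator name below is a placeholder.
EXEC msdb.dbo.sp_add_alert
    @name = N'AG role change (failover)',
    @message_id = 1480,
    @severity = 0,
    @enabled = 1,
    @include_event_description_in = 1;   -- include the message text in the e-mail

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'AG role change (failover)',
    @operator_name = N'DBA Team',        -- placeholder operator
    @notification_method = 1;            -- 1 = e-mail
```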

I know this isn't the conventional business thinking, but is there something I'm not considering? I'd love any input, either pro or con.

I'm not trying to do this for query performance gains, though if there are meaningful ones feel free to point them out. My primary concern is failing to consider or address a reliability or redundancy issue that I haven't thought of.

The OS is on a separate mirrored drive, so the server itself should stay up. If one of those drives fails, it can be replaced and re-mirrored; the drive is small and holds no database files other than the system DBs, so I can't imagine that taking more than minutes. If one of the data arrays fails, we replace the drive, rebuild the array, restore, and resync with the AG. In my personal experience, restoring has been MUCH faster than a RAID5 rebuild. I haven't ever had a RAID1 failure, so I don't know whether that rebuild would be faster or not. The restore would come from a backup and be rolled forward to match the primary, so the load increase on the primary server should be minimal, since it only has to sync the last few minutes of logs with the recovered replica.
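To make that resync concrete, what I have in mind is roughly the following (database name, AG name, and backup paths are placeholders): restore the rebuilt replica without recovery, roll it forward from log backups, and rejoin the AG so only the tail of the log has to come from the primary.

```sql
-- Hypothetical resync of a replica after the RAID0 array has been rebuilt.
-- Database name, availability group name, and backup paths are placeholders.

-- 1. Restore the most recent full backup, leaving the database in RESTORING.
RESTORE DATABASE [AppDb]
    FROM DISK = N'\\backupserver\sql\AppDb_full.bak'
    WITH NORECOVERY, REPLACE;

-- 2. Roll forward with the log backups taken since that full backup.
RESTORE LOG [AppDb]
    FROM DISK = N'\\backupserver\sql\AppDb_log_001.trn'
    WITH NORECOVERY;

-- 3. Rejoin the availability group; the last few minutes of log
--    are then synchronized directly from the current primary.
ALTER DATABASE [AppDb] SET HADR AVAILABILITY GROUP = [AG_Example];
```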

Best Answer

There's one very important aspect I think you are missing in your assessment:

How do you plan to recover?

When RAID5 loses a drive, it will run in a degraded state until it has rebuilt itself automatically (at least if you have a hot spare at hand).

When a RAID0 array loses a drive, it cannot recover at all. The array, and with it that replica, is simply gone; you have lost your redundancy, and to get it back you need to rebuild the RAID0 array and copy all of the data (not just the data that was on the broken drive) back from the secondary that is now carrying the production load. That is, instead of a single degraded RAID5 array, it's now your entire production setup that takes the performance hit.
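If you do end up in that full resync, you can at least watch how far behind the rebuilt replica still is; a sketch using the replica-state DMVs, run on the primary (queue sizes and rates are reported in KB and KB/sec):

```sql
-- Sketch: watch catch-up progress of the secondaries from the primary.
SELECT ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,   -- log not yet sent to the secondary (KB)
       drs.redo_queue_size,       -- log received but not yet redone (KB)
       drs.log_send_rate,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;
```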

If the RAID5 (or RAID6) degraded-state performance penalty isn't something you can cope with, you should probably run RAID 1+0 instead. Yes, it costs more, but disk prices being what they are, it's going to be money well spent.

Maybe "actively monitor the raid5 state, and transfer the load off the primary when a drive fails" is the solution that gives you most of the benefits without any drawbacks? (Apart from losing the coolness factor of running without any local redundancy, of course.) If your raid5 drive recovery is taking a lot longer than a complete database data sync, either your raid software is acting strangely, or you have seriously oversized disks, I'd think.