Sql-server – Four db servers all identical, one is much slower than the others

performancesql serversql-server-2008-r2

The setup

We have an application loading a lot of transaction data for later querying. It's sharded on 4 SQL Server 2008R2 Enterprise Edition on Windows Server 2008 64 bits. All servers are configured exactly the same (or so I'm told). 72GB of memory, 2×6 core CPU, each attached to their own SAN. The databases are currently about 2TB in size.

They all load about the same number of transactions daily (+/- 10%), it's mostly inserts. Almost no updates. The process is I/O bound, the CPUs sits at 10-15% usage.

The problem

One of the servers is much slower than all the others. It only inserts transactions at a quarter of the speed of the other three. It has always been like this. It eventually catches up during quieter hours, but the other servers keep up with traffic at all time.

The SAN people have told us that all four SANs are performing the same, with no errors. Looking at the I/O graph on the Activity Monitor, the I/O is about the same on all servers.

I have extracted the schema for the databases and compared them, no difference. There doesn't seem to be any difference in the SQL Server configs either. They all use about the same amount of memory with the same distribution.

The question

What else should I be looking at?

We have been looking at this for months and can't come up with anything. There must be some explanation. I've looked at a lot of different performance metrics and I can't seem to find any glaring difference except that the overall inserts are slower.

Best Answer

I'd be inclined to point a finger at the storage of the poorly performing server.

Sounds like you're in the unfortunate position of having a storage admin telling you everything is ok when it quite possibly isn't. Only way to be sure is to prove it.

If you have the luxury of a maintenance window, run a batch of SQLIO tests against each server and compare the results. @BrentOzar has a good intro article on SQLIO which also covers some alternative tools such as Crystal Diskmark, which might be enough to prove the point in your case. Run the tests in collaboration with the storage admin, so they can witness the issues first hand. Perhaps arm yourself with How to Prove It’s a SAN Problem before hand.

If IO testing proves fruitless or impossible, throw sp_whoisactive or a similar monitoring solution at the problem. Kendra Little's Collecting Data from sp_WhoIsActive in a Table would be the best approach for a long running ETL process.

Before doing any of the above, take 1 minute to check something very very basic that crops up time and time again... the NTFS allocation unit size on each of the arrays.