SQL Server – SQLServerLogMgr::LogWriter: Operating system error 1117

sql server

We have an ongoing issue with our production SQL Server (physical) where we randomly receive this error in the log, which puts the database into a recovery state:

 SQLServerLogMgr::LogWriter: Operating system error 1117(The request could not be performed because of an I/O device error.) encountered.

The issue always occurs on the drive storing the transaction logs. The database normally recovers on its own, but in a few instances it does not and we need to restart the instance to recover it. DBCC CHECKDB returns no errors for any database.
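
For reference, our integrity sweep is roughly the sketch below; the cursor mechanics and the choice to skip system databases are our own conventions, not anything specific to this error.

    -- A minimal sketch: run DBCC CHECKDB against every online database
    -- and surface only errors. Adjust the filter for your own estate.
    DECLARE @db sysname, @sql nvarchar(max);

    DECLARE db_cursor CURSOR FAST_FORWARD FOR
        SELECT name
        FROM sys.databases
        WHERE state_desc = 'ONLINE'
          AND database_id > 4;          -- skip system databases if desired

    OPEN db_cursor;
    FETCH NEXT FROM db_cursor INTO @db;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db)
                 + N') WITH NO_INFOMSGS, ALL_ERRORMSGS;';
        EXEC sys.sp_executesql @sql;    -- errors, if any, are raised to the client
        FETCH NEXT FROM db_cursor INTO @db;
    END

    CLOSE db_cursor;
    DEALLOCATE db_cursor;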

Our storage team has been investigating with our vendor for weeks, so far with no luck; the investigation is ongoing.

That being said, how should a SQL Server DBA handle this error, aside from reporting it to the storage team and checking for database corruption? I am wondering if there is any more information I can gather from the SQL Server side that may help in their investigation.

Running SQL Server 2012 SP3; storage is a SAN.

First Update

Our infrastructure team made the following changes last night

  • Updated the firmware on all NICs on the database server
  • Updated the firmware on the network switches
  • Enabled jumbo frames for iSCSI

We haven't received the error yet; I'll update again in a week or so.

Second Update

The changes made in the previous update did not resolve the issue. Last night we moved tempdb from the SAN to a local drive on the physical server and disabled iSCSI optimization connection tracking. We haven't received the error yet, and we are seeing much faster disk reads/writes to our data and log drives (still on the SAN), as well as to the now-local tempdb. Additionally, we had been receiving many iSCSI errors in the Windows event logs around the time of the error and throughout the day; since last night's changes those iSCSI errors are mostly gone. Some are still coming in, but not nearly as many.
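
To put numbers behind the "much faster" claim, a per-file latency snapshot like the following can be taken from the SQL Server side. The figures in sys.dm_io_virtual_file_stats are cumulative since instance startup, so a before/after comparison needs two snapshots; the query itself is a standard diagnostic pattern, not anything specific to our setup.

    -- Average I/O stall per read/write, per database file,
    -- cumulative since the instance last started.
    SELECT  DB_NAME(vfs.database_id)                                AS database_name,
            mf.physical_name,
            vfs.num_of_reads,
            vfs.num_of_writes,
            vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)    AS avg_read_stall_ms,
            vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)   AS avg_write_stall_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY avg_write_stall_ms DESC;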

Thanks,
Kevin

Best Answer

That being said, how should a SQL Server DBA handle this error, aside from reporting it to the storage team and checking for database corruption?

There really isn't anything you can do from the database side. SQL Server is a victim of the underlying hardware and virtualization (if any) having issues, and the underlying problem (driver, hardware, configuration, etc.) needs to be fixed. Note that if you're in a virtualized environment, it could be a software layer in between or an issue with the host/guest configuration, and not a physical hardware or storage issue at all.

Realistically, stripping out the in-between layers can help in troubleshooting the issue: removing filter drivers and their associated software, moving to physical hardware (if virtual), and/or changing storage solutions (for example, using local disks instead of remote/SAN). Updating drivers and firmware (multipathing, device drivers, etc.) could also help, but that is something I would charge a datacenter or systems admin with rather than a DBA.

I am wondering if there is any more information I may be able to gather from the SQL Server side which may help in their investigation?

Not really. Underneath, SQL Server is calling the read and write APIs through Windows; the return code from that API call is this Windows error code, which we bubble up so the SQL Server admin knows why SQL Server is having issues.
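
One concrete artifact you can hand the storage team is the exact timestamp of each occurrence from the SQL Server error log, so they can line them up against their own traces. A sketch using xp_readerrorlog (undocumented but long-standing; parameters are log number, log type where 1 = error log, and two search strings that must both match):

    -- Find every 1117 LogWriter entry in the current error log.
    EXEC sys.xp_readerrorlog 0, 1, N'1117', N'LogWriter';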

If anything, since it's a single volume, they should be able to isolate it on the backend and enable infrastructure tracing. On a physical machine, that means tracing from the HBA/SCSI controller down through the hardware; if it's virtual, from the host down through the same layers.
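
If you can catch the problem while it's happening, something like the sketch below shows which file each stuck I/O belongs to, which points directly at the volume they should trace. The 100 ms threshold is only an illustration; tune it to taste.

    -- I/O requests currently pending at the OS, joined back to the
    -- owning database file. Long io_pending_ms_ticks values identify
    -- the volume worth tracing on the backend.
    SELECT  DB_NAME(vfs.database_id) AS database_name,
            mf.physical_name,
            pio.io_pending,
            pio.io_pending_ms_ticks
    FROM    sys.dm_io_pending_io_requests AS pio
    JOIN    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
            ON vfs.file_handle = pio.io_handle
    JOIN    sys.master_files AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    WHERE   pio.io_pending_ms_ticks > 100   -- arbitrary illustrative threshold
    ORDER BY pio.io_pending_ms_ticks DESC;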

Sadly, this is easier said than done, and most places are ill-equipped to actually investigate this style of issue, especially when the environment is virtualized.

Final Thoughts

What do the system event logs say? Are there further NTFS or other corruption issues coming up? Are devices being reset? The system event log should be dissected with an extremely fine toothed comb to see if there are a series of events or items that seem to lead up to this or if it's spontaneous. Additionally I've found these events to generally cluster around certain items such as times of high use on a specific controller or overhwelmed SANs.