I'm running a Solaris 11 x86 file server. The file-server part is ZFS + Samba. It's been up and running for three or four years now without major problems.
The Samba shares start out visible to other PCs on the network. I can read from the file server reliably. I can ping the server. I can ping other PCs from the server. I can ping the default gateway from the server.
Starting a few weeks ago, when I attempt to write to the file server, the shares disappear after a few seconds (or maybe after a few hundred megabytes). The issue is apparently with the network. The server is still alive, though. If I hook up a mouse and keyboard and monitor I can still interact with the server.
It doesn't seem like the problem is with the hard drives or with Samba. Tried:
- zpool status
- fmadm faulty
- svcadm restart samba
No errors. No faulty devices. Samba doesn't appear to be the problem.
After the problem happens, I cannot ping the default gateway from the file server anymore. I cannot ping other machines from the file server anymore. I cannot ping the server from other machines.
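One way to pin down exactly when connectivity drops, relative to the start of a write, would be a small ping watchdog run on the server console. This is only a sketch: `ping_once` and `watch` are names made up for illustration, and `ping_once` assumes the Solaris-style `ping host timeout` invocation, which exits with status 0 when the host answers.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def ping_once(host: str) -> bool:
    """Return True if a single ping succeeds.

    Assumption: Solaris-style `ping host timeout` (exit status 0 = alive).
    """
    result = subprocess.run(
        ["ping", host, "2"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(
    probe: Callable[[], bool],
    samples: int,
    interval: float = 1.0,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, bool]]:
    """Call `probe` `samples` times and record each change of state.

    Returns a list of (timestamp, reachable) pairs, one per transition,
    including the initial state.
    """
    transitions: List[Tuple[float, bool]] = []
    last = None
    for _ in range(samples):
        reachable = probe()
        if reachable != last:
            transitions.append((clock(), reachable))
            last = reachable
        sleep(interval)
    return transitions
```

Started just before a test write, something like `watch(lambda: ping_once("10.1.10.1"), samples=600)` would return the moment the gateway stopped answering, which can then be lined up against the time the copy began.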
Network Debugging Steps
I've tried:
- ifconfig skge0 down/ifconfig skge0 up.
- Power cycling the switch that the Solaris box is plugged into
- Power cycling the router that the Solaris box is plugged into
The Solaris box appears to think it's still connected to the network. Resetting the Solaris box (init 6) will bring the shares back, but only until I attempt to write to them again.
I tried netstat -rn before and after the problem. Everything looks pretty normal. Below is "after":
Routing Table: IPv4
Destination          Gateway              Flags Ref   Use        Interface
-------------------- -------------------- ----- ----- ---------- ---------
default              10.1.10.1            UG       27        456 skge0
10.1.10.0            10.1.10.254          U         6    2536350 skge0
127.0.0.1            127.0.0.1            UH        2        252 lo0
Routing Table: IPv6
Destination/Mask            Gateway                     Flags Ref Use     If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       4 lo0
"Before" has 17 instead of 27 in the "Ref" column for the first entry. "After" has slightly higher numbers for "Use", which is probably normal.
I tried netstat -an before and after the problem, too. This one may have more of a clue. There are a number of UDP connections that are present before the problem that all vanish.
Before:
UDP: IPv4
Local Address        Remote Address       State
-------------------- -------------------- ----------
--truncated entries that are present in both before/after--
10.1.10.254.40504    10.1.10.1.53         Connected
10.1.10.254.39900    10.1.10.1.53         Connected
10.1.10.254.40129    10.1.10.1.53         Connected
10.1.10.254.37892    10.1.10.1.53         Connected
10.1.10.254.61658    10.1.10.1.53         Connected
After, those five entries are gone, but one new one is present:
UDP: IPv4
Local Address        Remote Address       State
-------------------- -------------------- ----------
--Again, truncated--
10.1.10.254.53920    10.1.10.1.53         Connected
I can't find any information about what port 53920 is used for. On the gateway side, port 53 appears to be used for DNS. I'm not sure if this is a clue or not; it doesn't seem terribly helpful.
Down in the TCP portion, there are a whole bunch of entries that are "ESTABLISHED" before and that are either gone afterwards or have transitioned to TIME_WAIT or FIN_WAIT_1. This seems to jibe with what I already know.
There's only one reference to the IP of the computer that I used to crash the network:
Before:
TCP: IPv4
Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64512      0 128480      0 ESTABLISHED
After:
TCP: IPv4
Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64256      0 128480      0 ESTABLISHED
The only difference is in the Swind (send window?) column. It's strange that the state is still listed as established.
I did the netstat -an experiment again.
This time, the only difference between before and after was in the entry for the IP address of the PC I used to crash the share.
Before:
TCP: IPv4
Local Address        Remote Address       Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53613    380416      0 128480      0 ESTABLISHED
After:
10.1.10.254.445      10.1.10.132.53613    65280      0 128480      0 ESTABLISHED
Again, the only difference is in the Swind column – the number got smaller.
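Comparing the two listings by eye is error-prone, so the before/after check could be scripted. The following is a rough sketch, not a polished tool: `parse_tcp_rows` and `diff_snapshots` are hypothetical helper names, and the parser assumes TCP rows have the seven whitespace-separated columns shown above (local address, remote address, Swind, Send-Q, Rwind, Recv-Q, state).

```python
from typing import Dict, List, Tuple

FIELDS = ("swind", "send_q", "rwind", "recv_q", "state")


def parse_tcp_rows(text: str) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Map (local, remote) -> column values for each TCP row in `netstat -an` text."""
    rows: Dict[Tuple[str, str], Dict[str, str]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed layout: 7 columns, with numeric window and queue fields.
        if len(parts) == 7 and all(p.isdigit() for p in parts[2:6]):
            rows[(parts[0], parts[1])] = dict(zip(FIELDS, parts[2:]))
    return rows


def diff_snapshots(before: str, after: str) -> List[str]:
    """Describe TCP connections that vanished, appeared, or changed."""
    old, new = parse_tcp_rows(before), parse_tcp_rows(after)
    report: List[str] = []
    for key in sorted(old.keys() | new.keys()):
        local, remote = key
        if key not in new:
            report.append(f"gone: {local} -> {remote} ({old[key]['state']})")
        elif key not in old:
            report.append(f"new: {local} -> {remote} ({new[key]['state']})")
        else:
            for field in FIELDS:
                if old[key][field] != new[key][field]:
                    report.append(
                        f"changed: {local} -> {remote} "
                        f"{field} {old[key][field]} -> {new[key][field]}"
                    )
    return report


# The two rows from the second experiment above.
BEFORE = "10.1.10.254.445 10.1.10.132.53613 380416 0 128480 0 ESTABLISHED"
AFTER = "10.1.10.254.445 10.1.10.132.53613 65280 0 128480 0 ESTABLISHED"
```

With the two rows above, `diff_snapshots(BEFORE, AFTER)` reports a single change, the Swind value dropping from 380416 to 65280, which matches what I saw by eye.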
I've reached the end of what I know about this kind of thing. netstat seems to be telling me what I already know. Short of buying another network card and just trying it, or reinstalling Solaris, I've got no idea. Can anybody clue me in as to the next step here?
Edit
I'm buying another network card and just trying it. It's going to take about a week to get here, so I'll keep poking at this in the meantime.
Best Answer
netstat -an, netstat -rn, and lsof (before and during the problem) may give clues. (Do they show too many open connections?) tcpdump may help too: start it just before establishing the connection and see what happens around the time connections start to die (and also a few minutes before the timeouts).
Also check whether the NFS options are non-default and may be having an effect:
- Try soft instead of hard settings, for example.
- Remove all "non-core" options (core being those that are needed for NFS to be established) and put them back a few at a time, to see which ones are causing the issue.
Sorry, but I do not have access to a Solaris machine at the moment, so I can't give the exact settings. A web search including the "Solaris" and "NFS" keywords will help you find them.
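To make the before/during captures easy to repeat, the commands could be wrapped in a small script that writes each output to a timestamped file. This is only a sketch under assumptions: `take_snapshot`, `run_command`, and the file naming are invented for illustration, and only the two netstat invocations mentioned above are included.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Sequence

# Commands to capture; both are the invocations discussed above.
COMMANDS = (("netstat", "-an"), ("netstat", "-rn"))


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return whatever it printed on stdout."""
    return subprocess.run(list(argv), capture_output=True, text=True).stdout


def take_snapshot(
    out_dir: Path,
    label: str,
    runner: Callable[[Sequence[str]], str] = run_command,
) -> List[Path]:
    """Save the output of each command to `<label>_<timestamp>_<command>.txt`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    written: List[Path] = []
    for argv in COMMANDS:
        name = "_".join(argv).replace("-", "")  # e.g. "netstat_an"
        path = out_dir / f"{label}_{stamp}_{name}.txt"
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling `take_snapshot(Path("snapshots"), "before")` ahead of a test write and `take_snapshot(Path("snapshots"), "during")` once the shares vanish leaves pairs of files that can be compared with an ordinary diff.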