Debugging a Solaris network issue

networking, samba, solaris, zfs

I'm running a Solaris 11 x86 file server. The file-server part is ZFS + Samba. It's been up and running for three or four years now without major problems.

The Samba shares start out visible to other PCs on the network. I can read from the file server reliably. I can ping the server. I can ping other PCs from the server. I can ping the default gateway from the server.

Starting a few weeks ago, when I attempt to write to the file server, the shares disappear after a few seconds (or maybe after a few hundred megabytes). The issue is apparently with the network. The server is still alive, though. If I hook up a mouse and keyboard and monitor I can still interact with the server.

It doesn't seem like the problem is with the hard drives or with Samba. I tried:

  • zpool status
  • fmadm faulty
  • svcadm restart samba

No errors. No faulty devices. Samba doesn't appear to be the problem.

After the problem happens, I cannot ping the default gateway from the file server anymore. I cannot ping other machines from the file server anymore. I cannot ping the server from other machines.

Network Debugging Steps

I've tried:

  • ifconfig skge0 down followed by ifconfig skge0 up
  • Power cycling the switch that the Solaris box is plugged into
  • Power cycling the router that the Solaris box is plugged into

The Solaris box appears to think it's still connected to the network. Resetting the Solaris box (init 6) will bring the shares back, but only until I attempt to write to them again.
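One way to test the impression that the box is still connected would be to compare the interface's error counters before and after a failed write. The sketch below is not from the original post. It assumes the output of netstat -i was saved to a file, and that this output has a header row starting with "Name" that includes "Ierrs" and "Oerrs" columns. The function name iface_errors and the file names are hypothetical.

```shell
#!/bin/sh
# Print the input/output error counters for one interface from a saved
# `netstat -i` listing. Assumption: the listing has a header row that
# starts with "Name" and contains "Ierrs" and "Oerrs" columns.
# On older Solaris installs, plain awk may reject -v; nawk or
# /usr/xpg4/bin/awk may be needed instead (also an assumption).
iface_errors() {
    awk -v ifname="$1" '
        $1 == "Name" {                      # header row: locate the columns
            for (i = 1; i <= NF; i++) {
                if ($i == "Ierrs") ie = i
                if ($i == "Oerrs") oe = i
            }
            next
        }
        # data row for the requested interface, once the columns are known
        $1 == ifname && ie && oe { print $1, "Ierrs=" $ie, "Oerrs=" $oe }
    ' "$2"
}

# Example (hypothetical file names):
#   netstat -i > netstat-i.before.txt    # before the write test
#   netstat -i > netstat-i.after.txt     # after the shares vanish
#   iface_errors skge0 netstat-i.before.txt
#   iface_errors skge0 netstat-i.after.txt
```

If the counters climb between the two snapshots, that may point toward the NIC, cable, or driver rather than Samba. If they stay flat, this check simply rules nothing in or out.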

I tried netstat -rn before and after the problem. Everything looks pretty normal. Below is "after":

Routing Table: IPv4
Destination           Gateway           Flags  Ref     Use     Interface 
-------------------- -------------------- ----- ----- ---------- --------- 
default              10.1.10.1            UG       27        456 skge0     
10.1.10.0            10.1.10.254          U         6    2536350 skge0     
127.0.0.1            127.0.0.1            UH        2        252 lo0       

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
::1                         ::1                         UH      2       4 lo0   

"Before" has 27 instead of 17 in the "Ref" column for the 1st entry. "After" has slightly higher numbers for "Use" – probably normal.

I tried netstat -an before and after the problem, too. This one may have more of a clue. There are a number of UDP connections that are present before the problem that all vanish.

Before:

UDP: IPv4
   Local Address        Remote Address      State
-------------------- -------------------- ----------
    --truncated entries that are present in both before/after--
10.1.10.254.40504    10.1.10.1.53         Connected
10.1.10.254.39900    10.1.10.1.53         Connected
10.1.10.254.40129    10.1.10.1.53         Connected
10.1.10.254.37892    10.1.10.1.53         Connected
10.1.10.254.61658    10.1.10.1.53         Connected

After, those five entries are gone, but one new one is present:

UDP: IPv4
   Local Address        Remote Address      State
-------------------- -------------------- ----------
    --Again, truncated--
10.1.10.254.53920    10.1.10.1.53         Connected

I can't find any information about what port 53920 is used for. On the gateway side, port 53 appears to be used for DNS; I'm not sure if this is a clue or not. It doesn't seem terribly helpful.

Down in the TCP portion, there are a whole bunch of entries that are "ESTABLISHED" in "before" that are either gone in "after" or have transitioned to TIME_WAIT or FIN_WAIT_1. This seems to jibe with what I already know.
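The shift in TCP states described above can be summarized mechanically rather than by eye. This is a minimal sketch, assuming each netstat -an output was saved to a file and that the state is the last field of every TCP line, as in the excerpts in this post. The function name count_states and the file names are hypothetical.

```shell
#!/bin/sh
# Count TCP connections per state in a saved `netstat -an` snapshot.
# Assumption: sections start with lines such as "TCP: IPv4" and
# "UDP: IPv4", and TCP states are upper-case words in the last field.
count_states() {
    awk '
        /^TCP:/ { in_tcp = 1; next }           # entering a TCP section
        /^(UDP|SCTP|Active)/ { in_tcp = 0 }    # entering some other section
        # header and separator rows do not end in an upper-case state name
        in_tcp && $NF ~ /^[A-Z][A-Z_0-9]*$/ { n[$NF]++ }
        END { for (s in n) print s, n[s] }
    ' "$1" | sort
}

# Example (hypothetical file names):
#   count_states netstat-an.before.txt
#   count_states netstat-an.after.txt
```

Running it on both snapshots gives one line per state with a count, so a drop in ESTABLISHED and a rise in TIME_WAIT or FIN_WAIT_1 shows up directly.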

There's only one reference to the IP of the computer that I used to crash the network:

Before:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64512      0 128480      0 ESTABLISHED

After:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64256      0 128480      0 ESTABLISHED

The only difference is in the Swind (send window?) column. It's strange that the state is still listed as established.

I did the netstat -an experiment again.

The only difference before and after was related to the IP address of the PC I used to crash the share.

Before:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53613    380416      0 128480      0 ESTABLISHED

After:

10.1.10.254.445      10.1.10.132.53613    65280       0 128480      0 ESTABLISHED

Again, the only difference is in the Swind column – the number got smaller.
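To repeat this comparison without scanning the whole listing, the lines for one client can be pulled out of each saved snapshot. This is a sketch only, assuming the seven-column TCP layout shown above (Swind is the third column). The function name swind_for_peer and the file names are hypothetical.

```shell
#!/bin/sh
# Print remote endpoint, Swind, and state for every TCP line whose
# remote address belongs to the given peer IP.
# Assumption: TCP lines have exactly 7 fields, with Swind in field 3.
# On older Solaris installs, nawk may be needed for the -v option.
swind_for_peer() {
    awk -v peer="$1" '
        # the trailing "." keeps 10.1.10.13 from matching 10.1.10.132
        NF == 7 && index($2, peer ".") == 1 { print $2, $3, $NF }
    ' "$2"
}

# Example (hypothetical file names):
#   swind_for_peer 10.1.10.132 netstat-an.before.txt
#   swind_for_peer 10.1.10.132 netstat-an.after.txt
```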

I've reached the end of what I know about this kind of thing. netstat seems to be telling me what I already know. Short of buying another network card and just trying it, or reinstalling Solaris, I've got no idea. Can anybody clue me in as to the next step here?

Edit

I'm buying another network card and just trying it. It's going to take about a week to get here, so I'll keep poking at this in the meantime.

Best Answer

netstat -an, netstat -rn, and lsof (before and during the problem) may give clues. (Do they show too many open connections?) tcpdump may help too: start it just before establishing the connection and see what happens around the time connections start to die (and also a few minutes before the timeouts).
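A rough sketch of how the before/during comparison could be scripted, assuming the snapshots are written to files in the current directory. The function names snapshot and changed_lines and all file names are hypothetical; the interface name and client address are the ones from the question.

```shell
#!/bin/sh
# Save both netstat views under a label, e.g. `snapshot before`.
snapshot() {
    netstat -an > "netstat-an.$1.txt"
    netstat -rn > "netstat-rn.$1.txt"
}

# Show only the lines that differ between two saved snapshots.
# "<" marks lines present only in the first file, ">" only in the second.
changed_lines() {
    diff "$1" "$2" | grep '^[<>]'
}

# Example session (not run here):
#   snapshot before
#   # start a capture in another terminal, assuming tcpdump is installed:
#   #   tcpdump -i skge0 -w write-test.pcap host 10.1.10.132 and port 445
#   # ... trigger the failing write from the client ...
#   snapshot during
#   changed_lines netstat-an.before.txt netstat-an.during.txt
```

Writing the capture to a file with -w lets the packets around the moment of failure be inspected afterwards instead of scrolling past on the console.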

Also check whether any NFS options are set to non-default values and may be having an effect:

  • Try to use soft instead of hard settings, for example.

  • Remove all "non-core" options (core being those needed for the NFS mount to be established) and put them back one at a time to see which ones cause the issue.

Sorry, but I don't have access to a Solaris machine at the moment to give the exact settings. A web search including the keywords "Solaris" and "NFS" will help you find them.
