Debugging a Solaris network issue

networking samba solaris zfs

I'm running a Solaris 11 x86 file server. The file-server part is ZFS + Samba. It's been up and running for three or four years now without major problems.

The Samba shares start out visible to other PCs on the network. I can read from the file server reliably. I can ping the server. I can ping other PCs from the server. I can ping the default gateway from the server.

Starting a few weeks ago, when I attempt to write to the file server, the shares disappear after a few seconds (or maybe after a few hundred megabytes). The issue is apparently with the network. The server is still alive, though. If I hook up a mouse and keyboard and monitor I can still interact with the server.

It doesn't seem like the problem is with the hard drives or with Samba. I tried:

zpool status
fmadm faulty
svcadm restart samba

No errors. No faulty devices. Samba doesn't appear to be the problem.

After the problem happens, I cannot ping the default gateway from the file server anymore. I cannot ping other machines from the file server anymore. I cannot ping the server from other machines.

Network Debugging Steps

I've tried:

ifconfig skge0 down, then ifconfig skge0 up
Power cycling the switch that the Solaris box is plugged into
Power cycling the router that the Solaris box is plugged into

The Solaris box appears to think it's still connected to the network. Resetting the Solaris box (init 6) will bring the shares back, but only until I attempt to write to them again.

I tried netstat -rn before and after the problem. Everything looks pretty normal. Below is "after":

Routing Table: IPv4
Destination           Gateway           Flags  Ref     Use     Interface 
-------------------- -------------------- ----- ----- ---------- --------- 
default              10.1.10.1            UG       27        456 skge0     
10.1.10.0            10.1.10.254          U         6    2536350 skge0     
127.0.0.1            127.0.0.1            UH        2        252 lo0       

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
::1                         ::1                         UH      2       4 lo0

"Before" has 17 instead of 27 in the "Ref" column for the first entry. "After" has slightly higher numbers for "Use", which is probably normal.

I tried netstat -an before and after the problem, too. This one may have more of a clue. There are a number of UDP connections that are present before the problem that all vanish.

Before:

UDP: IPv4
   Local Address        Remote Address      State
-------------------- -------------------- ----------
    --truncated entries that are present in both before/after--
10.1.10.254.40504    10.1.10.1.53         Connected
10.1.10.254.39900    10.1.10.1.53         Connected
10.1.10.254.40129    10.1.10.1.53         Connected
10.1.10.254.37892    10.1.10.1.53         Connected
10.1.10.254.61658    10.1.10.1.53         Connected

After, those five entries are gone, but one new one is present:

UDP: IPv4
   Local Address        Remote Address      State
-------------------- -------------------- ----------
    --Again, truncated--
10.1.10.254.53920    10.1.10.1.53         Connected

I can't find any information about what port 53920 is used for. On the gateway side, port 53 appears to be used for DNS; I'm not sure if this is a clue or not. It doesn't seem terribly helpful.
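The before/after comparison of the UDP section can be done mechanically instead of by eye. The following is a minimal Python sketch, not part of the original troubleshooting; it assumes the two netstat -an UDP sections were saved as text, and the helper name udp_pairs is made up for illustration. The sample rows are the ones quoted above.

```python
def udp_pairs(netstat_text):
    """Return the set of (local, remote) pairs from 'Connected' UDP rows."""
    pairs = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # A connected UDP row has exactly three columns: local, remote, state.
        if len(fields) == 3 and fields[2] == "Connected":
            pairs.add((fields[0], fields[1]))
    return pairs


before = """\
10.1.10.254.40504    10.1.10.1.53         Connected
10.1.10.254.39900    10.1.10.1.53         Connected
10.1.10.254.40129    10.1.10.1.53         Connected
10.1.10.254.37892    10.1.10.1.53         Connected
10.1.10.254.61658    10.1.10.1.53         Connected
"""

after = """\
10.1.10.254.53920    10.1.10.1.53         Connected
"""

# Set differences show which sockets vanished and which appeared.
gone = udp_pairs(before) - udp_pairs(after)
new = udp_pairs(after) - udp_pairs(before)

for local, remote in sorted(gone):
    print("gone:", local, "->", remote)
for local, remote in sorted(new):
    print("new: ", local, "->", remote)
```

Every row has 10.1.10.1.53 as the remote side and a different high-numbered local port. That pattern would be consistent with short-lived client source ports for DNS lookups, though this is an interpretation and not something the output proves.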

Down in the TCP portion, there are a whole bunch of entries that are "ESTABLISHED" before that are either gone in "after" or have transitioned to either TIME_WAIT or FIN_WAIT_1. This seems to jibe with what I already know.

There's only one reference to the IP of the computer that I used to crash the network:

Before:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64512      0 128480      0 ESTABLISHED

After:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53487    64256      0 128480      0 ESTABLISHED

The only difference is in the Swind (send window?) column. It's strange that the state is still listed as established.

I did the netstat -an experiment again.

The only difference before and after was related to the IP address of the PC I used to crash the share.

Before:

TCP: IPv4
   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q    State
-------------------- -------------------- ----- ------ ----- ------ -----------
10.1.10.254.445      10.1.10.132.53613    380416      0 128480      0 ESTABLISHED

After:

10.1.10.254.445      10.1.10.132.53613    65280       0 128480      0 ESTABLISHED

Again, the only difference is in the Swind column – the number got smaller.
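To confirm that Swind really is the only column that changed for a given connection, the two rows can be compared field by field. This is a minimal Python sketch under the assumption that TCP rows have the seven columns shown above; the names tcp_rows and changed_fields are hypothetical helpers, and the sample rows are the ones from the second experiment.

```python
FIELDS = ("swind", "send_q", "rwind", "recv_q", "state")


def tcp_rows(netstat_text):
    """Map (local, remote) -> column values for each TCP row in the text."""
    rows = {}
    for line in netstat_text.splitlines():
        parts = line.split()
        # Data rows have 7 columns and a numeric Swind; this skips the
        # header line and the dashed separator line.
        if len(parts) == 7 and parts[2].isdigit():
            rows[(parts[0], parts[1])] = dict(zip(FIELDS, parts[2:]))
    return rows


def changed_fields(before_text, after_text):
    """Return {connection: {field: (old, new)}} for connections in both snapshots."""
    before, after = tcp_rows(before_text), tcp_rows(after_text)
    changes = {}
    for conn in before.keys() & after.keys():
        diff = {
            name: (before[conn][name], after[conn][name])
            for name in FIELDS
            if before[conn][name] != after[conn][name]
        }
        if diff:
            changes[conn] = diff
    return changes


before = "10.1.10.254.445      10.1.10.132.53613    380416      0 128480      0 ESTABLISHED"
after = "10.1.10.254.445      10.1.10.132.53613    65280       0 128480      0 ESTABLISHED"

changes = changed_fields(before, after)
for conn, diff in changes.items():
    print(conn, diff)
```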

I've reached the end of what I know about this kind of thing. netstat seems to be telling me what I already know. Short of buying another network card and just trying it, or reinstalling Solaris, I've got no idea. Can anybody clue me in as to the next step here?

Edit

I'm buying another network card and just trying it. It's going to take about a week to get here, so I'll keep poking at this in the meantime.

Best Answer

netstat -an, netstat -rn, and lsof (before and during the problem) may give clues. (Do they show too many open connections?) tcpdump may help too: start it just before establishing the connection and see what happens around the time connections start to die (and also a few minutes before the timeouts).
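One rough way to answer the "too many open connections" question is to tally the TCP rows of a netstat -an snapshot by state. The Python sketch below assumes the seven-column Solaris TCP layout shown in the question; tcp_state_counts is a made-up helper name, and the third sample row is invented purely to show a second state.

```python
from collections import Counter


def tcp_state_counts(netstat_text):
    """Count TCP rows per state in 'netstat -an' style output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        parts = line.split()
        # Data rows: local, remote, Swind, Send-Q, Rwind, Recv-Q, State.
        if len(parts) == 7 and parts[2].isdigit():
            counts[parts[-1]] += 1
    return counts


sample_rows = [
    # Rows quoted in the question:
    "10.1.10.254.445      10.1.10.132.53487    64512      0 128480      0 ESTABLISHED",
    "10.1.10.254.445      10.1.10.132.53613    380416      0 128480      0 ESTABLISHED",
    # Hypothetical row, not from the question:
    "10.1.10.254.445      10.1.10.200.50000    64240      0 128480      0 TIME_WAIT",
]

counts = tcp_state_counts("\n".join(sample_rows))
for state, number in counts.most_common():
    print(state, number)
```

Running the same tally on a snapshot taken before the problem and one taken during it makes a sudden pile-up in one state easy to spot.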

Also check whether the NFS options are non-default and could be having an effect:

Try soft instead of hard settings, for example.
Remove all "non-core" options (core being those needed for the NFS mount to be established) and add them back one at a time to see which one(s) cause the issue.

Sorry, but I don't have access to a Solaris machine at the moment to give the exact settings. A web search including the "Solaris" and "NFS" keywords will help you find them.

Potential issue #1 - resolve order

This sounds like a name resolution issue around NMB. It's mentioned in this thread titled: Nautilus doesn't see network computers... [SOLVED].

Resolve order that fails to discover shares:

# What naming service and in what order should we use to resolve host names
# to IP addresses
name resolve order = lmhosts host wins bcast

Resolve order reported to work:

name resolve order = bcast lmhosts host wins

Be sure to restart NMB/SMB services once you've made this change.
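To check which order a given smb.conf actually uses, the setting can be read programmatically. This is a minimal Python sketch, assuming the setting lives in a [global] section (the snippets above do not show the section header) and that the file is simple enough for Python's configparser to read; name_resolve_order is a hypothetical helper name.

```python
import configparser


def name_resolve_order(smb_conf_text):
    """Return the 'name resolve order' methods from [global] as a list."""
    # interpolation=None keeps Samba's %-style variables from being
    # treated as configparser interpolation; strict=False tolerates
    # repeated sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(smb_conf_text)
    value = parser.get("global", "name resolve order", fallback="")
    return value.split()


# Sample text; the [global] header is an assumption, the rest is from above.
sample = """\
[global]
# What naming service and in what order should we use to resolve host names
# to IP addresses
name resolve order = bcast lmhosts host wins
"""

order = name_resolve_order(sample)
print(order)
print("bcast first:", bool(order) and order[0] == "bcast")
```

Real smb.conf files can contain constructs that configparser does not understand, so treat this as a quick check and not a full parser.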

Potential issue #2 - client protocol

Researching your issue further, I came across this tip in this AU Q&A titled: Nautilus fails to see shares in 18.04. The tip from there was to change the following:

$ more /etc/samba/smb.conf
workgroup = WORKGROUP
client max protocol = NT1

After making the above changes it's advised to reboot, not simply restart.

As part of this tip, make sure that the avahi service is running:

$ sudo service avahi-daemon status
$ sudo service avahi-daemon start

Potential issue #3 - firewalld

This askfedora.org article, titled: fedora 27 network browsing doesnt't work. Why?, suggests trying to disable firewalld. It may be blocking ports 137-139, which Samba's NMB/SMB services require to function properly.
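Before disabling the firewall outright, a quick probe from a client can show whether the TCP side of SMB is reachable at all. The Python sketch below is an illustration only: tcp_port_open is a made-up helper, and the host is a placeholder to replace with the server's address. Ports 137 and 138 are UDP, so a TCP connect test only covers 139 and 445.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


host = "127.0.0.1"  # placeholder; use the Samba server's address
for port in (139, 445):
    status = "open" if tcp_port_open(host, port) else "closed or filtered"
    print(f"{host}:{port} {status}")
```

A port that shows as closed or filtered from the client but open when probed on the server itself would point toward a firewall in between.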

Potential issue #4 - Bug 1513394 with gvfs

Continued searching led to this issue, which is still listed as open, titled: Bug 1513394 - Applications using gvfs are unable to browse SMB shares. It has to do with the package gvfs-smb.

Applications using gvfs are unable to browse SMB shares

These steps can be used to see if the issue afflicts your system.

Steps to Reproduce:

1. nmblookup -M -- -
2. nmblookup -M workgroup
3. smbtree
4. gio list network://
5. gio list smb:///
6. gio list smb://workgroup

If things don't work, the results from the steps above will look like this:

1. will return IP address for __MSBROWSE__ special name
2. will return IP address for workgroup master browser
3. will correctly list workgroup, workgroup members and their shares
4. returned items are missing workgroup members
5. will return empty
6. will return an error message "The specified location is not mounted"

If things are working, the results will look like this:

1. OK
2. OK
3. OK
4. returned items should contain workgroup members
5. should contain workgroup name
6. should contain workgroup members

It should be noted that there doesn't appear to be a fix yet for this:

For the record, it doesn't work in Fedora 28 and Samba 4.8 either.

Read the comments on the issue to see the rest of the story.
