Ubuntu – How to troubleshoot network issues with the VPS

Problem: I occasionally have network issues with my Ubuntu VPS. I cannot SSH to the box, and I cannot ping the box by IP address. I can access the box via the host's serial terminal. When I access the box via serial, I can't ping out anywhere (as far as I can tell), even when pinging by IP address. After some amount of time the network comes back, sometimes without my intervention. Sometimes it comes back while I am fiddling around, but it is hard to tell why. (Edit: It is very consistently out for 1 hour)

Questions: How can I proceed in troubleshooting this issue? What things can I do in order to rule out configuration/software problems in my control so that I can feel more comfortable bringing up the issue to my VPS host?

Things I have tried:

Bring eth0 down and up
Disable firewall temporarily
Checked VPS host advisories for network issues – haven't seen any
Reboot the server via Web console
Note: None of these have worked for me

Details:

Ubuntu 10.04.1 LTS
Hosted with Xen virtualization
Have root access (SSH) to perform my own upgrades, installs, etc.
I have the VPS set up as a VPN server so that I can connect to it "Road Warrior" style and forward all my traffic through the VPS first. That accounts for the 10.8.X.X entries in the routing output below.
All traffic, including DNS lookups, is forwarded through the VPS
Use uncomplicated firewall (ufw) with some basic rules
Also acts as a server for some services including Mumble and web server
I set up a script on the VPS as a cron job to ping some common internet entities by IP address every 5 minutes. If a ping fails, it logs the failure to a file. Simple enough. The network outage consistently lasts for an hour. It does not always happen at the same time of day. On almost all of the occurrences, the network is down for an hour and then it "magically" comes back.
Memory usage on my VPS is typically very high. Usually I am maxed out and using some swap. The memory hog is java, if that detail helps.
My provider has been very unhelpful. Responses have ranged from "we are sorry, we had an unfortunate issue" to "there is no problem now". This is frustrating to me because I typically open a ticket while there is a problem, but the problem is gone by the time the ticket is addressed. Their most recent suggestion is to reformat my VPS and start over, which I am not keen on.
Consistently network outages start on the hour (within 5-10 minutes). That is, network outages do not start around XX:30, XX:45, etc.
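The "starts on the hour, lasts about an hour" pattern can be checked directly against the cron job's failure log. The following is only a minimal sketch: it assumes the log holds one `YYYY-MM-DD HH:MM:SS` timestamp per failed check (a hypothetical format, since the actual script is not shown) and that checks run every 5 minutes. The function names are illustrative.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) log format: one timestamp per failed ping check.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_log(lines):
    """Turn non-empty log lines into datetime objects."""
    return [datetime.strptime(line.strip(), LOG_FORMAT)
            for line in lines if line.strip()]


def group_outages(failure_times, interval=timedelta(minutes=5)):
    """Merge consecutive failed checks into (start, end) outage windows."""
    outages = []
    for t in sorted(failure_times):
        # Treat failures at most 1.5 intervals apart as one outage,
        # which leaves some slack for cron jitter.
        if outages and t - outages[-1][1] <= interval * 1.5:
            outages[-1][1] = t
        else:
            outages.append([t, t])
    return [(start, end) for start, end in outages]


def summarize(outages, interval=timedelta(minutes=5)):
    """Report when each outage started and roughly how long it lasted."""
    rows = []
    for start, end in outages:
        # One extra interval covers the gap after the last failed check.
        duration = end - start + interval
        rows.append({
            "start": start,
            "minutes_past_hour": start.minute,
            "approx_duration_min": int(duration.total_seconds() // 60),
        })
    return rows
```

A table of `minutes_past_hour` and `approx_duration_min` values that cluster near 0-10 and 60 would be concrete evidence to attach to a support ticket.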

netstat -rn

    Kernel IP routing table  
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface  
    10.8.0.2        0.0.0.0         255.255.255.255 UH        0 0          0 tun0  
    XX.57.166.0     0.0.0.0         255.255.255.128 U         0 0          0 eth0  
    192.168.50.0    10.8.0.2        255.255.255.0   UG        0 0          0 tun0  
    10.8.0.0        10.8.0.2        255.255.255.0   UG        0 0          0 tun0  
    0.0.0.0         XX.57.166.1     0.0.0.0         UG        0 0          0 eth0

ip route list

    10.8.0.2 dev tun0  proto kernel  scope link  src 10.8.0.1  
    XX.57.166.0/25 dev eth0  proto kernel  scope link  src XX.57.166.59  
    192.168.50.0/24 via 10.8.0.2 dev tun0  
    10.8.0.0/24 via 10.8.0.2 dev tun0  
    default via XX.57.166.1 dev eth0  metric 100

cat /etc/network/interfaces

    auto eth0  
    iface eth0 inet static  
        address XX.57.166.59  
        gateway XX.57.166.1  
        netmask 255.255.255.128  
    auto lo  
    iface lo inet loopback

Best Answer

Firstly, if you believe this is a vendor issue that they're not addressing, I'd strongly consider migrating away. I gave VPS.net the benefit of the doubt when their SAN kept crashing (taking down all the VPSes in the process), but after a few months of "We've fixed this for good" and it still crashing, I had to vote with my wallet.

It's surprisingly easy to start a VPS company (you really only need a bit of datacenter space and some servers) so they're not all equal in technical ability even before you get to customer service.

But in terms of getting to the bottom of the problem, I'd first look at stopping things ending up in swap. Leave swap on but do whatever you have to do so you're not pushing things that far. Rein in the Java application or add more RAM. And see what happens. If this is very regular, you shouldn't have to wait long (or pay much) to see a result.
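To see whether outages line up with heavy swapping, swap usage can be logged alongside the ping results. This is a rough sketch that parses the text of `/proc/meminfo` (the `SwapTotal` and `SwapFree` fields, reported in kB on Linux). The function names are illustrative and not part of any existing tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (kB values)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(values):
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)
```

On the VPS itself this could be called as `swap_used_kb(parse_meminfo(open("/proc/meminfo").read()))` from the same cron job, so each log line carries a swap figure.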

Same with CPU. If you have things running at 100% for extended periods, you want to make sure they're not interfering with other applications. The simplest way to do this is to set the nice value of whichever applications are running rampant to something positive. A nice value of something like +10 should let the rest of the system get priority on the CPU ahead of your applications. Sidebar: a higher nice value basically means a process is more polite when it comes to CPU scheduling. Something with a low nice value (e.g. -20) gets prioritised over everything with a higher nice value.
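As an illustration of the nice-value suggestion, here is a small sketch that raises the nice value of an already running process, roughly what `renice` does from the shell. It uses `os.setpriority` from the Python standard library (POSIX only). The helper name `renice` and the default of 10 are illustrative choices, and the PID of the Java process would have to be looked up separately.

```python
import os


def renice(pid, value=10):
    """Set the nice value of process `pid` and return the value now in effect.

    Raising the value (being "nicer") is allowed for your own processes;
    lowering it normally requires root.
    """
    os.setpriority(os.PRIO_PROCESS, pid, value)
    return os.getpriority(os.PRIO_PROCESS, pid)
```

For example, `renice(1234)` would move a hypothetical process 1234 to nice +10.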

If you can, expand your testing to other local network items. If they provide a DNS resolver (as a lot of server companies do), ping that constantly (well, a few times a minute) and log the results. If you can still access that during periods of downtime, it's less likely that it's your fault.
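One way to widen the test is to probe several targets on each run and log them side by side. The sketch below assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds). The target names and the addresses in the usage note are placeholders from documentation ranges, not real hosts. The probe function is passed in as a parameter so the logic can be exercised without network access.

```python
import subprocess
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_targets(targets, probe=ping_once):
    """Probe each named target and return {name: reachable}."""
    return {name: probe(host) for name, host in targets.items()}


def format_log_line(results, now=None):
    """Render one log line such as '2011-01-01 14:05:00 external=DOWN gateway=up'."""
    now = now or datetime.now()
    status = " ".join(
        f"{name}={'up' if ok else 'DOWN'}" for name, ok in sorted(results.items())
    )
    return f"{now:%Y-%m-%d %H:%M:%S} {status}"
```

A cron job could call `check_targets({"gateway": "192.0.2.1", "resolver": "192.0.2.53", "external": "198.51.100.1"})` with the provider's real gateway and resolver substituted in, then append `format_log_line(...)` to a file. Lines where the local targets stay up while the external one is down point away from the VPS's own configuration.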

And as I say, if this isn't your fault, move. If you spend any more time trying to fix this, you're outweighing any conceivable benefit of staying with these people. I personally have a very good and long experience with Linode but there are lots of good companies out there.
