VM with ntpd running but not syncing

TL;DR

A KVM virtual machine does not keep its time synchronized. After a 2-minute suspend, it keeps a permanent 2-minute offset. Setting up another VM with a different network configuration showed that the network configuration was preventing NTP from working. Fixing that network issue is out of scope here.

However, the new VM, which does not have the network issue, does not resynchronize after a resume either. Same test: suspend for 2 minutes, then compare the date with a machine that is properly synced. The 2-minute offset is permanent.

This seems to be a common issue, and there is controversy about how to keep a VM synchronized and about using NTP and kvm-clock at the same time. I found many references to it but no answer.

Question

I have a Debian VM with ntpd running but not correcting time. For instance, after a suspend/resume, I get a permanent 2 minute offset.

/etc/ntp.conf is default or close to default, nothing fancy:

# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

driftfile /var/lib/ntp/ntp.drift


# Enable this if you want statistics to be logged.
#statsdir /var/log/ntpstats/

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable


# You do need to talk to an NTP server or two (or three).
#server ntp.your-provider.example

# pool.ntp.org maps to about 1000 low-stratum NTP servers.  Your server will
# pick a different set every time it starts up.  Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.debian.pool.ntp.org iburst
server 1.debian.pool.ntp.org iburst
server 2.debian.pool.ntp.org iburst
server 3.debian.pool.ntp.org iburst


# Access control configuration; see /usr/share/doc/ntp-doc/html/accopt.html for
# details.  The web page <http://support.ntp.org/bin/view/Support/AccessRestrictions>
# might also be helpful.
#
# Note that "restrict" applies to both servers and clients, so a configuration
# that might be intended to block requests from certain clients could also end
# up blocking replies from your own upstream servers.

# By default, exchange time with everybody, but don't allow configuration.
restrict -4 default kod notrap nomodify nopeer noquery
restrict -6 default kod notrap nomodify nopeer noquery

# Local users may interrogate the ntp server more closely.
restrict 127.0.0.1
restrict ::1

# Clients from this (example!) subnet have unlimited access, but only if
# cryptographically authenticated.
#restrict 192.168.123.0 mask 255.255.255.0 notrust


# If you want to provide time to your local subnet, change the next line.
# (Again, the address is an example only.)
#broadcast 192.168.123.255

# If you want to listen to time broadcasts on your local subnet, de-comment the
# next lines.  Please do this only if you trust everybody on the network!
#disable auth
#broadcastclient

ntpq seems to report a problem:

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 37.187.7.160    .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 195.154.211.37  .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 195.154.216.44  .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 95.81.173.155   .INIT.          16 u    - 1024    0    0.000    0.000   0.000

I'm not a netcat wizard, but AFAIU outgoing traffic on UDP port 123 goes through:

# nc -vvzu 37.187.7.160 123
mail.lafkor.de [37.187.7.160] 123 (ntp) open
 sent 0, rcvd 0

Is this test enough to rule out a firewall issue?
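
A note on this test: UDP is connectionless, so nc -z -u printing "open" together with "sent 0, rcvd 0" only shows that no error came back. It does not show that an NTP reply can return. A more telling check is to send a real NTP client packet and wait for an answer. Below is a minimal Python sketch of that idea. The function names (build_request, parse_reply, probe) and the 3-second timeout are hypothetical choices for illustration, and the probe sends from an unprivileged source port, so it does not exercise exactly the same path as ntpd, which sends from port 123.

```python
import socket
import struct

NTP_PORT = 123
NTP_UNIX_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """Return a 48-byte NTP client request (LI=0, VN=3, Mode=3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_reply(data: bytes) -> dict:
    """Extract mode, stratum and the server transmit time (Unix seconds)."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    first, stratum = data[0], data[1]
    # The transmit timestamp occupies the last 8 bytes of the 48-byte header.
    tx_seconds, tx_fraction = struct.unpack("!II", data[40:48])
    return {
        "mode": first & 0x07,  # 4 means "server"
        "stratum": stratum,
        "unix_time": tx_seconds - NTP_UNIX_DELTA + tx_fraction / 2**32,
    }


def probe(server: str, timeout: float = 3.0):
    """Send one request; return the parsed reply, or None if nothing comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, NTP_PORT))
            data, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable-network errors
            return None
    return parse_reply(data)
```

Calling probe("37.187.7.160") on the VM and getting None back, while the same call works on the host, would point at replies being dropped somewhere on the path rather than at the ntpd configuration.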

The host (also a Debian machine) has the same NTP configuration, and synchronization works there. The two machines have different network configurations, which is why I suspect a network issue.

Any other useful test I could run?

I don't think the tinker panic 0 parameter is relevant here, as it is meant to force updates on huge gaps, not 2-minute gaps. And anyway, AFAIU, it would only change how ntpd behaves once it has measured a time offset; it would not explain ntpq -pn returning only zeros.

FWIW, here are other test outputs, inspired by this question:

# ntpq
ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 mail.lafkor.de  .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 atoll.tropicdre .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 oods.roflcopter .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 ntp-3.arkena.ne .INIT.          16 u    - 1024    0    0.000    0.000   0.000
ntpq> as

ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 21025  8011   yes    no  none    reject    mobilize  1
  2 21026  8011   yes    no  none    reject    mobilize  1
  3 21027  8011   yes    no  none    reject    mobilize  1
  4 21028  8011   yes    no  none    reject    mobilize  1
ntpq> rv
associd=0 status=c012 leap_alarm, sync_unspec, 1 event, freq_set,
version="ntpd 4.2.6p5@1.2349-o Fri Apr 10 19:04:04 UTC 2015 (1)",
processor="x86_64", system="Linux/3.16.0-4-amd64", leap=11, stratum=16,
precision=-23, rootdelay=0.000, rootdisp=6683.055, refid=INIT,
reftime=00000000.00000000  Mon, Jan  1 1900  0:09:21.000,
clock=d9b51587.b7a1085f  Tue, Sep 29 2015 15:49:59.717, peer=0, tc=3,
mintc=3, offset=0.000, frequency=-0.125, sys_jitter=0.000,
clk_jitter=0.000, clk_wander=0.000
ntpq> rv 21025
associd=21025 status=8011 conf, sel_reject, 1 event, mobilize,
srcadr=mail.lafkor.de, srcport=123, dstadr=147.210.157.185, dstport=123,
leap=11, stratum=16, precision=-23, rootdelay=0.000, rootdisp=0.000,
refid=INIT, reftime=00000000.00000000  Mon, Jan  1 1900  0:09:21.000,
rec=00000000.00000000  Mon, Jan  1 1900  0:09:21.000, reach=000,
unreach=1137, hmode=3, pmode=0, hpoll=10, ppoll=10, headway=0,
flash=1600 peer_stratum, peer_dist, peer_unreach, keyid=0, offset=0.000,
delay=0.000, dispersion=15937.500, jitter=0.000, xleave=0.167,
filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtoffset=    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
filtdisp=   16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0 16000.0
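
For reading these billboards, the reach column is the key field: it is an 8-bit shift register printed in octal, where each bit records whether one of the last eight polls got a reply. A value of 0 together with refid .INIT. means no reply has been received at all, while 377 means all eight polls were answered. Here is a small Python sketch for extracting that column from ntpq -pn output. It assumes the usual ten-column billboard layout with the tally character in the first column, and parse_peers is a hypothetical helper, not part of ntpq.

```python
def parse_peers(text: str) -> list:
    """Parse an `ntpq -pn` billboard into one dict per peer."""
    peers = []
    for line in text.splitlines():
        # Column 0 holds the tally code (' ', '*', '+', '-', ...).
        tally, fields = line[:1], line[1:].split()
        # Peer rows have 10 columns; skip the header and the ===== rule.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        reach = int(fields[6], 8)  # reach is printed in octal
        peers.append({
            "tally": tally.strip(),
            "remote": fields[0],
            "refid": fields[1],
            "stratum": int(fields[2]),
            "reach": reach,
            "replies_in_last_8_polls": bin(reach).count("1"),
        })
    return peers
```

Fed the output shown earlier, this reports reach 0 for all four servers, which matches the reach=000 and peer_unreach flags in the rv output: the VM never hears back from any of them.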

tcpdump / ntpdate tests

On a machine where NTP sync works correctly, I launch tcpdump udp port ntp and when I restart ntpd, I see this kind of output:

# tcpdump udp port ntp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
17:31:33.719166 IP 10.0.2.15.ntp > spica.beduzar.fr.ntp: NTPv4, Client, length 48
17:31:33.736804 IP spica.beduzar.fr.ntp > 10.0.2.15.ntp: NTPv4, Server, length 48
17:31:35.973551 IP 10.0.2.15.ntp > ntp.tuxfamily.net.ntp: NTPv4, Client, length 48
17:31:35.992671 IP ntp.tuxfamily.net.ntp > 10.0.2.15.ntp: NTPv4, Server, length 48
[...]

On the machine I have the issue with, I don't see any output at all when restarting ntpd (no request, no reply). Shouldn't I at least see the requests?

On the good machine:

# ntpdate 0.debian.pool.ntp.org
29 Sep 17:24:49 ntpdate[700]: adjust time server 193.55.167.1 offset -0.005196 sec

On the bad machine:

# ntpdate 0.debian.pool.ntp.org
29 Sep 17:43:18 ntpdate[3180]: no server suitable for synchronization found

Test with another VM

We set up another VM with the same NTP configuration but a different network configuration.

The results of tcpdump and ntpdate are correct, and ntpq -pn returns good results. So apparently the network configuration is indeed the issue on the faulty VM.

However, the new VM does not synchronize either. If I suspend it so that it lags by about 100 s, it does not resynchronize (I mean that after a few minutes, the gap is still the same number of seconds). However, when I restart ntpd, it synchronizes instantly.

I appear to have two issues:

  • Network config on the first VM

  • ntpd does not resynchronize on either VM (unless restarted)

Best Answer

Problem solved.

Network issue

The VM had network issues that prevented ntpd from working. It has two eth interfaces, and the one with the gateway goes through a router we don't manage directly. Although my tests didn't show it, I guess some UDP packets were blocked. We set up another VM with a different network configuration, and ntpq yielded better results.

Ultimately, we changed the NTP configuration so that the host broadcasts time locally and all VMs synchronize to it. This makes more sense and minimizes load on public NTP servers.

ntpd sets clock instantly after a few minutes

One thing that probably misled me during the tests is that ntpd does not synchronize immediately. I thought it would detect a gap right away and then adjust the clock speed so that the clock gradually converges on the source clock. In fact, we noticed that (unless ntpd is restarted) the clock stays unchanged for a few minutes and then is suddenly set, seemingly in one jump. In the meantime, the rightmost columns in the ntpq output show that synchronization is in progress.

This ntpd behavior probably explains why I thought ntpd didn't work even though it did. I just didn't wait long enough, and I didn't understand the ntpq output.
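
The delay we observed is consistent with how ntpd is documented to treat large offsets. Offsets under the step threshold (128 ms by default) are slewed gradually. Larger ones are first ignored as possible spikes, and the clock is stepped only once they have persisted past the stepout interval. Offsets beyond the panic threshold (1000 s by default) make ntpd give up unless told otherwise. Here is a simplified model of that decision, as a sketch only: the 900 s stepout value is an assumption for ntpd 4.2.6 (check the documentation for your version), and clock_action is a hypothetical function, not ntpd code.

```python
STEP_THRESHOLD = 0.128    # seconds; ntpd default step threshold
STEPOUT = 900.0           # seconds; ASSUMED default for ntpd 4.2.6
PANIC_THRESHOLD = 1000.0  # seconds; ntpd default panic threshold


def clock_action(offset: float, seconds_over_threshold: float) -> str:
    """Return what a running ntpd would roughly do with a measured offset.

    offset: measured clock offset in seconds (sign is ignored).
    seconds_over_threshold: how long the offset has exceeded the step threshold.
    """
    magnitude = abs(offset)
    if magnitude > PANIC_THRESHOLD:
        return "panic"  # too large to trust; ntpd exits by default
    if magnitude <= STEP_THRESHOLD:
        return "slew"   # small offset: adjust the clock rate gradually
    if seconds_over_threshold < STEPOUT:
        return "wait"   # large offset, but it might still be a spike
    return "step"       # large offset persisted: set the clock in one jump
```

With a 120 s offset after a resume, clock_action(120, 60) gives "wait" and clock_action(120, 900) gives "step", which matches what we saw: nothing for a few minutes, then a sudden jump. Restarting ntpd skips the wait because a freshly started daemon sets the clock as soon as it has its first good measurements.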
