Centos – Amazon AWS Centos 7 System Clock is fast


We have an Amazon AWS instance running CentOS Linux 7 (Core), though the problem may not be specific to this system.

A few days ago the system clock (date) started running very fast.
If we sync it with the hardware clock (hwclock), after about 10-20 minutes the system clock (date) is 48 seconds ahead.
And 48 seconds is the maximum offset: after a few hours it is still 48 seconds ahead.

I know that a small offset is normal, but a 48-second offset after ~10-20 minutes is not.
I also know that there are tools and libraries such as adjtimex that can use a "delta" value to adjust the system time.
But in my case the speed-up stops once the offset reaches ~48 seconds.
So hwclock will print, for example, 12:00:00 and date will print 12:00:48.
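
To put a number on the gap between the two clocks, both readings can be converted to epoch seconds and subtracted. This is only a minimal sketch, assuming GNU date (as shipped with CentOS); clock_offset is a hypothetical helper name, not an existing tool.

```shell
#!/bin/bash
# Hypothetical helper: print how many seconds the second timestamp
# is ahead of the first. Both arguments can be any string that
# GNU "date -d" accepts.
clock_offset() {
    local first second
    first=$(date --utc -d "$1" +%s) || return 1
    second=$(date --utc -d "$2" +%s) || return 1
    echo $(( second - first ))
}
```

For the example above, clock_offset "2015-01-16 12:00:00 UTC" "2015-01-16 12:00:48 UTC" prints 48. On a real instance the first argument would come from hwclock and the second from date; the hwclock output format differs between versions, so it may need trimming before date -d will parse it.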

I tried:

Installed ntpdate and synced the time with ntpdate pool.ntp.org
Ran hwclock --hctosys to set the system time from the hardware clock. Also tried hwclock --systohc after syncing the time (date) with ntpdate
Created the file /etc/sysconfig/clock with the "HWCLOCK_ADJUST" parameter set to true. Also tried it with false
Deleted the file /etc/adjtime (or a similarly named file), which contained UTC and zero values

But with no luck.

After syncing the time, I ran the following loop:

$ while true; do ntpdate pool.ntp.org; sleep 60; done

16 Jan 15:29:45 ntpdate[20656]: step time server 129.250.35.251 offset -4.977822 sec
16 Jan 15:30:46 ntpdate[20743]: step time server 129.250.35.251 offset -5.117517 sec
16 Jan 15:31:48 ntpdate[20813]: step time server 74.117.214.3 offset -4.853926 sec
16 Jan 15:32:50 ntpdate[20890]: step time server 23.239.26.89 offset -5.583270 sec
16 Jan 15:33:51 ntpdate[20941]: step time server 74.117.214.3 offset -4.983483 sec
16 Jan 15:34:53 ntpdate[20994]: step time server 12.167.151.1 offset -5.150401 sec
16 Jan 15:35:54 ntpdate[21080]: step time server 173.255.206.154 offset -5.256357 sec
16 Jan 15:37:03 ntpdate[21155]: adjust time server 12.167.151.1 offset 0.011276 sec
16 Jan 15:38:09 ntpdate[21205]: adjust time server 108.61.56.35 offset -0.019818 sec
16 Jan 15:39:16 ntpdate[21241]: adjust time server 108.61.56.35 offset -0.285154 sec
16 Jan 15:40:18 ntpdate[21660]: step time server 108.61.56.35 offset -5.227262 sec
16 Jan 15:41:19 ntpdate[21706]: step time server 108.61.73.244 offset -5.474606 sec
16 Jan 15:42:20 ntpdate[21756]: step time server 108.61.73.244 offset -5.286961 sec
16 Jan 15:43:22 ntpdate[21791]: step time server 108.61.73.244 offset -4.808674 sec
16 Jan 15:44:29 ntpdate[21885]: adjust time server 96.244.96.19 offset -0.010287 sec
16 Jan 15:45:36 ntpdate[21952]: adjust time server 96.244.96.19 offset -0.000296 sec
16 Jan 15:46:43 ntpdate[22013]: adjust time server 96.244.96.19 offset -0.012838 sec
16 Jan 15:47:51 ntpdate[22126]: adjust time server 198.206.133.14 offset -0.347436 sec
16 Jan 15:48:53 ntpdate[22220]: step time server 198.206.133.14 offset -5.570427 sec
16 Jan 15:49:57 ntpdate[22300]: step time server 198.206.133.14 offset -5.229636 sec
16 Jan 15:50:58 ntpdate[22367]: step time server 104.131.53.252 offset -5.466987 sec
16 Jan 15:52:00 ntpdate[22407]: step time server 104.131.53.252 offset -5.298659 sec
16 Jan 15:53:01 ntpdate[22462]: step time server 104.131.53.252 offset -5.127748 sec
16 Jan 15:54:03 ntpdate[22578]: step time server 129.6.15.30 offset -5.014787 sec
16 Jan 15:55:05 ntpdate[22617]: step time server 129.6.15.30 offset -5.144181 sec
16 Jan 15:56:06 ntpdate[22694]: step time server 129.6.15.30 offset -5.436509 sec
16 Jan 15:57:08 ntpdate[22733]: step time server 96.238.43.39 offset -5.038639 sec

Can anyone tell me what is going on here?
Does this mean that the system clock sometimes runs correctly for about 3-4 minutes?
Before seeing these logs, I thought it always sped up until it was 48 seconds ahead.
The log lines are not exactly 60 seconds apart because ntpdate takes a few seconds to run and prints its line only after syncing.
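
A log like the one above is easier to read once it is summarised. The following is a rough sketch, not part of ntpdate itself: summarize_ntpdate is a hypothetical helper, and it assumes every line has exactly the field layout shown in the log (the fifth field is "step" or "adjust", and the offset is the second-to-last field).

```shell
#!/bin/bash
# Hypothetical helper: summarise ntpdate output read from stdin.
# Counts "step" and "adjust" corrections and averages the step offsets.
summarize_ntpdate() {
    awk '
        # Assumed layout, taken from the log above:
        # day month time ntpdate[pid]: step|adjust time server IP offset N sec
        $5 == "step"   { steps++; sum += $(NF-1) }
        $5 == "adjust" { adjusts++ }
        END {
            printf "step=%d adjust=%d\n", steps, adjusts
            if (steps > 0)
                printf "mean step offset: %.3f sec\n", sum / steps
        }'
}
```

Saving the loop output to a file and running summarize_ntpdate < that file gives the number of step and adjust corrections. Since the loop sleeps 60 seconds between runs, a mean step offset of roughly -5 seconds would suggest that the clock gains about 5 seconds per minute whenever it is misbehaving.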

We worked around this issue by running ntpdate (ntp) as a service so that the date is synced automatically.

What could cause such a sudden, large speed-up?

If this is not a common issue, we will contact Amazon support for help.

Best Answer

The problem was probably in one of the hypervisors: its clock may have been skewed by 48 seconds. This happens, and it is not unique to AWS.

There was also a Xen bug, though I have no idea whether it still applies. (Hasn't AWS migrated to KVM?)

Amazon advises installing chrony and syncing it with one of their NTP servers. Have a look at the AWS docs: EC2 - Setting the Time for Your Linux Instance

As in:

sudo yum erase 'ntp*'
sudo yum install chrony

Create /etc/chrony.conf with:

server 169.254.169.123 prefer iburst

And lastly:

sudo service chronyd start
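
If /etc/chrony.conf already exists, the server line can be appended instead of recreating the file. The sketch below is one possible way to do that idempotently; ensure_chrony_server is a hypothetical helper name, and the default path is the one used above.

```shell
#!/bin/bash
# Hypothetical helper: append the server line to a chrony config file
# unless an identical line is already present.
ensure_chrony_server() {
    local conf=${1:-/etc/chrony.conf}
    local line='server 169.254.169.123 prefer iburst'
    # -x matches whole lines only, -F treats the pattern as a fixed string
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}
```

Run it as root (for example from a provisioning script) before starting chronyd. Once the service is running, chronyc tracking and chronyc sources -v should show whether 169.254.169.123 is actually being used as the time source.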

Another thing to try, per a comment by @jordanm, is stopping and then starting the EC2 server. You might get lucky and have it come up on another hypervisor whose clock is not skewed.

If these actions still do not solve the problem, I would open a ticket with Amazon.

Related Solutions

Shell – Leap seconds and date

I didn't find a simple solution to my question, so I've written a small Bash script to solve it. You need to download the leap seconds file from the link given below and either put it next to the script or change the path to it. I haven't written utc2unix.sh yet, but the script is very easy to adapt. Do not hesitate to comment or give suggestions...

unix2utc.sh:

#!/bin/bash

# Convert a Unix timestamp to the real number of seconds
# elapsed since the epoch.

# Note: this script only handles added (positive) leap seconds

# Download leap-seconds.list from
# https://github.com/eggert/tz/blob/master/leap-seconds.list

# Get current timestamp if nothing is given as first param
if [ -z "$1" ]; then
    posix_time=$(date --utc +%s)
else
    posix_time=$1
fi

# Get the time at which leap seconds were added
# (awk splits on any whitespace, so tab-separated files work too)
seconds_list=$(grep -v "^#" leap-seconds.list | awk '{print $1}')

# Find the last leap second (see the content of leap-seconds.list)
# 2208988800 seconds between 01-01-1900 and 01-01-1970:
leap_seconds=$(echo $seconds_list | \
               awk -v posix_time="$posix_time" \
               '{for (i=NF;i>0;i--)
                   if (($i-2208988800) < posix_time) {
                    print i-1; exit
                    }
                } END {if (($(i+1)-2208988800) == posix_time) 
                    print "Warning: POSIX time ambiguity:",
                            posix_time,
                          "matches 2 values in UTC time!",
                          "The smallest value is given." | "cat 1>&2"
                }')
# echo $leap_seconds

# Add the leap seconds to the timestamp
# (falls back to 0 when no list entry precedes the timestamp)
seconds_since_epoch=$(($posix_time + ${leap_seconds:-0}))

echo $seconds_since_epoch

Just some tests:

date --utc +%s && ./unix2utc.sh -> today and at least until June 2015 the difference is 25 sec.
./unix2utc.sh 78796799 -> 78796799
./unix2utc.sh 78796801 -> 78796802
./unix2utc.sh 78796800 -> 78796800 + on stderr: Warning: POSIX time ambiguity: 78796800 matches 2 consecutive values in UTC time! Only the smallest value is given.
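
The test values above follow from the 2208988800-second offset used in the script. As a small worked example, the sketch below converts a timestamp counted from 1900 (as in leap-seconds.list) to a Unix timestamp. ntp_to_unix is a hypothetical helper, and 2287785600 is assumed to be the list's entry for 1 July 1972.

```shell
#!/bin/bash
# Seconds between 1900-01-01 (epoch used in leap-seconds.list)
# and 1970-01-01 (Unix epoch).
NTP_UNIX_OFFSET=2208988800

# Hypothetical helper: convert a timestamp counted from 1900
# to a Unix timestamp.
ntp_to_unix() {
    echo $(( $1 - NTP_UNIX_OFFSET ))
}
```

ntp_to_unix 2287785600 prints 78796800, the ambiguous value from the last test, and date --utc -d @78796800 shows that it corresponds to midnight on 1 July 1972.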

Centos – Force time to stay put

For this answer, I'll assume that there may be several elements working hard to set your time straight. Since I don't really want to wild-guess about which one is working against you, I'll try and give you an answer which should help you find it yourself instead.

On a UNIX system, the clock can typically be set using the stime system call. As things evolved, it also became possible to set the clock more accurately using the clock_settime call instead. You might also come across settimeofday. When running date --set on a CentOS machine, strace revealed that it used clock_settime.

Knowing this, a solution would be to trace these system calls. Fortunately, Linux has a mechanism for that: the kernel's event tracing (ftrace), which is exposed through debugfs. On my system, the output of mount shows that debugfs is available at /sys/kernel/debug:

$ mount
none on /sys/kernel/debug type debugfs (rw)
...

However, on some systems (including RedHat and probably CentOS), it isn't mounted at boot time. You'll therefore need to run...

# mount -t debugfs nodev /sys/kernel/debug

Also note that if you were in that directory before mounting, you might have to go out and back in before files start to appear in it.
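
To avoid mounting debugfs twice, the mount table can be checked first. A minimal sketch, assuming the usual /proc/mounts layout where the third column is the filesystem type; fs_type_mounted is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a filesystem of the given type is
# listed in a mount table (default: /proc/mounts, whose third column
# is the filesystem type).
fs_type_mounted() {
    local fstype=$1 table=${2:-/proc/mounts}
    awk -v t="$fstype" '$3 == t { found = 1 } END { exit !found }' "$table"
}
```

As root, fs_type_mounted debugfs || mount -t debugfs nodev /sys/kernel/debug then mounts it only when needed.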

Now we're ready to go. Let's enable the trace for our system calls. I'm tracing all of them because I don't really want to check which one is actually being used. Tracing for system calls can be set in /sys/kernel/debug/tracing/events/syscalls. In this directory, you should find...

sys_enter_stime
sys_enter_clock_settime
sys_enter_settimeofday

... depending on what's available on your system.

These correspond to the events of entering our system calls, which is what we want to trace (you'll also find sys_exit_* directories). Within each directory, you'll find a file named enable, the contents of which should appear to be 0. To trace these calls, set that to 1 instead:

# echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_stime/enable
# echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_clock_settime/enable
# echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_settimeofday/enable
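
Because not every system provides all three events, the same step can be written as a loop that skips the missing ones. This is just a sketch; set_time_trace is a hypothetical helper, and the default directory is the one named above.

```shell
#!/bin/bash
# Hypothetical helper: write 0 or 1 to the enable file of each
# time-setting syscall event that exists under the events directory.
set_time_trace() {
    local value=$1
    local events=${2:-/sys/kernel/debug/tracing/events/syscalls}
    local call
    for call in stime clock_settime settimeofday; do
        # Skip events this kernel does not provide
        if [ -e "$events/sys_enter_$call/enable" ]; then
            echo "$value" > "$events/sys_enter_$call/enable"
        fi
    done
}
```

As root, set_time_trace 1 switches the available events on, and set_time_trace 0 switches only these events off again without touching any other syscall tracing.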

Now that we've set our trap, just wait until something sets your time to its correct value. Once that has happened, read the trace log with...

# cat /sys/kernel/debug/tracing/trace

Now, unless something went wrong, you should see one of the following lines:

time_warrior-xxxxx [xxx] .... x.x: sys_stime(...)
time_warrior-xxxxx [xxx] .... x.x: sys_clock_settime(...)
time_warrior-xxxxx [xxx] .... x.x: sys_settimeofday(...)

Each line starts with the name of the process that made the system call (time_warrior here), followed by a hyphen and its PID. Now go get it:

# ps -fp xxxxx
UID        PID  PPID  C STIME TTY          TIME CMD
root     XXXXX XXXXX  0 hh:mm ?        hh:mm:ss time_warrior
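
When the trace is long, the culprits can be extracted automatically. The sketch below assumes that each trace line starts with the process name, a hyphen, the PID and then the CPU number in square brackets; time_setters is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read trace output on stdin and print "PID NAME"
# once for every process that entered a time-setting system call.
time_setters() {
    awk '
        /sys_(stime|clock_settime|settimeofday)\(/ &&
        match($0, /-[0-9]+ +\[[0-9]+\]/) {
            # Everything before the matched "-PID [cpu]" is the process name
            name = substr($0, 1, RSTART - 1)
            sub(/^ +/, "", name)
            # The matched text minus its leading hyphen starts with the PID
            split(substr($0, RSTART + 1, RLENGTH - 1), part, " ")
            key = part[1] " " name
            if (!(key in seen)) { seen[key] = 1; print key }
        }'
}
```

Running time_setters < /sys/kernel/debug/tracing/trace as root lists each PID once, ready to be passed to ps -fp. A process name that itself contains a hyphen followed by digits and a bracketed number would confuse this simple pattern.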

You should now have everything you need to make sure your system stops getting the time right. The simplest thing would probably be to kill the process and make sure it isn't spawned at boot time. Of course, you'll have to make sure it doesn't serve a more important purpose before doing so: you don't want to completely crash your system...

Also remember to disable the trace when you're done by writing 0 to the files we edited earlier. A shortcut could be:

# echo 0 > /sys/kernel/debug/tracing/events/syscalls/enable

(this file acts as a master switch for all the others; it lets you switch all system call tracing off)

Note: as Mark Plotnick said in a comment, systemtap could be a slightly easier way to achieve similar results. I'll let him write a stap answer if he feels like it, since I'm not fluent with stap scripts at all.
