wall clock time goes backwards #44

brooksdavis · 2017-05-01T18:42:43Z

We've had a problem for quite a while where wall clock time goes backwards in qemu. It seems to be impossible to trigger reliably, but we're hacking around it in kyua. Unfortunately, it's causing problems with some tests. For example /usr/tests/usr.bin/limits/limits_test cputime_hard_flag often fails in Jenkins due to outlandish time values:

Fail: regexp real[[:space:]]+[34]\.[0-9][0-9] not in stderr
time: command terminated abnormally
real 0.49
user 25012534298.28
sys 3126566791.48

I've also seen xinstall tests fail due to a time warp during test setup causing fails to be created out of order according to their on-disk timestamps.

The text was updated successfully, but these errors were encountered:

staceyson · 2017-05-04T19:53:17Z

I wonder if a work-around would be changing the timer source (w/ kern.timecounter.hardware). It is pretty well known that Qemu doesn't emulate the HPET (High Precision Event Timer) that well: See http://drupal.bitfunnel.net/drupal/linux-under-qemu for example. Maybe use TCP-low or something else?

staceyson · 2017-05-04T20:22:37Z

It seems that mips only has two choices:

kern.timecounter.hardware: MIPS32
kern.timecounter.choice: MIPS32(800) dummy(-1000000)

staceyson · 2017-05-05T17:57:05Z

Is there a simple and deterministic way to reproduce this?

brooksdavis · 2017-05-05T18:03:57Z

Unfortunately there isn't. For me, running the whole freebsd test suite usually triggers at least one event (most of the time, it won't matter since it will be within a non-time dependent test, but without the kyua hack to allow time going backwards, I've never finished a full run on vica). The failures tend to cluster (I get none sometimes and then several in the same VM other types) which leads me to believe it's somehow system load related.

rwatson · 2017-05-05T19:10:14Z

I wonder if there may be some form of wrap-around bug involving the MIPS CP0 counter?

bsdjhb · 2017-05-05T19:51:10Z

The "dummy" timecounter isn't a real one. The MIPS timecounter in qemu wraps every 42 seconds. If qemu is using a host wall clock to compute the cycle counter value and the VM was starved of cycles (or stuck handling I/O, etc.) such that the counter wrapped before the "every 100ms" tc update callout fired then perhaps this could occur? It depends on how the counter is implemented in qemu I think (though 42 seconds is a fairly long time).

staceyson · 2017-05-08T15:58:36Z

Yes, it looks like Qemu is using gettimeofday() (or clock_gettime(CLOCK_MONOTONIC,...) if CLOCK_MONOTONIC is set at compile-time) to supply the MIPS CP0 Count register. Therefore, if the VM was starved of cycles I guess a roll over of the count register could be possible.

Here is something to try... If the command-line option -icount [shift=N|auto] is used it looks like it will use the instruction counter instead of gettimeofday().

brooksdavis · 2017-05-09T01:07:08Z

I've tried a number of -icount values ("shift=1", "shift=auto", "shift=auto,sleep=on") and non result in a usable system. The hang setting up entropy in user space. I did hit keys on the console and eventually one of them went through, but it would randomly stall for no obvious reason from time to time.

staceyson · 2017-05-11T22:40:31Z

578c23d fixes the "Feeding entropy" hang by using some bits from the host clock for the lower 4 bits of the counter. The upper bits are taken from the instruction counter. Now the MIPS CP0 count register should be more deterministic but still give some entropy to seed things. Using the -icount now allows CheriBSD to boot. Give it a try and see if you are still seeing the other random hangs/stalls.

brooksdavis · 2017-05-12T15:12:59Z

With -icount shift=auto I get a mostly functional system, but did see a hang at:

Device configuration finished.
Timecounters tick every 10.000 msec
lo0: bpf attached
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 32768
ata0: reset tp1 mask=03 ostat0=50 ostat1=00
ata0: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
ata0: stat1=0x00 err=0x00 lsb=0xff msb=0xff
ata0: reset tp2 stat0=50 stat1=00 devices=0x1
ata1: reset tp1 mask=03 ostat0=50 ostat1=00
ata1: stat0=0x00 err=0x01 lsb=0x14 msb=0xeb
ata1: stat1=0x00 err=0x00 lsb=0xff msb=0xff
ata1: reset tp2 stat0=00 stat1=00 devices=0x10000

Hitting enter causing it to continue with:

pass0 at ata0 bus 0 scbus0 target 0 lun 0
pass0: <QEMU HARDDISK 2.4.50> ATA-7 device
pass0: Serial Number QM00001
pass0: 33.300MB/s transfers (UDMA2, PIO 8192bytes)
pass1 at ata1 bus 0 scbus1 target 0 lun 0
...

But it really is basically forever (one hung there overnight).

staceyson · 2017-05-12T16:27:18Z

I wonder if using clock_gettime(CLOCK_VIRTUAL,...) might be a better solution to this problem. Qemu currently uses:

/* real time host monotonic timer */
static inline int64_t get_clock_realtime(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000000LL + (tv.tv_usec * 1000);
}

which seems like the worst clock to use. It seem it should at least use clock_gettime(CLOCK_MONOTONIC,...).

Of course, Linux doesn't have the CLOCK_VIRTUAL option so maybe that is why they didn't use it.

In testing using CLOCK_VIRTUAL the hang after ata1:... still seems like it is a problem, however. I am guessing it is waiting for the at a devices to settle using a timer. It is halting and waiting for a timer interrupt since nothing else is running. Therefore, qemu will block and the clock_gettime(CLOCK_VIRTUAL,...) clock doesn't run.

staceyson · 2017-05-12T18:26:26Z

Please give abef7fd a try. This implements the change described above. It may seem like it hangs on ata1: reset... but always seems to get past it. It may take a bit longer, however.

Please note that the -icount option is not needed.

brooksdavis · 2017-05-13T00:03:35Z

With abef7fd I get stuck in ata probing again. I let it go for 10 minutes by stopwatch. I'm finding 65cc9fc to be usable.

brooksdavis · 2017-05-15T17:17:35Z

I've reverted abef7fd because all the tests started failing with timeouts during ATA probe after the fix to #47 caused a new version to be used in Jenkins.

staceyson · 2017-05-15T20:04:31Z

b79c3d9 may help with this problem. According to the CheriBSD header file sys/mips/include/clock.h the MIPS Count register is incremented every two cycles. However, Qemu (hw/mips/cputimer.c) is assuming it is incremented every cycle. Therefore, this change adjusts the emulation code to more accurately emulate the hardware (which effectively increments the counter half as fast).

One note about Qemu and CheriBSD from reviewing the code: Both assume the system clock is 100 Mhz (10ns period). There is an option to scale this to the actual clock in CheriBSD (TICK_USE_MALTA_RTC) but it looks like it requires some additional hardware.

staceyson · 2017-05-18T23:05:53Z

I added a simple emulated realtime clock to Qemu-CHERI (12b39ea) and a RTC driver for it to CheriBSD. I am hoping this will help with this problem.

brooksdavis · 2017-05-22T21:32:11Z

I'm still seeing errors. I've got a simple(ish) repro case:

$ rm -f copyf* && /usr/tests/usr.bin/xinstall/install_test copy_to_nonexistent_backup_safe
install_test: WARNING: Running test cases outside of kyua(1) is unsupported
install_test: WARNING: No isolation nor timeout control is being applied; you may get unexpected failures; see atf-test-case(4)
Executing command [ install -b -B.bak -S testf copyf ]
failed: bad timestamp

This is run on a tmpfs /tmp.
This fails for me fairly reliably (I had written 1/3 of the time, but then got 10 passes in a row so it's not consistent). This time time went backwards around 1 second:

# stat -f %Fm testf
1495364843.445343000
# stat -f %Fm copyf
1495364842.793015000

I've seen previous cases where it that and the seconds were the same. It does appear these little slips are happening pretty often. The qemu binary is at 12b39ea.

I did make an effort to reduce the test case further to a series of shell commands and never could get it to trigger. I somewhat wonder if it needs to be just slow enough to open the window and not too slow so as to mask small clock slips.

staceyson · 2017-05-23T16:25:57Z

I assume that you also added the RTC driver to CheriBSD ( 37c9a9b27b1756a7ee45 ).

I am puzzled on what is causing this to go backwards a second. The timer in Qemu is implemented using a callback timer sort of based on the host clock. It attempts to make it a virtual clock but that is only the case for if the emulation is suspended or something of the like. Qemu emulation can block on a halt/wait instruction but the timer expiration should still fire an IRQ (at least I believe it does). We could try not using the virtual clock for the MIPS count/compare register emulation to see if that changes anything.

Another thing we could possibly try is hacking CheriBSD to make sure the time value it returns to userland is always monotonically increasing (from the last read). If it doesn't then checking and adjusting to the realtime clock. The RTC emulation that I added to Qemu-CHERI is feed simply from the host clock.

brooksdavis · 2017-05-23T16:29:26Z

There driver is indeed installed and attached:

root@beri1:/tmp # grep rtc /var/run/dmesg.boot
rtc0: <Qemu Realtime Clock> on nexus0
rtc0: registered as a time-of-day clock (resolution 1000us, adjustment 0.000500000s)
random: harvesting attach, 8 bytes (4 bits) from rtc0

arichardson mentioned this issue Jan 16, 2018

Jemalloc assertion that time goes backwards CTSRD-CHERI/cheribsd#249

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wall clock time goes backwards #44

wall clock time goes backwards #44

brooksdavis commented May 1, 2017

staceyson commented May 4, 2017

staceyson commented May 4, 2017

staceyson commented May 5, 2017

brooksdavis commented May 5, 2017

rwatson commented May 5, 2017 via email

bsdjhb commented May 5, 2017

staceyson commented May 8, 2017 •

edited

Loading

brooksdavis commented May 9, 2017

staceyson commented May 11, 2017

brooksdavis commented May 12, 2017

staceyson commented May 12, 2017

staceyson commented May 12, 2017 •

edited

Loading

brooksdavis commented May 13, 2017

brooksdavis commented May 15, 2017

staceyson commented May 15, 2017

staceyson commented May 18, 2017

brooksdavis commented May 22, 2017

staceyson commented May 23, 2017

brooksdavis commented May 23, 2017

wall clock time goes backwards #44

wall clock time goes backwards #44

Comments

brooksdavis commented May 1, 2017

staceyson commented May 4, 2017

staceyson commented May 4, 2017

staceyson commented May 5, 2017

brooksdavis commented May 5, 2017

rwatson commented May 5, 2017 via email

bsdjhb commented May 5, 2017

staceyson commented May 8, 2017 • edited Loading

brooksdavis commented May 9, 2017

staceyson commented May 11, 2017

brooksdavis commented May 12, 2017

staceyson commented May 12, 2017

staceyson commented May 12, 2017 • edited Loading

brooksdavis commented May 13, 2017

brooksdavis commented May 15, 2017

staceyson commented May 15, 2017

staceyson commented May 18, 2017

brooksdavis commented May 22, 2017

staceyson commented May 23, 2017

brooksdavis commented May 23, 2017

staceyson commented May 8, 2017 •

edited

Loading

staceyson commented May 12, 2017 •

edited

Loading