Skip to content

Latest commit

 

History

History
254 lines (196 loc) · 13.6 KB

README.md

File metadata and controls

254 lines (196 loc) · 13.6 KB

This repository contains OSjitter, Pingpong and other latency/performance related utilities.

OSjitter is a tool for measuring how much the operating system interrupts programs. Such interruptions increase the latency of a program while the variation in latency is called jitter.

This tool can be used to quickly measure a lower bound for the latency of a given system configuration. Note that the OS jitter depends on the kind of load a real-time program is applying to a system. Thus, one still needs to execute a domain specific test-suite to the real-time program of interest after a tool like OSjitter shows good results.

The Pingong utility measures the overhead of several thread notification mechanisms such as spinning on a atomic variable (with/without pauses), POSIX condition variables, semaphores, pipes and raw Linux futexes.

The ptp-clock-offset utility is a small program for checking the availability of different PTP offset ioctls and how they perform. Rule of thumb: using any PTP offset ioctl is better than having to use clock_gettime() and smaller delays are better.

There is also a microbenchmark (bench_syscalls.cc) that measures some (seemingly) low-overhead syscalls in order to measure the userspace to kernelspace mode-switch costs. See also a related blog post for some results.

2019, Georg Sauthoff [email protected], GPLv3+

Example Session

Check out the help:

$ ./osjitter -h

Isolating the last 3 cores on a 8 core system:

$ cat /proc/cmdline
[..] isolcpus=5-7 nohz=on nohz_full=5-7 rcu_nocbs=5-7 rcu_nocb_poll \
nowatchdog mce=ignore_ce acpi_irq_nobalance pcie_aspm=off tsc=reliable

This system is a Supermicro one (running Fedora 29) with an Atom CPU:

$ cat /proc/cpuinfo | grep model' name' | head -n 1
model name	: Intel(R) Atom(TM) CPU C3758 @ 2.20GHz

First OSjitter run:

$ ./osjitter  -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2200000  60240   60240        0       8065    283228653   0.005    60       22       3151    2989    4354    6047    7218    443376  9380037     220
   1  2200000  60192   60192        0       9809    216975033   0.004    60       22       2710    2339    3740    5314    6322     11774  4614206     432
   2  2200000  60199   60199        0       5942    180783353   0.003    60       22       2424    2219    3399    4847    7888     14611  1465586     223
   3  2200000  60193   60193        0       5465    171929486   0.003    60       22       2426    2236    3087    4246    6388     11487   592769     187
   4  2200000  60320   60320        0       6173    212338516   0.004    60       22       2548    2358    3468    5005    6280     40044  2262400     211
   5  2200000    156     156        0          1       576392   0.000    60       22       3681    2801    4044    4388   11667     12138    12286     428
   6  2200000    156     156        0          1       581260   0.000    60       22       3565    2788    3964    4270   12278     20279    28125     451
   7  2200000    126     126        0          1       450470   0.000    60       22       3703    2467    4003    4205    9163     11859    12198     352

=> The threads on the isolated CPUs are much less interrupted the other ones.

Move all interrupts away from the isolated CPUs:

# tuna -q '*' -c 0-4 -m -x

OSjitter:

$ ./osjitter  -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2200000  60342   60342        0       6207    272600031   0.005    60       22       3105    2980    4141    5898    7205    442155  4772690     144
   1  2200000  60334   60334        0       6483    196708372   0.003    60       22       2488    2293    3530    5014    6335     13491  4684815     236
   2  2200000  60330   60330        0       8479    211832782   0.004    60       22       2528    2296    3651    5269    9299     15708  5513140     347
   3  2200000  60256   60256        0       7973    237326578   0.004    60       22       2477    2261    3617    5155    7186     39479  5602172     325
   4  2200000  60280   60280        0       5149    197355746   0.003    60       22       2532    2345    3020    4026    6309     16298  2630389     175
   5  2200000      8       8        0          1        41371   0.000    60       22       3340    1869    8570   11288   11288     11288    11616    1470
   6  2200000      8       8        0          1        41025   0.000    60       22       3291    1706    8616   11429   11429     11429    11609    1585
   7  2200000     10      10        0          1        46852   0.000    60       22       2886    1927    8794   11968   11968     11968    12126     959

=> Even less interruptions on the isolated CPU's

Move all moveable kernel threads away from the isolated CPUs:

# tuna -U -t '*' -c 0-4 -m

OSjitter:

$ ./osjitter  -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2200000  60246   60246        0       4333    231374600   0.004    60       22       3177    3040    3595    4465   10924     29714   469030     134
   1  2200000  60403   60403        0       5965    198823307   0.003    60       22       2490    2274    3425    4865    6387     16643  4743847     229
   2  2200000  60445   60445        0       5020    186508000   0.003    60       22       2402    2172    2959    3740    5762     12846  1716645     209
   3  2200000  60490   60490        0      10195    234402816   0.004    60       22       2825    2308    4398    5358    6915    112854  3997080     576
   4  2200000  60276   60276        0       7274    212001750   0.004    60       22       2531    2328    3668    5061    5747     13550  6431210     275
   5  2200000      8       8        0          1        34188   0.000    60       22       3197    1765    5095    8923    8923      8923    11685    1114
   6  2200000      8       8        0          1        39910   0.000    60       22       3218    1616    8130   11231   11231     11231    11793    1601
   7  2200000      5       5        0          0        16998   0.000    60       22       2091    2079    8506    8506    8506      8506     8506     574

=> Isolated CPUs: Improvements in interruptions, few improvements in median, max and median absolute deviation (MAD).

Switch from throughput-performance based tuned profile to a latency-performance based one (i.e. disable CPU frequency scaling, longer stat interval, writeback cpubask etc.):

# tuned-adm profile gs-latency

OSjitter:

$ ./osjitter  -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2200000  60250   60250        0        686    213519597   0.004    60       22       3125    3008    3250    3323   13616     37892  1871887      97
   1  2200000  60223   60223        0      26628    287996052   0.005    60       22       3118    2914    6182    6266    7117     17085  5240030     777
   2  2200000  60241   60241        0      26289    272751612   0.005    60       22       3079    2889    6183    6260    6480      9952  1231324     728
   3  2200000  60193   60193        0        167    163954807   0.003    60       22       2360    2123    2470    2526    3210     13830  8119388     124
   4  2200000  60223   60223        0        120    161220610   0.003    60       22       2427    2231    2514    2566    3060     13410  1885120      99
   5  2200000      5       5        0          1        14843   0.000    60       21       2255    1897    6112    6112    6112      6112     6112     402
   6  2200000      5       5        0          0        17074   0.000    60       22       2144    1852    8859    8859    8859      8859     8859     389
   7  2200000      5       5        0          0        16665   0.000    60       22       1922    1808    8630    8630    8630      8630     8630     234

=> Isolated CPUs: less interruptions, less total interruptions, improvements in median, max and MAD

How it works

OSjitter creates a measurement thread for each selected CPU that polls the CPU's Time Stamp Counter (TSC). In each iteration the previous counter value is subtracted from the previous one and if that duration is above the threshold (default: 100 ns) it's counted as an interruption.

Since the 1990ies, x86 CPUs feature a TSC, which can be read with a special instruction from any user-space program. The TSC on relatively modern CPUs is supposed to run constant and reliable, i.e. even during CPU-frequency changes and power-saving state changes. That means that the TSC frequency (although constant) may be different from the base frequency of the CPU. Since the TSC is integrated into the CPU, can be accessed like a register (with low overhead) and has a high accuracy it's well suited for measuring even short interruptions.

When a program is interrupted by the operating system the TSC ticks continue and thus after the program execution continues (otherwise transparently to the program) it can derive how long it was interrupted by looking at the current TSC value.

The actual TSC frequency is required to convert TSC counts to nanoseconds. OSjitter obtains the TSC frequency from the kernel, i.e. from /sys/devices/system/cpu/cpu0/tsc_freq_khz (if available) or it parses it from journalctl --boot (relevant stackoverflow answer).

How to build

For most utilities:

$ make

The syscall benchmark:

$ git submodule update --init
$ mkdir build
$ cd build
$ CXXFLAGS='-Wall -O3 -g' cmake .. -DCMAKE_BUILD_TYPE=Release -DBENCHMARK_ENABLE_GTEST_TESTS=0 -GNinja
$ ninja

(or a similar cmake invocation)

Related Work

There is sysjitter (1.4, GPLv3) which also reads the TSC in a loop to detect external interruptions. Some differences are:

  • Sysjitter calibrates the TSC frequency against gettimeofday() whereas OSJitter just obtains the Kernel's TSC frequency (the Kernel is in a better position to calibrate the TSC frequency and Linux contains a well-engineered calibration logic including possible refinements after the first calibration)
  • Sysjitter just invokes the RDTSC instruction while OSjitter invokes RDTSC and RDTSCP in combination with fencing instructions
  • OSjitter uses ISO C atomic operations while Sysjitter uses GCC atomic intrinsics
  • In contrast to OSjitter, sysjitter doesn't allow to specify the scheduling class/priority of the measurement threads
  • OSjitter's output includes a measure for dispersion (MAD)
  • Besides TSC on x86, sysjitter also support reading a timestamp counter on POWER CPUs.

The Linux Kernel contains a hardware latency detector to check for interruptions caused outside of the operating system such as the System Management Mode (SMM). It also queries the TSC in a loop.

The SMM is triggered by System Management Interrupts (SMI) which are transparent to the kernel and can only be detected indirectly. An alternative to the TSC approach for detecting and measuring SMIs is to query CPU counters the SMI changes (relevant stackoverflow answer).

Cyclictest measures OS latency by setting timers and comparing the actual sleep time with the configured one.

Erik Rigtorp has published hiccups to measure 'system induced jitter', ipc-bench as a ping-pong latency benchmark and c2clat to measure inter-core latency. The hiccups repository references osnoise, an OS jitter detector built into the Linux kernel which appeared in Linux 5.14 or so that complements the above mentioned hardware latency detector.

Pingpong Results

The doc directory contains some example Pingpong results for different configurations.

The results for condition variable, semaphore and futex are quite similar because, on Linux, condition variables and semaphores are implemented in terms of futex.

Notifying via a traditional UNIX pipe is more expensive than using a futex but it's the same order of magnitude.

Inserting a PAUSE instruction while spinning on an atomic variable increases the median absolute deviation (MAD) just a little bit, but yields similar median while reducing the number of executed instructions.

As documented in the kernel documentation, comparing the results with and without full_hz= show how this features increases context-switch overhead and thus increases latency for the syscall methods (e.g. by 0.6 us or so in the median, a few us in the other percentiles and maximum). On the other hand, more context-switch overhead isn't relevant for spinning on an atomic variable, thus, full_hz= really pays off for this use-case because the process is interrupted much less.