Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeline with engine instrumentation #973

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

lukego
Copy link
Member

@lukego lukego commented Jul 22, 2016

This is the implementation of Timeline (#916):

  • Adds the core.timeline module for efficient logging to binary ring buffers.
  • Adds log messages to core.engine to keep track of engine events.

Controversy?

This PR is a relatively large change. It introduces a new mechanism for defining detailed operational events for logging purposes and this is relatively verbose. It also introduces the practice of calling logging hooks even in relatively performance-sensitive code paths which could be hazardous (although I have already taken considerable pains to make this efficient). Reasonable people may have objections.

Implementation notes

The log is a 64MB (1M x 64B) shared memory object. This can be accessed as /var/run/snabb/$pid/engine/timeline while the Snabb is running and will be automatically removed on shutdown (thanks to #930).

The log compresses well and in the near future we should follow up with a way for the Hydra CI to archive timeline logs for all tests (e.g. via a hook in the Snabb shutdown/cleanup process that creates a compressed tarball somewhere).

This implementation automatically randomizes the log level on each breath. Most breaths (99%) are not logged at all; some breaths (1%) have their start and end logged, and a few breaths (0.1%) are logged in more detail ("app level"), while extremely few (0.01%) are logged in great detail ("packet level"). The intention is to get a representative sample of log messages and for each sample to be a whole breath that is logged at a consistent level.

Example log excerpt

The complete timeline log is up to a million lines long. Here is a tiny snippet that shows ~10 breaths. One of these is logged in detail ("app level") while the others with low detail (start and end only). The events each include parameters that could be used to calculate interesting metrics (cycles per packet, etc).

numa core       cycles category         message
---- ---- ------------ ---------------- -------
0    3        15449112 engine           start breath(1418481): total packets(64180039) bytes(100406899796) etherbits(807876161304)
0    3          341564 engine           end breath(1418481): processed packets(242) bytesPerPacket(644)
0    3         7143964 engine           start breath(1418509): total packets(64187365) bytes(100411625144) etherbits(807914491560)
0    3          249408 engine           end breath(1418509): processed packets(242) bytesPerPacket(644)
0    3        27673240 engine           start breath(1418614): total packets(64214647) bytes(100429219436) etherbits(808057210200)
0    3          244564 engine           end breath(1418614): processed packets(242) bytesPerPacket(644)
0    3         9542868 engine           start breath(1418651): total packets(64224261) bytes(100435419968) etherbits(808107506664)
0    3          245384 engine           end breath(1418651): processed packets(242) bytesPerPacket(644)
0    3        76336068 engine           ran timers
0    3            1800 engine           start breath(1418940): total packets(64294376) bytes(100480636610) etherbits(808474288080)
0    3             264 engine           initialized breath
0    3            4156 engine           called app A_NIC (type: apps.intel.intel_app)
0    3           14404 engine           called app B_NIC (type: apps.intel.intel_app)
0    3           39008 engine           called app B_Virtio (type: apps.vhost.vhost_user)
0    3           90292 engine           called app A_Virtio (type: apps.vhost.vhost_user)
0    3             220 engine           pulled input packets
0    3            4504 engine           called app B_NIC (type: apps.intel.intel_app)
0    3          105588 engine           called app B_Virtio (type: apps.vhost.vhost_user)
0    3           13732 engine           called app A_NIC (type: apps.intel.intel_app)
0    3            7864 engine           called app A_Virtio (type: apps.vhost.vhost_user)
0    3           27460 engine           pushed output packets
0    3             840 engine           pushed output packets
0    3             220 engine           end breath(1418940): processed packets(242) bytesPerPacket(644)
0    3        40522220 engine           start breath(1419083): total packets(64331528) bytes(100504588274) etherbits(808668576336)
0    3          333708 engine           end breath(1419083): processed packets(317) bytesPerPacket(646)
0    1        41323384 engine           start breath(1419233): total packets(64368118) bytes(100528182974) etherbits(808859968416)
0    1          357952 engine           end breath(1419233): processed packets(242) bytesPerPacket(644)
0    1        65394108 engine           start breath(1419463): total packets(64428194) bytes(100566922310) etherbits(809174208576)
0    1          247936 engine           end breath(1419463): processed packets(244) bytesPerPacket(639)
0    1        18768960 engine           start breath(1419530): total packets(64445616) bytes(100578159890) etherbits(809265363600)
0    1          277872 engine           end breath(1419530): processed packets(242) bytesPerPacket(644)

Note that the first breath is #1418481 and the last breath is #1419530 and so during this interval there were 1049 breaths executed but only these ones were sampled for the log. (The more breaths we include in the sample the faster the log wraps around.)

How to use it

This PR introduces a discipline of registering detailed log messages ahead of time. This is to make the storage efficient (reference predefined strings) and the logging efficient (reuse a dedicated closure for each log message).

Here is an example of how log messages for sleeping and waking up can be defined:

local event_sleep_idle = define_event('trace', [[sleeping $usec on idle                                                                                                                       
The engine requests that the kernel suspend this process for a period                                                                                                                         
of microseconds in order to reduce CPU utilization because idleness                                                                                                                           
has been detected (a breath in which no packets were processed.)]])

local event_wakeup = define_event('trace', [[wakeup from sleep                                                                                                                                
The engine resumes operation after sleeping voluntarily.]])

and here is how those example messages are logged:

event_sleep_idle(sleep)
C.usleep(sleep)
event_wakeup()

and then the log will contain messages like this:

numa core       cycles category         message
---- ---- ------------ ---------------- -------
0    3             404 engine           sleeping usec(100) on idle
0    3          535156 engine           wakeup from sleep
0    0             272 engine           sleeping usec(100) on idle
0    0          533624 engine           wakeup from sleep
0    0             360 engine           sleeping usec(100) on idle
0    0          533640 engine           wakeup from sleep
0    0             372 engine           sleeping usec(100) on idle
0    0          534732 engine           wakeup from sleep

which are interesting for a couple of reasons. First we can look at the cycles (delta) column to see how long we really sleep for and compare this with the time that we requested. Second we can look at the core column to see whether we have woken up on the same CPU core as we slept on.

The idea is that the log would include many such events of different kinds and a developer can sift through to answer questions. The log format is also relatively easy to process e.g. if one would want to plot the distribution of actual cycles spent sleeping vs requested sleeping time to estimate the reliability of usleep.

Formatting timeline logs

The log excerpts above were created with a command like:

sudo ./snabb snsh -e 'require("core.timeline").dump("./timeline")'

and to remove the verbosity (complete log message text) I added `| grep '^[0-9]'.

This could be greatly improved in the future e.g. with a snabb log program or an extension of snabb top to access timelines. There is also a graphical timeline log analyzer in the works.

lukego and others added 29 commits March 30, 2016 07:19
I spotted two typos and fixed them (entires -> entries).

And a question: In the example, shouldn't the string `example` appear somewhere in the resulting log ("The log now contains there entries")?
Typo fixes, and a question
Added category to example log output (thanks for feedback @sleinen).

Added functions: priority(), save(), dump()

Fixed formatting (more word wrap)
Updated the timeline implementation to match the documented API.

This is still a work in progress:

The messages printed by the selftest all have zeros for arguments for
unknown reasons;

The overhead of logging has doubled for unknown reasons, possibly due
to allocating with core.shm and putting a performance-critical
cacheline on a 4096-aligned address that may trigger associativity
problems (wild speculation, have to dig in).
I realized that when we take the timestamp with RDTSCP this is also
loading the current CPU core number and NUMA node into ECX. This patch
includes that register in the log dump. This seems useful so that we can
account for scheduling issues from the log.

Strictly speaking the value in ECX is the contents of the TSC_AUX
register but it seems that we can rely on Linux to load the core ID +
NUMA node there. Here is where I got the idea:
http://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on
The string pattern matching for detecting arguments ($arg) was wrong and
this caused all args to be logged as 0.
Show core number and NUMA node.

Write slightly better log messages for selftest.

Simplify the logging loop in the selftest to simply log three messages
per iteration. This way the timestamps will indicate the logging
overhead. (Previous version had some funny branches that could add a few
dozen cycles to the timestamps which is a bit distracting.)
Fixed an off-by-one error (braino) in choosing which timestamps to
calculate the delta from.

Make the cycle column wider to better accommodate large values.
The selftest function now executes a few loops to see how long it takes
to log a million events that are disabled by priority.

Log excerpt:

    numa core      -cycles category         message
    0    3         5014856 selftest         invoked many events(1000000)
    0    3         5025592 selftest         invoked many events(1000000)
    0    3         5176004 selftest         invoked many events(1000000)

So looks like five cycles per event is the ballpark cost (including
the loop that makes the individual FFI calls).
The events are currently always enabled. This needs to be optimized.

Goal for now is simply to have interesting information in the log.
Use shm.create() instead of shm.map().
Required after update in  9fe397e.
Sample different priorities (log levels) on different breaths. Try to
always include interesting details in the log but without wrapping too
quickly.
end breath now reports number of packets and bits freed.
The timeline text dump now formats the log starting from the first
message and continuing towards the last.

The previous behavior was to print the log backwards but this was
confusing and ill-considered.
Merged conflict in core/app.lua
Requires a math.max() on an argument that is otherwise FFI cdata.
@eugeneia eugeneia self-assigned this Jul 25, 2016
@eugeneia
Copy link
Member

eugeneia commented Jul 25, 2016

🎆 Looking great! Can't wait to look at some cool plots. :-)

Depending on whether #972 gets accepted, I would adapt the API of core.timeline to provide create and open (instead of new) and register it as the type timeline. Makes sense? It would probably also make sense to add a snabb top -t option to pretty-print regions of timeline objects.

Super minor nitpick, question: When formatting optional parameters in Markdown we used *[param]*, but I prefer to use [*param*] (and that is what I adopted for #972). E.g. italic marks the symbol, braces are semantic hints. Do you mind if I change that to the latter style when merging this?

Edit: I am now realizing why we used *[param]*, the alternative renders to a link... hmm.
Edit2: Apparently only in GitHub comments.

@lukego
Copy link
Member Author

lukego commented Jul 26, 2016

@eugeneia That's okay for me.

Hey there are a few outstanding issues here that I would like to resolve for this release but maybe we can take on next. The main thing is that I want Hydra to snapshot the shm folder for relevant process(es) and archive this for reference. This way we would have the timelines recorded for later inspection. I wonder what is the least-hacky way to do this? Basically we need to either run a hook before we delete the shm dir (which could be used to copy a snapshot somewhere) or we need to create the shm directory in a special location and then not delete it on termination.

This branch actually contains a couple of hacks in this direction that are not very effective and likely should be rolled back. One is adding the env var SNABB_SHM_ROOT that can force a new location for the shm directory. The other is in snabbnfv.traffic to explicitly save the timeline in $pwd at the end of test. I actually pushed reverts for these but cheekly reset that away quickly because I am not sure if they are part of the solution or not.

Like I say these issues could be resolved on next so needn't block merge of this branch. I do really want to pick a solution for preserving the shm directory for Hydra though.

@eugeneia
Copy link
Member

I think adding SNABB_SHM_ROOT is an appropriate solution, combined with SNABB_DEBUG=yes (which causes the runtime directory to persist after exit) it should do what we need.

@lukego
Copy link
Member Author

lukego commented Jul 27, 2016

I notice that we already have _G.developer_debug inhibiting removal of the shm directory. That is probably sufficient for getting this hooked into Hydra. Good enough for now.

Generally I am uncomfortable with debug modes like _G.developer_debug and SNABB_DEBUG because I don't understand their consequences. Do they reduce performance? Leave files on the file system? Create verbose output? Likely I willl want some of these but not others, e.g. if I am benchmarking then I want the state kept on the filesystem for examination but I don't want to enable debug features that will impact performance. So a global debug=yes/no flag seems unlikely to do what I mean.

Have to make a snabb.conf with a well-defined format e.g. YANG schema. One step at a time...

lukego added a commit to lukego/snabb that referenced this pull request Jul 27, 2016
takikawa pushed a commit to takikawa/snabb that referenced this pull request Sep 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants