feat(bpf): use time window for bpf sampling to replace per call based sampling #1723

Open
wants to merge 1 commit into
base: main

@rootfs rootfs commented Aug 22, 2024

Per @vimalk78's finding, per-call-based bpf sampling shows very large cpu time variations.

This PR changes to time-window-based sampling. The cpu time is much more consistent and close to the probing results, while the overhead is reduced even further.

Disclaimer: some of the code is generated by ChatGPT.

| Active Time (ms) | Idle Time (ms) | Average kepler_sched_switch_trace bpf runtime (ns) |
|---|---|---|
| 5 | 95 | 400 |
| 20 | 80 | 875 |
| 50 | 50 | 1500 |
| 80 | 20 | 2100 |
| 1000 (default) | 0 | 2500 |

github-actions bot commented Aug 22, 2024

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: This pull request introduces significant changes to the BPF (Berkeley Packet Filter) implementation, replacing per-call sampling with time window-based sampling. This update aims to reduce CPU time variation and overhead. Additionally, a minor internal change is made to the dcgm.Init() function call.

Key Modifications:

  1. Time window-based sampling for BPF: The new approach reduces CPU time variation and overhead by introducing global parameters for tracking and non-tracking periods, along with two BPF maps to manage the tracking state.
  2. Updated do_kepler_sched_switch_trace function: This function now checks a tracking flag and updates the sampling state based on elapsed time.
  3. Internal change to dcgm.Init() function call: The function now uses config.GetDCGMHostEngineEndpoint() instead of config.DCGMHostEngineEndpoint.
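
The time-window logic summarized above can be sketched in plain C roughly as follows. This is not the PR's code: the `sample_window` struct, the `window_should_track` name, and the nanosecond parameters are assumptions made for illustration. A real eBPF program would keep this state in a map and read the clock with `bpf_ktime_get_ns()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sampling state; field names are illustrative, not from the PR. */
struct sample_window {
    bool tracking;         /* true while inside the active window */
    uint64_t window_start; /* timestamp (ns) at which the current window began */
};

/*
 * Decide whether the event at now_ns should be recorded. When the current
 * window (active or idle) has run its full length, flip to the other one.
 */
static bool window_should_track(struct sample_window *w, uint64_t now_ns,
                                uint64_t active_ns, uint64_t idle_ns)
{
    /* Assumption: an idle length of 0 means "never stop tracking". */
    if (idle_ns == 0)
        return true;

    uint64_t limit = w->tracking ? active_ns : idle_ns;

    if (now_ns - w->window_start >= limit) {
        w->tracking = !w->tracking;
        w->window_start = now_ns;
    }
    return w->tracking;
}
```

With a 20 ns active and 80 ns idle window (tiny values chosen only to keep the numbers readable), events at 10 ns are tracked, events from 20 ns to 99 ns are skipped, and tracking resumes at 100 ns.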

Impact on Codebase:

  • The external interface remains unchanged, but the implementation of sampling and tracking has been significantly altered.
  • The code generated by ChatGPT may require further review.

Observations and Suggestions:

  • It would be beneficial to include additional testing to ensure the new time window-based sampling approach does not introduce any regressions or performance issues.
  • Consider adding more detailed comments or documentation to explain the reasoning behind the changes and how they impact the codebase.
  • Review the code generated by ChatGPT to ensure it meets the project's coding standards and best practices.

@rootfs
Contributor Author

rootfs commented Aug 22, 2024

converting to draft, pending test results.

@rootfs rootfs marked this pull request as draft August 22, 2024 15:22
@vimalk78
Collaborator

Test results:

Below is a comparison of two keplers: one with the sampling window enabled (100 ms active, 1000 ms idle), the other without sampling.

We can see that on bare metal, the two keplers produce very close values for package power and core power, because the ratios of bpf cpu time with sampling are very close to those without sampling.

[image: process cpu time, exhaustive vs sampling]

[image: process core joules, exhaustive vs sampling]

[image: process package joules, exhaustive vs sampling]

[image: kepler cpu time, exhaustive vs sampling]

As expected, the kepler with sampling consumes less cpu time and fewer cpu instructions than the kepler without sampling.

@rootfs rootfs force-pushed the sample-window branch 6 times, most recently from 113e3bc to d88e8b8 Compare September 16, 2024 19:53
@rootfs
Contributor Author

rootfs commented Sep 16, 2024

@dave-tucker @sthaha @marceloamaral PTAL, thanks

@rootfs rootfs marked this pull request as ready for review September 16, 2024 20:30
@marceloamaral
Collaborator

@rootfs @dave-tucker I’m concerned about the impact on VM power estimation. If we're undercounting CPU time, the power consumption will be underestimated as well.

To address this, we need to extrapolate the results, similar to how Linux handles counter multiplexing. For instance, if we collected data for only 1 second out of a 5-second window, we should multiply the results by 5 to estimate for the full 5 seconds.

All we need to do is track the collection interval and adjust the results accordingly.
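
A minimal sketch of that scaling, assuming the only inputs are the sampled value and the configured active and idle window lengths. The function name `extrapolate_cpu_time` is hypothetical, and this simple linear scale-up assumes the load during the idle window resembles the load during the active window.

```c
#include <stdint.h>

/*
 * Scale a value measured during the active window up to the full
 * active + idle period: sampled * (active + idle) / active.
 * Note: the multiplication can overflow for very large inputs.
 */
static uint64_t extrapolate_cpu_time(uint64_t sampled,
                                     uint64_t active_ms, uint64_t idle_ms)
{
    if (active_ms == 0)
        return 0;       /* nothing was sampled, so there is nothing to scale */
    if (idle_ms == 0)
        return sampled; /* no idle window, the value is already complete */
    return sampled * (active_ms + idle_ms) / active_ms;
}
```

For the example above, 200 ms of CPU time collected during 1 second out of a 5-second window gives `extrapolate_cpu_time(200, 1000, 4000)`, which is 1000 ms.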

Collaborator

@dave-tucker dave-tucker left a comment

Some comments on the code.

I'm not going to repeat myself, so here is my canned response: #1685 (review)

@vimalk78 I'm not surprised the totals or the estimation are pretty much the same.
But can you run the same test again and show the distribution of CPU time for each process on the system?

My bet is we will find that per-process cpu time, cpu instructions, etc. are totally off.
Comparing to top or scaphandre etc. would yield very different results.

Given we only care about cpu time (at the moment), the most efficient option is to not use eBPF at all and just read utime from /proc/$pid/stat
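
For reference, reading utime that way could look roughly like the sketch below. The helper names `parse_utime` and `read_utime` are hypothetical. utime is field 14 of the stat line and is expressed in clock ticks (divide by `sysconf(_SC_CLK_TCK)` to get seconds). Parsing starts after the last `)` because the command name in field 2 may itself contain spaces or parentheses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract utime (field 14, in clock ticks) from one /proc/<pid>/stat line.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
static int parse_utime(const char *stat_line, unsigned long long *utime)
{
    /* Skip "pid (comm)"; comm may contain ')' so search from the right. */
    const char *p = strrchr(stat_line, ')');
    if (p == NULL)
        return -1;

    /* Skip fields 3..13: state, ppid, pgrp, session, tty_nr, tpgid,
     * flags, minflt, cminflt, majflt, cmajflt. Then read utime. */
    if (sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %llu",
               utime) != 1)
        return -1;
    return 0;
}

/* Read utime for a live process. Returns 0 on success, -1 on failure. */
static int read_utime(int pid, unsigned long long *utime)
{
    char path[64];
    char line[1024];

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof(line), f);
    fclose(f);
    return ok != NULL ? parse_utime(line, utime) : -1;
}
```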


// BPF map to track whether we are in the tracking period or not
struct {
__uint(type, BPF_MAP_TYPE_ARRAY);
Collaborator

consider using a PERCPU_ARRAY since this would have the effect of making the time window per-cpu also, which may be desirable.

Contributor Author

changed to PERCPU_ARRAY

counter_sched_switch--;
// Retrieve tracking flag and start time
u32 key = 0;
u32 *tracking_flag = bpf_map_lookup_elem(&tracking_flag_map, &key);
Collaborator

Given that map lookups are the most expensive part of the eBPF code, it would be better to reduce them where possible. There's no reason to store tracking_flag in a map as far as I can tell, since its value doesn't need to persist between invocations.

Contributor Author

There is some thought of having the kepler userspace program set the tracking flag. The actual mechanism is not clear yet. I will remove this map if that turns out to be a dead end.
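
One way the flag could be dropped from the map entirely, sketched here under the assumption that the windows may be anchored to the clock itself rather than to a stored start time: derive the tracking state from the current timestamp alone. The function name `in_active_window` is hypothetical. With this scheme every CPU shares the same window phase, which may or may not be desirable.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Stateless check: the timeline is divided into repeating periods of
 * active_ns + idle_ns, and the first active_ns of each period is tracked.
 * No flag or start time needs to be stored between invocations.
 */
static bool in_active_window(uint64_t now_ns,
                             uint64_t active_ns, uint64_t idle_ns)
{
    uint64_t period = active_ns + idle_ns;

    /* Assumption: both lengths at 0 means window sampling is off. */
    if (period == 0)
        return true;
    return (now_ns % period) < active_ns;
}
```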

CPUArchOverride = getConfig("CPU_ARCH_OVERRIDE", "")
MaxLookupRetry = getIntConfig("MAX_LOOKUP_RETRY", defaultMaxLookupRetry)
BPFSampleRate = getIntConfig("EXPERIMENTAL_BPF_SAMPLE_RATE", 0)
BPFActiveSampleWindowMS = getIntConfig("EXPERIMENTAL_BPF_ACTIVE_SAMPLE_WINDOW_MS", 1000)
Collaborator

Why are the default values in the code 20 and 80, but here they are 1000 and 0?
If this is for coexistence with the other sampling feature it may be easier to set them all to 0 and update the eBPF code to only evaluate this code path if both ACTIVE and IDLE values are > 0.

Contributor Author

fixed
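
The gating suggested in this thread could look roughly like the sketch below. The enum, the function name `pick_sampling_mode`, and the precedence of window sampling over the per-call sample rate are all assumptions for illustration. The thread only states that the window path should be evaluated when both the ACTIVE and IDLE values are > 0.

```c
/* Hypothetical sampling modes; not taken from the PR. */
enum sampling_mode {
    SAMPLE_ALL,        /* no sampling, every event is recorded */
    SAMPLE_PER_CALL,   /* existing per-call sampling */
    SAMPLE_TIME_WINDOW /* time-window sampling */
};

/*
 * With all three settings defaulting to 0, nothing is sampled away.
 * Window sampling is chosen only when both window lengths are positive.
 */
static enum sampling_mode pick_sampling_mode(int sample_rate,
                                             int active_ms, int idle_ms)
{
    if (active_ms > 0 && idle_ms > 0)
        return SAMPLE_TIME_WINDOW;
    if (sample_rate > 0)
        return SAMPLE_PER_CALL;
    return SAMPLE_ALL;
}
```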

@rootfs
Contributor Author

rootfs commented Sep 17, 2024

@marceloamaral good point! At the moment, the sampled cpu time is not extrapolated. We can consider different scaling factors. One approach in my plan is to find the max and min cpu time from each sample, and use the mean cpu time to extrapolate over the entire active + idle duration. This would account for variable cpu utilization conditions. If that proves effective, we can then discuss removing the EXPERIMENTAL prefix from these params. wdyt?
