
Provide example kubernetes manifest #661

Merged
merged 1 commit into intel:master from k8s-deployment on Feb 14, 2024

Conversation

@jcpunk (Contributor) commented Jan 22, 2024

This provides an example of how you might deploy this in Kubernetes.

It includes node selectors using labels defined by the Node Feature Discovery SIG and a PodMonitor as defined by the Prometheus Operator project.
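For context, here is a minimal sketch of how those two pieces fit together. This is not the manifest added by this PR; the image and port values are placeholders/assumptions, while the intel-pcm namespace and the NFD vendor_id label match what the functional test later in this thread uses.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pcm
  namespace: intel-pcm
spec:
  selector:
    matchLabels:
      app: pcm
  template:
    metadata:
      labels:
        app: pcm
    spec:
      # Run only on nodes that Node Feature Discovery has labelled as having an Intel CPU;
      # verify the exact label value with the kubectl get node command from the functional test below.
      nodeSelector:
        feature.node.kubernetes.io/cpu-model.vendor_id: Intel
      containers:
        - name: pcm
          image: pcm                  # placeholder: substitute the PCM container image you use
          ports:
            - name: metrics
              containerPort: 9738     # assumed default pcm-sensor-server port
---
# PodMonitor (Prometheus Operator CRD) so Prometheus discovers the DaemonSet pods
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pcm
  namespace: intel-pcm
spec:
  selector:
    matchLabels:
      app: pcm
  podMetricsEndpoints:
    - port: metrics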

@rdementi (Contributor) commented:

Thanks a lot for the patch. Please let me find a reviewer.

@ppalucki (Contributor) commented Jan 26, 2024

I have two questions:

  1. Why do you use privileged: true? (Is it only because it is suggested in the Docker how-to?) I tested it without it (but I had to add PCM_NO_MSR to the environment), like this:

        - name: PCM_NO_MSR
          value: "1"

and it worked in my environment (PCM used the Linux perf interface). Was there any other reason to use privileged? Can you check whether it works for you without privileged? (See the sketch after this list.)

Without privileged we could put less strict requirements on the namespace (with labels). I just want to follow the least-privilege principle if it doesn't break any functionality.

  2. Why hostNetwork: true? Is it only to simplify the configuration of Prometheus discovery with the PodMonitor, or is there another reason?
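A minimal sketch of what the non-privileged variant from point 1 could look like (the container name and image are placeholders; depending on kernel settings such as perf_event_paranoid, additional capabilities or host mounts may still be required):

      containers:
        - name: pcm
          image: pcm                  # placeholder image
          env:
            - name: PCM_NO_MSR        # tell PCM not to use the MSR driver
              value: "1"
          securityContext:
            privileged: false         # rely on the Linux perf interface instead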

FYI: I'm going on vacation for a week; I'll come back to review in the second week of February, so no rush.

@jcpunk (Contributor, Author) commented Jan 26, 2024

I did use the privileged flag because of the Docker documentation. I'd be happy to drop it, but I don't really understand the risks. I do seem to get data back with it disabled, so if you think it would be safe, I'd be happy to drop it.

I set hostNetwork: true for folks who want to scrape this from an external Prometheus. I was trying to think of a way to make it easy to either use prometheus-operator or to do your own thing. I'd be fine dropping it.
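One possible middle ground instead of hostNetwork is a plain ClusterIP Service over the DaemonSet pods, which an external Prometheus (or the PodMonitor) can target. The names and port below are assumptions carried over from the earlier sketch:

apiVersion: v1
kind: Service
metadata:
  name: pcm
  namespace: intel-pcm
spec:
  selector:
    app: pcm
  ports:
    - name: metrics
      port: 9738            # assumed pcm-sensor-server port
      targetPort: metrics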

@ppalucki (Contributor) left a review:

I would be ready to accept this as is once we drop privileged and hostNetwork; we just need to be sure it works without functional issues in a bare kind-based testing environment.

pcm-kubernetes.yaml: inline review comments (resolved)
@jcpunk (Contributor, Author) commented Feb 12, 2024

In theory I've made the changes you requested. Does this look better?

@jcpunk requested a review from @ppalucki on February 12, 2024 17:27
@ppalucki (Contributor) left a review:

It looks definitely better and it works flawlessly! :) So LGTM.

Here is the functional test that can be further used for validation:

# Create cluster
kind create cluster
kind export kubeconfig

# Deploy NodeFeatureDiscovery
kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.15.1
kubectl get node -o jsonpath='{.items[0].metadata.labels.feature\.node\.kubernetes\.io\/cpu\-model\.vendor_id}{"\n"}'

# Deploy prometheus for PodMonitor
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
kubectl get sts prometheus-prometheus-kube-prometheus-prometheus

# Deploy PCM
kubectl apply -f pcm-kubernetes.yaml

# Verify PCM works as expected
kubectl -n intel-pcm get daemonset
kubectl -n intel-pcm get pods
podname=$(kubectl -n intel-pcm get pods -o jsonpath='{.items[0].metadata.name}')
kubectl proxy &
curl -Ls http://127.0.0.1:8001/api/v1/namespaces/intel-pcm/pods/$podname/proxy/metrics | grep DRAM_Writes
promtool query instant http://127.0.0.1:8001/api/v1/namespaces/default/services/prometheus-kube-prometheus-prometheus:http-web/proxy 'avg by(__name__) ({job="pcm"})'

and we get:

CStateResidency => 0.09090909090909094 @[1707901856.957]
Clock_Unhalted_Ref => 1010026077.3913049 @[1707901856.957]
Clock_Unhalted_Thread => 1295730425.8695648 @[1707901856.957]
DRAM_Joules_Consumed => 0 @[1707901856.957]
DRAM_Reads => 3600814506.6666665 @[1707901856.957]
DRAM_Writes => 1974366592 @[1707901856.957]
Embedded_DRAM_Reads => 0 @[1707901856.957]
Embedded_DRAM_Writes => 0 @[1707901856.957]
Incoming_Data_Traffic_On_Link_0 => 689786624 @[1707901856.957]
Incoming_Data_Traffic_On_Link_1 => 689454432 @[1707901856.957]
Incoming_Data_Traffic_On_Link_2 => 0 @[1707901856.957]
Instructions_Retired_Any => 749013885.5739133 @[1707901856.957]
Invariant_TSC => 432975372048881700 @[1707901856.957]
L2_Cache_Hits => 3531524.973913045 @[1707901856.957]
L2_Cache_Misses => 2334387.130434784 @[1707901856.957]
L3_Cache_Hits => 1325323.1739130428 @[1707901856.957]
L3_Cache_Misses => 627863.4000000003 @[1707901856.957]
L3_Cache_Occupancy => 0 @[1707901856.957]
Local_Memory_Bandwidth => 0 @[1707901856.957]
Measurement_Interval_in_us => 14507400443881 @[1707901856.957]
Memory_Controller_IO_Requests => 0 @[1707901856.957]
Number_of_sockets => 2 @[1707901856.957]
OS_ID => 55.499999999999986 @[1707901856.957]
Outgoing_Data_And_Non_Data_Traffic_On_Link_0 => 1843333122.5 @[1707901856.957]
Outgoing_Data_And_Non_Data_Traffic_On_Link_1 => 1849219231.5 @[1707901856.957]
Outgoing_Data_And_Non_Data_Traffic_On_Link_2 => 0 @[1707901856.957]
Package_Joules_Consumed => 0 @[1707901856.957]
Persistent_Memory_Reads => 0 @[1707901856.957]
Persistent_Memory_Writes => 0 @[1707901856.957]
RawCStateResidency => 89486131.66409859 @[1707901856.957]
Remote_Memory_Bandwidth => 0 @[1707901856.957]
SMI_Count => 0 @[1707901856.957]
Thermal_Headroom => -2147483648 @[1707901856.957]
Utilization_Incoming_Data_Traffic_On_Link_0 => 0 @[1707901856.957]
Utilization_Incoming_Data_Traffic_On_Link_1 => 0 @[1707901856.957]
Utilization_Incoming_Data_Traffic_On_Link_2 => 0 @[1707901856.957]
Utilization_Outgoing_Data_And_Non_Data_Traffic_On_Link_0 => 0 @[1707901856.957]
Utilization_Outgoing_Data_And_Non_Data_Traffic_On_Link_1 => 0 @[1707901856.957]
Utilization_Outgoing_Data_And_Non_Data_Traffic_On_Link_2 => 0 @[1707901856.957]

PS: the above test was run on an Intel(R) Xeon(R) Platinum 8180 CPU. For VM-based hosts we will have issues depending on the VM type (e.g. we may need to comment out the MCFG/sys-acpi volume as described in FAQ Q11).
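Sketched as it might appear in the manifest (the exact volume name and structure here are assumptions on my part; FAQ Q11 is the authoritative reference):

      volumes:
        # Comment this out on VMs that do not expose an MCFG ACPI table (see FAQ Q11):
        # - name: sys-acpi
        #   hostPath:
        #     path: /sys/firmware/acpi/tables/MCFG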

@rdementi (Contributor) left a review:

Thanks a lot!

@rdementi merged commit 1932047 into intel:master on Feb 14, 2024
30 checks passed
@jcpunk deleted the k8s-deployment branch on February 14, 2024 19:31