Failed to pull image "kepler_model_server" #888

Closed
tobby-yuan opened this issue Aug 25, 2023 · 9 comments
@tobby-yuan
tobby-yuan commented Aug 25, 2023

What happened?

Hi everyone, I installed Kepler using manifests with the ESTIMATOR_SIDECAR_DEPLOY option (command: make build-manifest OPTS="ESTIMATOR_SIDECAR_DEPLOY").
Then I deployed Kepler (command: kubectl apply -f _output/generated-manifest/deployment.yaml).
However, I encountered the following error:

# kubectl get pods -A
NAMESPACE     NAME                                                         READY   STATUS         RESTARTS   AGE
kepler        kepler-exporter-9s4bk                                        1/2     ErrImagePull   0          18s

The following is the description of this pod:

# kubectl describe pods -n kepler kepler-exporter-9s4bk
Name:         kepler-exporter-9s4bk
Namespace:    kepler
Priority:     0
Node:         understudent/10.0.10.202
Start Time:   Fri, 25 Aug 2023 04:37:40 +0000
Labels:       app.kubernetes.io/component=exporter
              app.kubernetes.io/name=kepler-exporter
              controller-revision-hash=d46fbb7bc
              pod-template-generation=1
              sustainable-computing.io/app=kepler
Annotations:  <none>
Status:       Pending
IP:           10.244.0.71
IPs:
  IP:           10.244.0.71
Controlled By:  DaemonSet/kepler-exporter
Containers:
  kepler-exporter:
    Container ID:  docker://bc025ac92c05336c6f1b041071da78e70d7883a63cc26ba56ec86466188aa15a
    Image:         quay.io/sustainable_computing_io/kepler:latest
    Image ID:      docker-pullable://quay.io/sustainable_computing_io/kepler@sha256:e80cfa22bf41280d0663ac490a28624d4436ae4e56609a0ed27d1bcd82a52b75
    Port:          9102/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      -c
    Args:
      until [ -e /tmp/estimator.sock ]; do sleep 1; done && /usr/bin/kepler -v=1 -kernel-source-dir=/usr/share/kepler/kernel_sources -redfish-cred-file-path=/etc/redfish/redfish.csv
    State:          Running
      Started:      Fri, 25 Aug 2023 04:37:44 +0000
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
    Environment:
      NODE_IP:     (v1:status.hostIP)
      NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /etc/redfish from redfish (ro)
      /lib/modules from lib-modules (rw)
      /proc from proc (rw)
      /sys from tracing (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5n76j (ro)
  estimator:
    Container ID:
    Image:         kepler_model_server
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      python3.8
    Args:
      -u
      src/estimate/estimator.py
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/kepler/kepler.config from cfm (ro)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5n76j (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  Directory
  tracing:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:  Directory
  cfm:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kepler-cfm
    Optional:  false
  redfish:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redfish-4kh9d7bc7m
    Optional:    false
  kube-api-access-5n76j:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  32s                default-scheduler  Successfully assigned kepler/kepler-exporter-9s4bk to understudent
  Normal   Pulling    32s                kubelet            Pulling image "quay.io/sustainable_computing_io/kepler:latest"
  Normal   Pulled     30s                kubelet            Successfully pulled image "quay.io/sustainable_computing_io/kepler:latest" in 2.560478285s
  Normal   Created    30s                kubelet            Created container kepler-exporter
  Normal   Started    29s                kubelet            Started container kepler-exporter
  Normal   BackOff    26s                kubelet            Back-off pulling image "kepler_model_server"
  Warning  Failed     26s                kubelet            Error: ImagePullBackOff
  Normal   Pulling    11s (x2 over 29s)  kubelet            Pulling image "kepler_model_server"
  Warning  Failed     9s (x2 over 27s)   kubelet            Failed to pull image "kepler_model_server": rpc error: code = Unknown desc = Error response from daemon: pull access denied for kepler_model_server, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
  Warning  Failed     9s (x2 over 27s)   kubelet            Error: ErrImagePull

How can I solve it?

What did you expect to happen?

The "kepler_model_server" image should be pulled successfully.

How can we reproduce it (as minimally and precisely as possible)?

Install Kepler using manifests with the ESTIMATOR_SIDECAR_DEPLOY option, then apply the generated deployment.yaml.

Anything else we need to know?

No response


tobby-yuan added the kind/bug label on Aug 25, 2023
@sunya-ch
Collaborator

Currently, the flow builds only the version tag (v0.6), but I will also add the latest tag to the flow.

@tobby-yuan
Author

@sunya-ch Why does the pod pull the image kepler_model_server rather than quay.io/sustainable_computing_io/kepler_model_server?

@sunya-ch
Collaborator

sunya-ch commented Aug 28, 2023

Oh, I see. That is a different issue: we are missing the kustomize edit set image command for the kepler-model-server image.

In the kepler-model-server repo, we set the image in the Makefile here: https://github.com/sustainable-computing-io/kepler-model-server/blob/0bcf63a9625ff71dea293774bd761727ae859d24/Makefile#L73.
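
For context, here is a rough sketch of why the bare name fails. The pod events come from the Docker daemon, and Docker expands an image name without a registry host to Docker Hub, so kepler_model_server is looked up as docker.io/library/kepler_model_server:latest, which does not exist. The function name normalize_image is hypothetical and the rules are simplified; other runtimes may resolve short names differently.

```shell
# Hypothetical helper: approximates how Docker expands a short image name.
# Simplified sketch, not the daemon's actual implementation.
normalize_image() (
  ref=$1
  case "$ref" in
    */*)
      first=${ref%%/*}
      case "$first" in
        *.*|*:*|localhost) ;;             # first component looks like a registry host
        *) ref="docker.io/${ref}" ;;      # user/repo -> docker.io/user/repo
      esac
      ;;
    *) ref="docker.io/library/${ref}" ;;  # bare name -> Docker Hub "library" namespace
  esac
  last=${ref##*/}
  case "$last" in
    *:*|*@*) ;;                           # a tag or digest is already present
    *) ref="${ref}:latest" ;;             # otherwise the default tag is assumed
  esac
  printf '%s\n' "$ref"
)

normalize_image kepler_model_server
normalize_image quay.io/sustainable_computing_io/kepler_model_server:v0.6
```

The first call prints the Docker Hub path that the kubelet events complain about; the fully qualified name in the second call passes through unchanged.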

@sunya-ch
Collaborator

Please confirm these two PRs.

sustainable-computing-io/kepler-model-server#129
#891
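
Until those PRs are merged, one possible local workaround is to patch the generated manifest before applying it. The sketch below assumes the manifest contains a line of the form "image: kepler_model_server" (inferred from the pod description, not verified against the file). The function name patch_estimator_image is hypothetical, and the v0.6 tag is the one mentioned earlier in this thread.

```shell
# Hypothetical helper: rewrites the unqualified estimator image in a manifest
# and prints the patched manifest to stdout (the input file is not modified).
patch_estimator_image() (
  manifest=$1
  image=${2:-quay.io/sustainable_computing_io/kepler_model_server:v0.6}
  # Match only lines whose image value is exactly "kepler_model_server".
  sed -E "s#^([[:space:]]*(- )?image:[[:space:]]*)kepler_model_server[[:space:]]*\$#\\1${image}#" "$manifest"
)

# Possible usage (paths as in the original report):
#   patch_estimator_image _output/generated-manifest/deployment.yaml > /tmp/deployment.yaml
#   kubectl apply -f /tmp/deployment.yaml
```

Lines that already carry a fully qualified image, such as the kepler exporter image, are left untouched.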

@tobby-yuan
Author

@sunya-ch When will this issue be fixed?

If I want to work around the bug in the meantime, can I change the image from kepler_model_server to quay.io/sustainable_computing_io/kepler_model_server directly using kubectl edit -n kepler daemonset kepler-exporter?

@sunya-ch
Collaborator

Yes, you can directly change the image to quay.io/sustainable_computing_io/kepler_model_server:v0.6 in the daemonset.
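
As a non-interactive alternative to kubectl edit, the same change could be sketched with kubectl set image. The container name estimator and the DaemonSet name kepler-exporter are taken from the pod description above; the function name fix_estimator_image is hypothetical, and this has not been run against a real cluster here.

```shell
# Hypothetical helper: points the estimator sidecar at the published image.
# Arguments: [namespace] [image]; defaults follow the names used in this thread.
fix_estimator_image() (
  namespace=${1:-kepler}
  image=${2:-quay.io/sustainable_computing_io/kepler_model_server:v0.6}
  kubectl set image daemonset/kepler-exporter "estimator=${image}" -n "$namespace"
)

# Possible usage:
#   fix_estimator_image
#   kubectl rollout status daemonset/kepler-exporter -n kepler
```

The DaemonSet controller then replaces the pods, so the pod name changes after the update.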

@Lai-Kenny

Hi @sunya-ch. I found both make build-manifest OPTS="ESTIMATOR_SIDECAR_DEPLOY" and make build-manifest OPTS="MODEL_SERVER_DEPLOY". It seems they have switched to the model server, right?

@tobby-yuan
Author

@sunya-ch I changed the model server image to quay.io/sustainable_computing_io/kepler_model_server:v0.6 and the exporter image to quay.io/sustainable_computing_io/kepler:latest-libbpf in the daemonset.

These are the logs of the Kepler exporter:

kubectl logs -n kepler kepler-exporter-f7fmw kepler-exporter
I0828 11:44:43.463430       1 gpu.go:46] Failed to init nvml, err: could not init nvml: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0828 11:44:43.468290       1 qat.go:35] Failed to init qat-telemtry err: could not get qat status exit status 127
I0828 11:44:43.477644       1 exporter.go:158] Kepler running on version: fad28b7
I0828 11:44:43.477694       1 config.go:270] using gCgroup ID in the BPF program: true
I0828 11:44:43.477748       1 config.go:272] kernel version: 5.15
I0828 11:44:43.477810       1 exporter.go:170] LibbpfBuilt: true, BccBuilt: false
I0828 11:44:43.477846       1 exporter.go:189] EnabledBPFBatchDelete: true
I0828 11:44:43.477890       1 rapl_msr_util.go:129] failed to open path /dev/cpu/0/msr: no such file or directory
I0828 11:44:43.477995       1 power.go:71] Unable to obtain power, use estimate method
I0828 11:44:43.478043       1 redfish.go:173] failed to initialize node credential: no supported node credential implementation
I0828 11:44:43.478050       1 power.go:56] use acpi to obtain power
I0828 11:44:43.478243       1 acpi.go:67] Could not find any ACPI power meter path. Is it a VM?
I0828 11:44:43.493568       1 container_energy.go:109] Using the Ratio/DynPower Power Model to estimate Container Platform Power
I0828 11:44:43.493589       1 container_energy.go:118] Using the Ratio/DynPower Power Model to estimate Container Component Power
I0828 11:44:43.493617       1 process_power.go:108] Using the Ratio/DynPower Power Model to estimate Process Platform Power
I0828 11:44:43.493627       1 process_power.go:117] Using the Ratio/DynPower Power Model to estimate Process Component Power
I0828 11:44:43.493803       1 node_platform_energy.go:53] Using the LinearRegressor/AbsPower Power Model to estimate Node Platform Power
I0828 11:44:44.709229       1 node_component_energy.go:54] Using the EstimatorSidecar/AbsPower Power Model to estimate Node Component Power
I0828 11:44:44.709405       1 exporter.go:212] Initializing the GPU collector
I0828 11:44:50.714948       1 watcher.go:66] Using in cluster k8s config
libbpf: loading /var/lib/kepler/bpfassets/amd64_kepler.bpf.o
libbpf: elf: section(3) tracepoint/sched/sched_switch, size 2344, link 0, flags 6, type=1
libbpf: sec 'tracepoint/sched/sched_switch': found program 'kepler_trace' at insn offset 0 (0 bytes), code size 293 insns (2344 bytes)
libbpf: elf: section(4) .reltracepoint/sched/sched_switch, size 352, link 29, flags 40, type=9
libbpf: elf: section(5) tracepoint/irq/softirq_entry, size 144, link 0, flags 6, type=1
libbpf: sec 'tracepoint/irq/softirq_entry': found program 'kepler_irq_trace' at insn offset 0 (0 bytes), code size 18 insns (144 bytes)
libbpf: elf: section(6) .reltracepoint/irq/softirq_entry, size 16, link 29, flags 40, type=9
libbpf: elf: section(7) .maps, size 352, link 0, flags 3, type=1
libbpf: elf: section(8) license, size 4, link 0, flags 3, type=1
libbpf: license of /var/lib/kepler/bpfassets/amd64_kepler.bpf.o is GPL
libbpf: elf: section(19) .BTF, size 5759, link 0, flags 0, type=1
libbpf: elf: section(21) .BTF.ext, size 2120, link 0, flags 0, type=1
libbpf: elf: section(29) .symtab, size 1056, link 1, flags 0, type=2
libbpf: looking for externs among 44 symbols...
libbpf: collected 0 externs total
libbpf: map 'processes': at sec_idx 7, offset 0.
libbpf: map 'processes': found type = 1.
libbpf: map 'processes': found key [6], sz = 4.
libbpf: map 'processes': found value [10], sz = 88.
libbpf: map 'processes': found max_entries = 32768.
libbpf: map 'pid_time': at sec_idx 7, offset 32.
libbpf: map 'pid_time': found type = 1.
libbpf: map 'pid_time': found key [6], sz = 4.
libbpf: map 'pid_time': found value [12], sz = 8.
libbpf: map 'pid_time': found max_entries = 32768.
libbpf: map 'cpu_cycles_hc_reader': at sec_idx 7, offset 64.
libbpf: map 'cpu_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_cycles': at sec_idx 7, offset 96.
libbpf: map 'cpu_cycles': found type = 2.
libbpf: map 'cpu_cycles': found key [6], sz = 4.
libbpf: map 'cpu_cycles': found value [12], sz = 8.
libbpf: map 'cpu_cycles': found max_entries = 128.
libbpf: map 'cpu_ref_cycles_hc_reader': at sec_idx 7, offset 128.
libbpf: map 'cpu_ref_cycles_hc_reader': found type = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_ref_cycles_hc_reader': found max_entries = 128.
libbpf: map 'cpu_ref_cycles': at sec_idx 7, offset 160.
libbpf: map 'cpu_ref_cycles': found type = 2.
libbpf: map 'cpu_ref_cycles': found key [6], sz = 4.
libbpf: map 'cpu_ref_cycles': found value [12], sz = 8.
libbpf: map 'cpu_ref_cycles': found max_entries = 128.
libbpf: map 'cpu_instr_hc_reader': at sec_idx 7, offset 192.
libbpf: map 'cpu_instr_hc_reader': found type = 4.
libbpf: map 'cpu_instr_hc_reader': found key [2], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found value [6], sz = 4.
libbpf: map 'cpu_instr_hc_reader': found max_entries = 128.
libbpf: map 'cpu_instr': at sec_idx 7, offset 224.
libbpf: map 'cpu_instr': found type = 2.
libbpf: map 'cpu_instr': found key [6], sz = 4.
libbpf: map 'cpu_instr': found value [12], sz = 8.
libbpf: map 'cpu_instr': found max_entries = 128.
libbpf: map 'cache_miss_hc_reader': at sec_idx 7, offset 256.
libbpf: map 'cache_miss_hc_reader': found type = 4.
libbpf: map 'cache_miss_hc_reader': found key [2], sz = 4.
libbpf: map 'cache_miss_hc_reader': found value [6], sz = 4.
libbpf: map 'cache_miss_hc_reader': found max_entries = 128.
libbpf: map 'cache_miss': at sec_idx 7, offset 288.
libbpf: map 'cache_miss': found type = 2.
libbpf: map 'cache_miss': found key [6], sz = 4.
libbpf: map 'cache_miss': found value [12], sz = 8.
libbpf: map 'cache_miss': found max_entries = 128.
libbpf: map 'cpu_freq_array': at sec_idx 7, offset 320.
libbpf: map 'cpu_freq_array': found type = 2.
libbpf: map 'cpu_freq_array': found key [6], sz = 4.
libbpf: map 'cpu_freq_array': found value [6], sz = 4.
libbpf: map 'cpu_freq_array': found max_entries = 128.
libbpf: sec '.reltracepoint/sched/sched_switch': collecting relocation for section(3) 'tracepoint/sched/sched_switch'
libbpf: sec '.reltracepoint/sched/sched_switch': relo #0: insn #17 against 'cpu_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 2 (cpu_cycles_hc_reader, sec 7, off 64) for insn #17
libbpf: sec '.reltracepoint/sched/sched_switch': relo #1: insn #36 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #36
libbpf: sec '.reltracepoint/sched/sched_switch': relo #2: insn #50 against 'cpu_cycles'
libbpf: prog 'kepler_trace': found map 3 (cpu_cycles, sec 7, off 96) for insn #50
libbpf: sec '.reltracepoint/sched/sched_switch': relo #3: insn #55 against 'cpu_ref_cycles_hc_reader'
libbpf: prog 'kepler_trace': found map 4 (cpu_ref_cycles_hc_reader, sec 7, off 128) for insn #55
libbpf: sec '.reltracepoint/sched/sched_switch': relo #4: insn #68 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #68
libbpf: sec '.reltracepoint/sched/sched_switch': relo #5: insn #82 against 'cpu_ref_cycles'
libbpf: prog 'kepler_trace': found map 5 (cpu_ref_cycles, sec 7, off 160) for insn #82
libbpf: sec '.reltracepoint/sched/sched_switch': relo #6: insn #87 against 'cpu_instr_hc_reader'
libbpf: prog 'kepler_trace': found map 6 (cpu_instr_hc_reader, sec 7, off 192) for insn #87
libbpf: sec '.reltracepoint/sched/sched_switch': relo #7: insn #104 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #104
libbpf: sec '.reltracepoint/sched/sched_switch': relo #8: insn #117 against 'cpu_instr'
libbpf: prog 'kepler_trace': found map 7 (cpu_instr, sec 7, off 224) for insn #117
libbpf: sec '.reltracepoint/sched/sched_switch': relo #9: insn #122 against 'cache_miss_hc_reader'
libbpf: prog 'kepler_trace': found map 8 (cache_miss_hc_reader, sec 7, off 256) for insn #122
libbpf: sec '.reltracepoint/sched/sched_switch': relo #10: insn #134 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #134
libbpf: sec '.reltracepoint/sched/sched_switch': relo #11: insn #148 against 'cache_miss'
libbpf: prog 'kepler_trace': found map 9 (cache_miss, sec 7, off 288) for insn #148
libbpf: sec '.reltracepoint/sched/sched_switch': relo #12: insn #156 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #156
libbpf: sec '.reltracepoint/sched/sched_switch': relo #13: insn #170 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #170
libbpf: sec '.reltracepoint/sched/sched_switch': relo #14: insn #182 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #182
libbpf: sec '.reltracepoint/sched/sched_switch': relo #15: insn #206 against 'cpu_freq_array'
libbpf: prog 'kepler_trace': found map 10 (cpu_freq_array, sec 7, off 320) for insn #206
libbpf: sec '.reltracepoint/sched/sched_switch': relo #16: insn #215 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #215
libbpf: sec '.reltracepoint/sched/sched_switch': relo #17: insn #223 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #223
libbpf: sec '.reltracepoint/sched/sched_switch': relo #18: insn #235 against 'pid_time'
libbpf: prog 'kepler_trace': found map 1 (pid_time, sec 7, off 32) for insn #235
libbpf: sec '.reltracepoint/sched/sched_switch': relo #19: insn #241 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #241
libbpf: sec '.reltracepoint/sched/sched_switch': relo #20: insn #261 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #261
libbpf: sec '.reltracepoint/sched/sched_switch': relo #21: insn #287 against 'processes'
libbpf: prog 'kepler_trace': found map 0 (processes, sec 7, off 0) for insn #287
libbpf: sec '.reltracepoint/irq/softirq_entry': collecting relocation for section(5) 'tracepoint/irq/softirq_entry'
libbpf: sec '.reltracepoint/irq/softirq_entry': relo #0: insn #5 against 'processes'
libbpf: prog 'kepler_irq_trace': found map 0 (processes, sec 7, off 0) for insn #5
libbpf: map 'processes': created successfully, fd=10
libbpf: map 'pid_time': created successfully, fd=11
libbpf: map 'cpu_cycles_hc_reader': created successfully, fd=12
libbpf: map 'cpu_cycles': created successfully, fd=13
libbpf: map 'cpu_ref_cycles_hc_reader': created successfully, fd=14
libbpf: map 'cpu_ref_cycles': created successfully, fd=15
libbpf: map 'cpu_instr_hc_reader': created successfully, fd=16
libbpf: map 'cpu_instr': created successfully, fd=17
libbpf: map 'cache_miss_hc_reader': created successfully, fd=18
libbpf: map 'cache_miss': created successfully, fd=19
libbpf: map 'cpu_freq_array': created successfully, fd=20
I0828 11:44:50.823749       1 libbpf_attacher.go:143] failed to get perf event cpu_instructions_hc_reader: failed to find BPF map cpu_instructions_hc_reader: no such file or directory
I0828 11:44:50.824146       1 libbpf_attacher.go:157] Successfully load eBPF module from libbpf object
I0828 11:44:50.857140       1 exporter.go:276] Started Kepler in 7.379527116s

Is this right? Is Kepler working normally?

@rootfs
Contributor

rootfs commented Aug 28, 2023

This looks right; Kepler started successfully.
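
The line to look for in the log above is "Started Kepler in ...". A minimal sketch of checking for it without reading the full libbpf output follows; the function name kepler_started is hypothetical, and the check only confirms that the startup message was printed, not that metrics are correct.

```shell
# Hypothetical helper: reads log text on stdin and succeeds if the
# exporter's startup message is present.
kepler_started() {
  grep -q 'Started Kepler in '
}

# Possible usage (pod name as in the log command above):
#   kubectl logs -n kepler kepler-exporter-f7fmw -c kepler-exporter | kepler_started \
#     && echo "kepler exporter started"
```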

rootfs closed this as completed on Aug 28, 2023