High memory usage (>50Gi) when scraping Prometheus metrics #1358

Open
nar-git opened this issue Sep 30, 2024 · 0 comments
nar-git commented Sep 30, 2024

Describe the bug
High memory usage (>50Gi) when scraping Prometheus metrics in an EKS-on-EC2 cluster using the CloudWatch agent. Our cluster has the resources listed below; the agent's memory limit is set to 50Gi and the container gets OOMKilled every 5 minutes (a sketch of the relevant resource limits follows the table).

| Resource | Count |
| --- | --- |
| pods | 429 |
| namespaces (99% empty) | 57776 |
| endpoints | 255 |
| services | 254 |
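
The memory limit is set on the agent container in the Deployment; a minimal sketch of that resources stanza follows, where only the 50Gi limit is taken from our setup and the request values are illustrative:

```yaml
# Sketch of the cloudwatch-agent container's resources stanza (not the full Deployment).
# Only the 50Gi memory limit reflects our setup; the requests are placeholder values.
resources:
  requests:
    cpu: "1"        # illustrative
    memory: 4Gi     # illustrative
  limits:
    memory: 50Gi    # the agent repeatedly hits this limit and is OOMKilled
```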

Steps to reproduce
Deploy the CloudWatch agent as a Kubernetes Deployment in our cluster with the configurations below (a sketch of the Deployment wiring follows the JSON config).

```yaml
  prometheus.yaml: |
    global:
      evaluation_interval: 1m
      scrape_interval: 30s
      scrape_timeout: 10s
    scrape_configs:
    - honor_labels: true
      job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      metric_relabel_configs:
      - action: drop
        source_labels:
        - instance
```
"logs": {
    "metrics_collected": {
      "prometheus": {
        "cluster_name": "<name>",
        "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
        "log_group_name": "/aws/containerinsights/"<name>",/cwagent-prometheus/performance",
        "emf_processor": {
          "metric_declaration": [
            {
              "source_labels": [
                "namespace"
              ],
              "label_matcher": "<removed>",
              "dimensions": [
                [
                  "namespace",
                  "ClusterName",
                  "pod",
                  "container"
                ],
                [
                  "namespace",
                  "ClusterName",
                  "pod"
                ],
                [
                  "namespace",
                  "ClusterName"
                ],
                [
                  "namespace",
                  "ClusterName",
                  "pod",
                  "container",
                  "reason"
                ],
                [
                  "namespace",
                  "ClusterName",
                  "pod",
                  "reason"
                ],
                [
                  "namespace",
                  "ClusterName",
                  "reason"
                ]
              ],
              "metric_selectors": [
                ".*"
              ]
            }
          ]
        }
      }
    }
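
The agent itself runs as a Kubernetes Deployment that mounts the two configurations above from ConfigMaps. A minimal sketch of that wiring, with illustrative ConfigMap/volume names rather than our exact manifest (only the image tag and the /etc/prometheusconfig mount path come from the details above):

```yaml
# Minimal sketch of the Deployment wiring, assuming ConfigMaps named
# "prometheus-config" (prometheus.yaml) and "cwagentconfig" (the agent JSON).
# Names are illustrative; resource limits are omitted here (see the sketch above, memory limit 50Gi).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cwagent-prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cwagent-prometheus
  template:
    metadata:
      labels:
        app: cwagent-prometheus
    spec:
      containers:
      - name: cloudwatch-agent
        image: public.ecr.aws/cloudwatch-agent/cloudwatch-agent:1.300046.0b833
        volumeMounts:
        - name: prometheus-config
          mountPath: /etc/prometheusconfig   # matches prometheus_config_path above
        - name: cwagentconfig
          mountPath: /etc/cwagentconfig      # assumed mount path for the agent JSON config
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: cwagentconfig
        configMap:
          name: cwagentconfig
```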

What did you expect to see?
We expected the agent to use much less memory (<10Gi).

What did you see instead?
Very high memory usage (~60Gi).

What version did you use?
cloudwatch-agent:1.300046.0b833

Environment
OS: Amazon Linux 2 - 5.10.224-212.876.amzn2.x86_64

@nar-git nar-git changed the title High memory usage (>60Gi) when scraping Prometheus metrics High memory usage (>50Gi) when scraping Prometheus metrics Sep 30, 2024