Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support scalar resources metric #3937

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zedongh
Copy link
Contributor

@zedongh zedongh commented Dec 29, 2024

Add metric

queue_allocated_scalar_resources{queue_name="<queue>", resource="<resource>"}
queue_request_scalar_resources{queue_name="<queue>", resource="<resource>"}
queue_deserved_scalar_resources{queue_name="<queue>", resource="<resource>"}

/close #3931

Need discuss:
Shall we merge cpu, memory bulitin resources into scalar resource ?

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 29, 2024
@JesseStutler
Copy link
Member

ScalerResources is used for extension resources. Keep CPU, and memory not to merge them into scalerResources are fine.

@JesseStutler
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 30, 2024
@JesseStutler
Copy link
Member

/area stable-metrics

@volcano-sh-bot
Copy link
Contributor

@JesseStutler: The label(s) area/stable-metrics cannot be applied. These labels are supported: ``

In response to this:

/area stable-metrics

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@archlitchi
Copy link
Contributor

/lgtm

@yccharles
Copy link

how about capacity ? such as :

volcano_queue_capacity_cpu
volcano_queue_capacity_mem
volcano_queue_capacity_scalar_resource

@JesseStutler
Copy link
Member

how about capacity ? such as :

volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

cc @Monokaix

@Monokaix
Copy link
Member

how about capacity ? such as :

volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

You mean add the capability of each dimension resource? The cpu and memory metrics is already exposed and we cannot modify its metric name for compatibility: )

@yccharles
Copy link

how about capacity ? such as :
volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

You mean add the capability of each dimension resource? The cpu and memory metrics is already exposed and we cannot modify its metric name for compatibility: )

there is no metrics about capacity in both vc-sheduler and vc-controller.

any other place ?

@JesseStutler
Copy link
Member

how about capacity ? such as :
volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

You mean add the capability of each dimension resource? The cpu and memory metrics is already exposed and we cannot modify its metric name for compatibility: )

there is no metrics about capacity in both vc-sheduler and vc-controller.

any other place ?

https://github.com/volcano-sh/volcano/blob/master/docs/design/metrics.md

@Monokaix
Copy link
Member

Plesse also update doc https://github.com/volcano-sh/volcano/blob/master/docs/design/metrics.md and volcano-monitoring-development.yaml to include the new metrics: )

@lowang-bh
Copy link
Member

Any test results will be more appreciated:

  1. queue's allocated gpu become zero, so does the metric
  2. queue is removed, so does the metric

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 1, 2025
@volcano-sh-bot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 1, 2025
@zedongh
Copy link
Contributor Author

zedongh commented Jan 1, 2025

some changes update:

  1. update queue metric doc
  2. add unit test for queue allocated/request/deserved resource

@zedongh
Copy link
Contributor Author

zedongh commented Jan 1, 2025

some changes update:

  1. update queue metric doc
  2. add unit test for queue allocated/request/deserved resource

@Monokaix @lowang-bh

And

volcano-monitoring-development.yaml to include the new metrics
need clearify.

Update volcano-monitoring-latest.yaml dashboard?

The original cpu/memeory metrics are not present in dashboard. Update later if necessary.

@yccharles
Copy link

how about capacity ? such as :
volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

You mean add the capability of each dimension resource? The cpu and memory metrics is already exposed and we cannot modify its metric name for compatibility: )

there is no metrics about capacity in both vc-sheduler and vc-controller.
any other place ?

https://github.com/volcano-sh/volcano/blob/master/docs/design/metrics.md

We also need export new metrics to describe each resource's capacity of a queue.
Current there is no metrics about capacity metrics.

scene:

  1. queue's UsageRate = request / capacity
  2. add alert when queue's capacity is almost exhaust

The proposal has been agreed by @JesseStutler . Can you implement this feature? @zedongh

@lowang-bh
Copy link
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 2, 2025
@Monokaix
Copy link
Member

Monokaix commented Jan 2, 2025

volcano-monitoring-latest.yaml

If its not present in volcano-monitoring-latest.yaml we can skip add it, and add what @yccharles said is ok: )

@JesseStutler
Copy link
Member

how about capacity ? such as :
volcano_queue_capacity_cpu volcano_queue_capacity_mem volcano_queue_capacity_scalar_resource

You mean add the capability of each dimension resource? The cpu and memory metrics is already exposed and we cannot modify its metric name for compatibility: )

there is no metrics about capacity in both vc-sheduler and vc-controller.
any other place ?

https://github.com/volcano-sh/volcano/blob/master/docs/design/metrics.md

We also need export new metrics to describe each resource's capacity of a queue. Current there is no metrics about capacity metrics.

scene:

  1. queue's UsageRate = request / capacity
  2. add alert when queue's capacity is almost exhaust

The proposal has been agreed by @JesseStutler . Can you implement this feature? @zedongh

@yccharles Your capacity should mean the set value of Guarantee/Deserved/Capability, right?
You can also describe it in detail in the original issue #3931, and then ask @zedongh to confirm the requirements.

@zedongh
Copy link
Contributor Author

zedongh commented Jan 4, 2025

New commit for queue capacity metric. @yccharles

@volcano-sh-bot
Copy link
Contributor

@yccharles: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

metrics.UpdateQueueDeserved(attr.name, attr.deserved.MilliCPU, attr.deserved.Memory, attr.deserved.ScalarResources)
metrics.UpdateQueueAllocated(attr.name, attr.allocated.MilliCPU, attr.allocated.Memory, attr.allocated.ScalarResources)
metrics.UpdateQueueRequest(attr.name, attr.request.MilliCPU, attr.request.Memory, attr.request.ScalarResources)
metrics.UpdateQueueCapacity(attr.name, attr.realCapability.MilliCPU, attr.realCapability.Memory, attr.realCapability.ScalarResources)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capacity you ask is capability here? @yccharles I misunderstood the meaning of capacity before and wanted to confirm it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder.

  1. what i want is queue's capacity . not realCapability.xxx
  2. but, if realCapability.xxx can be exported is really a good idea. such as : queue_real_capability_xxx ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the reminder.

  1. what i want is queue's capacity . not realCapability.xxx
  2. but, if realCapability.xxx can be exported is really a good idea. such as : queue_real_capability_xxx ?

updated with support both metrics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yccharles capacity refers to the capability value set by the user, right? Because there are too many words now, I am afraid that capacity and capability will be confused, we need to be clear. Besides, do we also need the deserved and guarantee values ​​set by the user?

| `unschedule_job_counts` | Gauge | None | The number of jobs could not be scheduled |
| `queue_allocated_milli_cpu` | Gauge | `queue_name`=&lt;queue_name&gt; | Allocated CPU count for one queue |
| `queue_allocated_memory_bytes` | Gauge | `queue_name`=&lt;queue_name&gt; | Allocated memory for one queue |
| `queue_allocated_scalar_resources` | Gauge | `queue_name`=&lt;queue_name&gt;, `resource`=&lt;resource_name&gt; | Allocated scalar resource for one queue |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add queue_capacity_xxx metrics here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with queue_capacity_xxx and queue_real_capacity_xxx metric docs

@zedongh zedongh force-pushed the feat-support-scalar-resource-metric branch from 5f58a33 to c8260b0 Compare January 6, 2025 16:01
@lowang-bh
Copy link
Member

please squash commits to only one.

@zedongh
Copy link
Contributor Author

zedongh commented Jan 11, 2025

please squash commits to only one.

capacity metric part need disscuss. Split it to another issue to make all things done?
@JesseStutler @yccharles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

why metrics do not export ScalarResources
7 participants