Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[serve] High cardinality for metrics that include the HTTP route #47999

Open
edoakes opened this issue Oct 11, 2024 · 2 comments · May be fixed by #48290
Open

[serve] High cardinality for metrics that include the HTTP route #47999

edoakes opened this issue Oct 11, 2024 · 2 comments · May be fixed by #48290
Assignees
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue

Comments

@edoakes
Copy link
Contributor

edoakes commented Oct 11, 2024

In some of our metrics, we include the HTTP route as a tag. If users include data with high cardinality in their HTTP requests, such as a per-user ID, this blows up the prometheus metrics (and can render the metrics unusable).

We need to reduce the cardinality here, perhaps by only exporting the Serve-level route_prefix instead of the full route.

@edoakes edoakes added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue labels Oct 11, 2024
@antoniomdk
Copy link

+1 to this. I think apart from worsening the UX, for users than ingest the metrics into other providers rather than hosting a Prometheus server, high cardinality metrics can blow up costs, e.g. https://docs.datadoghq.com/account_management/billing/custom_metrics/?tab=countrate

@edoakes
Copy link
Contributor Author

edoakes commented Oct 16, 2024

Ok, I did some prototyping here. We have metrics containing the route in two places: the proxy and the replica.

In the proxy, we don't have access to application-defined routes (by design) so we can't do anything too clever. We could try to do something like auto-detect the cardinality and cap the number of tags, but that seems excessively complex.

In the replica, we do have access to the underlying ASGI app which we can use to identify the matched route string (e.g., /path/{wildcard}.

So I'd propose that we:

  • Update proxy metrics to export the matched route_prefix under the existing route tag.
  • Update replica metrics to export the matched route handler string for FastAPI applications, else the route_prefix for applications that use the raw ASGI request.

We could consider changing the metric tag for the proxy metrics to route_prefix for clarity, but that introduces a migration for what seems to me like a very minor improvement.

@edoakes edoakes linked a pull request Oct 28, 2024 that will close this issue
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants