
Any way of getting estimated consumer lag in seconds in promql? #182

Open

sebw91 opened this issue Jan 6, 2023 · 6 comments
sebw91 commented Jan 6, 2023

KMinion works great, thank you.

Does anyone have a way of computing an estimated consumer time lag in PromQL?

I think we'd have to somehow join two series, `kminion_kafka_consumer_group_topic_offset_sum` and `kminion_kafka_topic_high_water_mark_sum`.

Conceptually, the query would be something along the lines of:

```
time() - time_at_value(kminion_kafka_topic_high_water_mark_sum, kminion_kafka_consumer_group_topic_offset_sum + 1)
```

where `time_at_value` would be a way of getting the timestamp at which a series reached a given value. Not something that exists in Prometheus.
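
For reference, the plain message-count lag from those two series can be computed with a join on the shared label; a sketch, assuming both metrics carry `topic_name` and the consumer metric additionally carries `group_id`:

```
# Offset lag in messages per (group_id, topic_name); group_right keeps the
# labels of the consumer side, since several groups may consume one topic
kminion_kafka_topic_high_water_mark_sum
  - on (topic_name) group_right
kminion_kafka_consumer_group_topic_offset_sum
```

(KMinion also exports `kminion_kafka_consumer_group_topic_lag`, which gives this message-count lag directly, but not a lag in time.)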

weeco (Contributor) commented Jan 6, 2023

Hey @sebw91,
yes, an approximate time lag is possible, and I fully support the idea. Lag should really be exported as time, because that is what users actually want to know.

I have thought about how to solve this in the past and had a few different ideas. There is one exporter that uses interpolation; see https://github.com/seglo/kafka-lag-exporter for more information. Implementing this is a bigger effort, and I currently don't plan to spend that amount of time on KMinion. If you are interested in trying it, I'd suggest coming up with a proposal that we can discuss here before starting the implementation. It's not trivial to implement in a way that scales to larger clusters, and that would be a requirement for KMinion.

sebw91 (Author) commented Jan 6, 2023

Thanks a lot for the info. I was hoping something would be possible in PromQL. From what I can see, all the data needed to compute a very rough estimate is already there. I'd be fine without interpolation; just a lower bound, derived from when the topic's high water mark passed the consumer group's offset, would be sufficient to get a timestamp. That may not be possible, though.

weeco (Contributor) commented Jan 6, 2023

Oh, I see what you mean. You are saying that the information about when a certain high water mark in a partition was reached is already stored in Prometheus (at least up to the retention period), so the interpolation logic could somehow be expressed in PromQL.

That's indeed a good idea! I'm not sure whether it's possible with the available PromQL functions, but it's definitely worth a try!

sebw91 (Author) commented Jan 6, 2023

I think there is a way (kind of), using the `offset` PromQL modifier! If the high water mark of a topic 5 minutes ago is greater than the consumer group's current offset sum, then we know we are at least 5 minutes behind. For example:

```
kminion_kafka_topic_high_water_mark_sum offset 5m > on (topic_name) kminion_kafka_consumer_group_topic_offset_sum + 1
```

I will continue exploring on my side, but this should do the trick for us.
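
One caveat: if more than one consumer group consumes the topic, the match above becomes many-to-one, and PromQL then requires a grouping modifier. A sketch that keeps the per-group labels and, via the `bool` modifier, yields a 0/1 "at least 5 minutes behind" signal per group:

```
# 1 if the group is at least 5 minutes behind on the topic, else 0
(kminion_kafka_consumer_group_topic_offset_sum + 1)
  < bool on (topic_name) group_left
(kminion_kafka_topic_high_water_mark_sum offset 5m)
```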

hhromic (Contributor) commented Jan 12, 2023

This is an interesting request/subject. As mentioned already, Kafka itself has no notion of consumer lag in time units (seconds), probably because that depends on how fast a consumer can consume, or is consuming, a given partition; more generally, on the consumer's current/expected consumption throughput.

For this reason, we approximate the consumer lag (all-partitions mode) in seconds by dividing it by the rate of the topic's high water mark, i.e. the production rate, taken as a proxy for the expected consumption rate:

```
sum(kminion_kafka_consumer_group_topic_lag{job=~"$job",group_id=~"$group_id"})
  by (group_id,topic_name) / on (topic_name)
  group_left sum(rate(kminion_kafka_topic_high_water_mark_sum{job=~"$job"}[$__rate_interval]))
    by (group_id,topic_name)
```

This is used in Grafana, hence the `$__rate_interval` variable, which can be replaced by a static `rate()` range.
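
Outside Grafana, e.g. in a recording or alerting rule, the same idea works with a static window; a sketch, assuming a 5m window and with the job/group filters dropped:

```
# Estimated lag in seconds = messages behind / messages produced per second
sum(kminion_kafka_consumer_group_topic_lag) by (group_id, topic_name)
  / on (topic_name) group_left
sum(rate(kminion_kafka_topic_high_water_mark_sum[5m])) by (topic_name)
```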

Hope this is useful somehow :)

sebw91 (Author) commented Jan 12, 2023

@hhromic That's a clever PromQL query, and very useful. Thanks very much. I think this is accurate enough for my use case.
