This module contains both Kafka-specific monitors and system monitors from our system module.
This module is part of a larger suite of modules that provide alerts in Datadog. Other modules can be found on the Terraform Registry.
We have two base modules we use to standardise development of our Monitor Modules:
- generic monitor: used in 90% of our alerts
- service check monitor
Modules are generated with this tool: https://github.com/kabisa/datadog-terraform-generator
```terraform
module "kafka" {
  source               = "kabisa/kafka/datadog"
  notification_channel = "[email protected]"
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"
}
```
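When consuming the module from the Terraform Registry, it is usually worth pinning it to a release so that upgrades are deliberate. A minimal sketch of that; the version constraint shown is illustrative only, not a specific tested release of this module:

```terraform
module "kafka" {
  source = "kabisa/kafka/datadog"
  # Illustrative constraint; pin to the release you have actually tested.
  version = "~> 1.0"

  notification_channel = "[email protected]"
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"
}
```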
Monitors:
Monitor name | Default enabled | Priority | Query |
---|---|---|---|
Bytesin_high | True | 3 | avg(last_30m):avg:kafka.net.bytes_in.rate{tag:xxx} by {host} > 5000000 |
Bytesout_high | True | 3 | avg(last_30m):avg:kafka.net.bytes_out.rate{tag:xxx} by {host} > 5000000 |
Fetch_purgatory_size | True | 3 | avg(last_30m):max:kafka.request.fetch_request_purgatory.size{tag:xxx} by {host} > 100000 |
In_sync_nodes_dropped | True | 2 | avg(last_5m):max:kafka.replication.isr_shrinks.rate{tag:xxx} by {aiven-service} - max:kafka.replication.isr_expands.rate{tag:xxx} by {aiven-service} > |
Leader_election_occurring | True | 3 | max(last_5m):avg:kafka.replication.leader_elections.rate{tag:xxx} by {aiven-service} > |
Multiple_active_controllers | True | 2 | avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} > 1 |
No_active_controllers | True | 2 | avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} < 1 |
Offline_partitions | True | 2 | avg(last_5m):max:kafka.replication.offline_partitions_count{tag:xxx} > |
Produce_purgatory_size | True | 3 | avg(last_15m):max:kafka.request.producer_request_purgatory.size{tag:xxx} > 200 |
Unclean_leader_election | True | 2 | avg(last_5m):max:kafka.replication.unclean_leader_elections.rate{tag:xxx} by {aiven-project} > |
Under_replicated_partitions | True | 3 | avg(last_15m):avg:kafka.replication.under_replicated_partitions{tag:xxx} by {aiven-service} > |
Unusual_consumer_fetch_time | True | 3 | avg(last_30m):avg:kafka.request.fetch_consumer.time.avg{tag:xxx} > 1000 |
Unusual_fetch_failures | True | 3 | avg(last_30m):avg:kafka.request.fetch.failed.rate{tag:xxx} > 50 |
Unusual_follower_fetch_time | True | 3 | avg(last_30m):avg:kafka.request.fetch_follower.time.avg{tag:xxx} > 1000 |
Unusual_produce_failures | True | 3 | avg(last_30m):avg:kafka.request.produce.failed.rate{tag:xxx} > 50 |
Unusual_produce_time | True | 3 | avg(last_30m):avg:kafka.request.produce.time.avg{tag:xxx} > 20 |
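Every monitor in the table above can be tuned or switched off through the module's input variables, which are documented per monitor in the sections below. The snippet that follows is a hedged sketch of that pattern: the variable names come from those tables, but the override values are purely illustrative.

```terraform
module "kafka" {
  source               = "kabisa/kafka/datadog"
  notification_channel = "[email protected]"
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"

  # Tighten the Bytesin_high thresholds and shorten its evaluation window (illustrative values).
  bytesin_high_warning           = 2000000
  bytesin_high_critical          = 4000000
  bytesin_high_evaluation_period = "last_15m"

  # Drop a monitor entirely if it is not useful for this cluster.
  fetch_purgatory_size_enabled = false

  # Keep a monitor defined in Datadog but stop it from notifying.
  unusual_produce_time_alerting_enabled = false
}
```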
pre-commit is used for Terraform linting and validation.
Steps:
- Install pre-commit, e.g. `brew install pre-commit`.
- Run `pre-commit install` in this repo. (Every time you clone a repo with pre-commit enabled you will need to run the `pre-commit install` command.)
- That’s it! Now every time you commit a code change (`.tf` file), the hooks in the `hooks:` config of `.pre-commit-config.yaml` will execute.
NOTE: This is based on a baseline and might need adjusting further down the road.
Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_30m):avg:kafka.net.bytes_in.rate{tag:xxx} by {host} > 5000000
variable | default | required | description |
---|---|---|---|
bytesin_high_enabled | True | No | |
bytesin_high_warning | 2500000 | No | |
bytesin_high_critical | 5000000 | No | |
bytesin_high_evaluation_period | last_30m | No | |
bytesin_high_note | "" | No | |
bytesin_high_docs | NOTE: This is based on a baseline and might need adjusting further down the road. Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
bytesin_high_filter_override | "" | No | |
bytesin_high_alerting_enabled | True | No | |
bytesin_high_priority | 3 | No | Number from 1 (high) to 5 (low). |
NOTE: This is based on a baseline and might need adjusting further down the road.
Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_30m):avg:kafka.net.bytes_out.rate{tag:xxx} by {host} > 5000000
variable | default | required | description |
---|---|---|---|
bytesout_high_enabled | True | No | |
bytesout_high_warning | 2500000 | No | |
bytesout_high_critical | 5000000 | No | |
bytesout_high_evaluation_period | last_30m | No | |
bytesout_high_note | "" | No | |
bytesout_high_docs | NOTE: This is based on a baseline and might need adjusting further down the road. Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
bytesout_high_filter_override | "" | No | |
bytesout_high_alerting_enabled | True | No | |
bytesout_high_priority | 3 | No | Number from 1 (high) to 5 (low). |
The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. Fetch requests are added to purgatory if there is not enough data to fulfill the request (fetch.min.bytes on consumers) until the time specified by fetch.wait.max.ms is reached or enough data becomes available.
Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_30m):max:kafka.request.fetch_request_purgatory.size{tag:xxx} by {host} > 100000
variable | default | required | description |
---|---|---|---|
fetch_purgatory_size_enabled | True | No | |
fetch_purgatory_size_warning | 70000 | No | |
fetch_purgatory_size_critical | 100000 | No | |
fetch_purgatory_size_evaluation_period | last_30m | No | |
fetch_purgatory_size_note | "" | No | |
fetch_purgatory_size_docs | The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. Fetch requests are added to purgatory if there is not enough data to fulfill the request (fetch.min.bytes on consumers) until the time specified by fetch.wait.max.ms is reached or enough data becomes available. Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
fetch_purgatory_size_filter_override | "" | No | |
fetch_purgatory_size_alerting_enabled | True | No | |
fetch_purgatory_size_priority | 3 | No | Number from 1 (high) to 5 (low). |
The number of in-sync replicas (ISRs) for a particular partition should remain fairly static, except when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica could be removed from the ISR pool if it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_5m):max:kafka.replication.isr_shrinks.rate{tag:xxx} by {aiven-service} - max:kafka.replication.isr_expands.rate{tag:xxx} by {aiven-service} >
variable | default | required | description |
---|---|---|---|
in_sync_nodes_dropped_enabled | True | No | |
in_sync_nodes_dropped_warning | 0 | No | |
in_sync_nodes_dropped_critical | 0 | No | |
in_sync_nodes_dropped_evaluation_period | last_5m | No | |
in_sync_nodes_dropped_note | "" | No | |
in_sync_nodes_dropped_docs | The number of in-sync replicas (ISRs) for a particular partition should remain fairly static, except when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica could be removed from the ISR pool if it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
in_sync_nodes_dropped_filter_override | "" | No | |
in_sync_nodes_dropped_alerting_enabled | True | No | |
in_sync_nodes_dropped_priority | 2 | No | Number from 1 (high) to 5 (low). |
When a partition leader dies, an election for a new leader is triggered. A partition leader is considered “dead” if it fails to maintain its session with ZooKeeper. Unlike ZooKeeper’s Zab, Kafka does not employ a majority-consensus algorithm for leadership election. Instead, Kafka’s quorum is composed of the set of all in-sync replicas (ISRs) for a particular partition. Replicas are considered in-sync if they are caught-up to the leader, which means that any replica in the ISR can be promoted to the leader.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
max(last_5m):avg:kafka.replication.leader_elections.rate{tag:xxx} by {aiven-service} >
variable | default | required | description |
---|---|---|---|
leader_election_occurring_enabled | True | No | |
leader_election_occurring_warning | 0 | No | |
leader_election_occurring_critical | 0 | No | |
leader_election_occurring_evaluation_period | last_5m | No | |
leader_election_occurring_note | "" | No | |
leader_election_occurring_docs | When a partition leader dies, an election for a new leader is triggered. A partition leader is considered “dead” if it fails to maintain its session with ZooKeeper. Unlike ZooKeeper’s Zab, Kafka does not employ a majority-consensus algorithm for leadership election. Instead, Kafka’s quorum is composed of the set of all in-sync replicas (ISRs) for a particular partition. Replicas are considered in-sync if they are caught-up to the leader, which means that any replica in the ISR can be promoted to the leader. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
leader_election_occurring_filter_override | "" | No | |
leader_election_occurring_alerting_enabled | True | No | |
leader_election_occurring_require_full_window | False | No | |
leader_election_occurring_priority | 3 | No | Number from 1 (high) to 5 (low). |
The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} > 1
variable | default | required | description |
---|---|---|---|
multiple_active_controllers_enabled | True | No | |
multiple_active_controllers_warning | 0 | No | |
multiple_active_controllers_critical | 1 | No | |
multiple_active_controllers_evaluation_period | last_5m | No | |
multiple_active_controllers_note | "" | No | |
multiple_active_controllers_docs | The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
multiple_active_controllers_filter_override | "" | No | |
multiple_active_controllers_alerting_enabled | True | No | |
multiple_active_controllers_require_full_window | True | No | |
multiple_active_controllers_priority | 2 | No | Number from 1 (high) to 5 (low). |
The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} < 1
variable | default | required | description |
---|---|---|---|
no_active_controllers_enabled | True | No | |
no_active_controllers_warning | 0 | No | |
no_active_controllers_critical | 1 | No | |
no_active_controllers_evaluation_period | last_5m | No | |
no_active_controllers_note | "" | No | |
no_active_controllers_docs | The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
no_active_controllers_filter_override | "" | No | |
no_active_controllers_alerting_enabled | True | No | |
no_active_controllers_require_full_window | True | No | |
no_active_controllers_priority | 2 | No | Number from 1 (high) to 5 (low). |
This metric reports the number of partitions without an active leader. Because all read and write operations are only performed on partition leaders, you should alert on a non-zero value for this metric to prevent service interruptions. Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_5m):max:kafka.replication.offline_partitions_count{tag:xxx} >
variable | default | required | description |
---|---|---|---|
offline_partitions_enabled | True | No | |
offline_partitions_warning | 0 | No | |
offline_partitions_critical | 0 | No | |
offline_partitions_evaluation_period | last_5m | No | |
offline_partitions_note | "" | No | |
offline_partitions_docs | This metric reports the number of partitions without an active leader. Because all read and write operations are only performed on partition leaders, you should alert on a non-zero value for this metric to prevent service interruptions. Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
offline_partitions_filter_override | "" | No | |
offline_partitions_alerting_enabled | True | No | |
offline_partitions_require_full_window | True | No | |
offline_partitions_priority | 2 | No | Number from 1 (high) to 5 (low). |
The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. If request.required.acks=-1, all produce requests will end up in purgatory until the partition leader receives an acknowledgment from all followers.
Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_15m):max:kafka.request.producer_request_purgatory.size{tag:xxx} > 200
variable | default | required | description |
---|---|---|---|
produce_purgatory_size_enabled | True | No | |
produce_purgatory_size_warning | 100 | No | |
produce_purgatory_size_critical | 200 | No | |
produce_purgatory_size_evaluation_period | last_15m | No | |
produce_purgatory_size_note | "" | No | |
produce_purgatory_size_docs | The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. If request.required.acks=-1, all produce requests will end up in purgatory until the partition leader receives an acknowledgment from all followers. Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
produce_purgatory_size_filter_override | "" | No | |
produce_purgatory_size_alerting_enabled | True | No | |
produce_purgatory_size_require_full_window | True | No | |
produce_purgatory_size_priority | 3 | No | Number from 1 (high) to 5 (low). |
Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. Normally, when a broker that is the leader for a partition goes offline, a new leader is elected from the set of ISRs for the partition. Unclean leader election is disabled by default in Kafka version 0.11 and newer, meaning that a partition is taken offline if it does not have any ISRs to elect as the new leader. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. You should alert on this metric, as it signals data loss.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_5m):max:kafka.replication.unclean_leader_elections.rate{tag:xxx} by {aiven-project} >
variable | default | required | description |
---|---|---|---|
unclean_leader_election_enabled | True | No | |
unclean_leader_election_warning | 100 | No | |
unclean_leader_election_critical | 0 | No | |
unclean_leader_election_evaluation_period | last_5m | No | |
unclean_leader_election_note | "" | No | |
unclean_leader_election_docs | Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. Normally, when a broker that is the leader for a partition goes offline, a new leader is elected from the set of ISRs for the partition. Unclean leader election is disabled by default in Kafka version 0.11 and newer, meaning that a partition is taken offline if it does not have any ISRs to elect as the new leader. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. You should alert on this metric, as it signals data loss. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
unclean_leader_election_filter_override | "" | No | |
unclean_leader_election_alerting_enabled | True | No | |
unclean_leader_election_require_full_window | False | No | |
unclean_leader_election_priority | 2 | No | Number from 1 (high) to 5 (low). |
If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Query:
avg(last_15m):avg:kafka.replication.under_replicated_partitions{tag:xxx} by {aiven-service} >
variable | default | required | description |
---|---|---|---|
under_replicated_partitions_enabled | True | No | |
under_replicated_partitions_warning | 100 | No | |
under_replicated_partitions_critical | 0 | No | |
under_replicated_partitions_evaluation_period | last_15m | No | |
under_replicated_partitions_note | "" | No | |
under_replicated_partitions_docs | If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
under_replicated_partitions_filter_override | "" | No | |
under_replicated_partitions_alerting_enabled | True | No | |
under_replicated_partitions_require_full_window | True | No | |
under_replicated_partitions_priority | 3 | No | Number from 1 (high) to 5 (low). |
The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):
- produce: requests from producers to send data
- fetch-consumer: requests from consumers to get new data
- fetch-follower: requests from brokers that are the followers of a partition to get new data
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems
Query:
avg(last_30m):avg:kafka.request.fetch_consumer.time.avg{tag:xxx} > 1000
variable | default | required | description |
---|---|---|---|
unusual_consumer_fetch_time_enabled | True | No | |
unusual_consumer_fetch_time_warning | 800 | No | |
unusual_consumer_fetch_time_critical | 1000 | No | |
unusual_consumer_fetch_time_evaluation_period | last_30m | No | |
unusual_consumer_fetch_time_note | "" | No | |
unusual_consumer_fetch_time_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems | No | |
unusual_consumer_fetch_time_filter_override | "" | No | |
unusual_consumer_fetch_time_alerting_enabled | True | No | |
unusual_consumer_fetch_time_require_full_window | True | No | |
unusual_consumer_fetch_time_priority | 3 | No | Number from 1 (high) to 5 (low). |
The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):
- produce: requests from producers to send data
- fetch-consumer: requests from consumers to get new data
- fetch-follower: requests from brokers that are the followers of a partition to get new data
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems
This monitor checks whether there is an unusually high number of failures, which might indicate that the application is not able to perform its task.
Query:
avg(last_30m):avg:kafka.request.fetch.failed.rate{tag:xxx} > 50
variable | default | required | description |
---|---|---|---|
unusual_fetch_failures_enabled | True | No | |
unusual_fetch_failures_warning | 40 | No | |
unusual_fetch_failures_critical | 50 | No | |
unusual_fetch_failures_evaluation_period | last_30m | No | |
unusual_fetch_failures_note | "" | No | |
unusual_fetch_failures_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
unusual_fetch_failures_filter_override | "" | No | |
unusual_fetch_failures_alerting_enabled | True | No | |
unusual_fetch_failures_require_full_window | True | No | |
unusual_fetch_failures_priority | 3 | No | Number from 1 (high) to 5 (low). |
The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):
- produce: requests from producers to send data
- fetch-consumer: requests from consumers to get new data
- fetch-follower: requests from brokers that are the followers of a partition to get new data
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems
Query:
avg(last_30m):avg:kafka.request.fetch_follower.time.avg{tag:xxx} > 1000
variable | default | required | description |
---|---|---|---|
unusual_follower_fetch_time_enabled | True | No | |
unusual_follower_fetch_time_warning | 800 | No | |
unusual_follower_fetch_time_critical | 1000 | No | |
unusual_follower_fetch_time_evaluation_period | last_30m | No | |
unusual_follower_fetch_time_note | "" | No | |
unusual_follower_fetch_time_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems | No | |
unusual_follower_fetch_time_filter_override | "" | No | |
unusual_follower_fetch_time_alerting_enabled | True | No | |
unusual_follower_fetch_time_require_full_window | True | No | |
unusual_follower_fetch_time_priority | 3 | No | Number from 1 (high) to 5 (low). |
The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):
- produce: requests from producers to send data
- fetch-consumer: requests from consumers to get new data
- fetch-follower: requests from brokers that are the followers of a partition to get new data
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems
This monitor checks whether there is an unusually high number of failures, which might indicate that the application is not able to perform its task.
Query:
avg(last_30m):avg:kafka.request.produce.failed.rate{tag:xxx} > 50
variable | default | required | description |
---|---|---|---|
unusual_produce_failures_enabled | True | No | |
unusual_produce_failures_warning | 40 | No | |
unusual_produce_failures_critical | 50 | No | |
unusual_produce_failures_evaluation_period | last_30m | No | |
unusual_produce_failures_note | "" | No | |
unusual_produce_failures_docs | The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request): produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
unusual_produce_failures_filter_override | "" | No | |
unusual_produce_failures_alerting_enabled | True | No | |
unusual_produce_failures_require_full_window | True | No | |
unusual_produce_failures_priority | 3 | No | Number from 1 (high) to 5 (low). |
The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):
- produce: requests from producers to send data
- fetch-consumer: requests from consumers to get new data
- fetch-follower: requests from brokers that are the followers of a partition to get new data
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems
This monitor checks whether produce requests are taking unusually long to complete, which might indicate that the application is not able to perform its task.
Query:
avg(last_30m):avg:kafka.request.produce.time.avg{tag:xxx} > 20
variable | default | required | description |
---|---|---|---|
unusual_produce_time_enabled | True | No | |
unusual_produce_time_warning | 10 | No | |
unusual_produce_time_critical | 20 | No | |
unusual_produce_time_evaluation_period | last_30m | No | |
unusual_produce_time_note | "" | No | |
unusual_produce_time_docs | The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request): produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
unusual_produce_time_filter_override | "" | No | |
unusual_produce_time_alerting_enabled | True | No | |
unusual_produce_time_require_full_window | True | No | |
unusual_produce_time_priority | 3 | No | Number from 1 (high) to 5 (low). |
variable | default | required | description |
---|---|---|---|
env | | Yes | |
service | Kafka | No | |
notification_channel | | Yes | |
additional_tags | [] | No | |
filter_str | | Yes | |
is_hosted_service | False | No | |
locked | True | No | |
system_notification_channel_override | None | No | |
name_prefix | "" | No | Can be used to add a prefix to the monitor name |
name_suffix | "" | No | Can be used to add a suffix to the monitor name |
priority_offset | 0 | No | For non-production workloads we can add 1 to the priorities |
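To tie the module-level inputs above together, here is a hedged sketch of a non-production deployment. The channel handle, environment name, tags, and filter string are illustrative assumptions, not values from this repository:

```terraform
module "kafka_staging" {
  source               = "kabisa/kafka/datadog"
  notification_channel = "@slack-staging-alerts" # illustrative notification handle
  service              = "ServiceX"
  env                  = "stg"
  filter_str           = "service:kafka,env:stg" # illustrative Datadog filter string

  # Optional module-level inputs from the table above.
  additional_tags = ["team:platform"] # illustrative extra tag
  name_prefix     = "stg "
  locked          = false # allow edits in the Datadog UI (locked defaults to true)
  priority_offset = 1     # +1 on every monitor priority for non-production workloads
}
```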