Terraform module for Datadog Kafka

This module contains both Kafka-specific monitors and the system monitors from our system module.

This module is part of a larger suite of modules that provide alerts in Datadog. The other modules can be found on the Terraform Registry.

We use two base modules to standardise the development of our monitor modules.

Modules are generated with this tool: https://github.com/kabisa/datadog-terraform-generator

Example Usage

module "kafka" {
  source = "kabisa/kafka/datadog"

  notification_channel = "[email protected]"
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"
}

Module Variables

Monitors:

| Monitor name | Default enabled | Priority | Query |
| --- | --- | --- | --- |
| Bytesin High | True | 3 | avg(last_30m):avg:kafka.net.bytes_in.rate{tag:xxx} by {host} > 5000000 |
| Bytesout_high | True | 3 | avg(last_30m):avg:kafka.net.bytes_out.rate{tag:xxx} by {host} > 5000000 |
| Fetch_purgatory_size | True | 3 | avg(last_30m):max:kafka.request.fetch_request_purgatory.size{tag:xxx} by {host} > 100000 |
| In_sync_nodes_dropped | True | 2 | avg(last_5m):max:kafka.replication.isr_shrinks.rate{tag:xxx} by {aiven-service} - max:kafka.replication.isr_expands.rate{tag:xxx} by {aiven-service} > |
| Leader_election_occurring | True | 3 | max(last_5m):avg:kafka.replication.leader_elections.rate{tag:xxx} by {aiven-service} > |
| Multiple_active_controllers | True | 2 | avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} > 1 |
| No_active_controllers | True | 2 | avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} < 1 |
| Offline_partitions | True | 2 | avg(last_5m):max:kafka.replication.offline_partitions_count{tag:xxx} > |
| Produce_purgatory_size | True | 3 | avg(last_15m):max:kafka.request.producer_request_purgatory.size{tag:xxx} > 200 |
| Unclean_leader_election | True | 2 | avg(last_5m):max:kafka.replication.unclean_leader_elections.rate{tag:xxx} by {aiven-project} > |
| Under_replicated_partitions | True | 3 | avg(last_15m):avg:kafka.replication.under_replicated_partitions{tag:xxx} by {aiven-service} > |
| Unusual_consumer_fetch_time | True | 3 | avg(last_30m):avg:kafka.request.fetch_consumer.time.avg{tag:xxx} > 1000 |
| Unusual_fetch_failures | True | 3 | avg(last_30m):avg:kafka.request.fetch.failed.rate{tag:xxx} > 50 |
| Unusual_follower_fetch_time | True | 3 | avg(last_30m):avg:kafka.request.fetch_follower.time.avg{tag:xxx} > 1000 |
| Unusual_produce_failures | True | 3 | avg(last_30m):avg:kafka.request.produce.failed.rate{tag:xxx} > 50 |
| Unusual_produce_time | True | 3 | avg(last_30m):avg:kafka.request.produce.time.avg{tag:xxx} > 20 |

Getting started developing

pre-commit is used for Terraform linting and validation.

Steps:

  • Install pre-commit, e.g. brew install pre-commit.
  • Run pre-commit install in this repo. (Every time you clone a repo with pre-commit enabled, you will need to run pre-commit install again.)
  • That’s it! Now every time you commit a change to a .tf file, the hooks configured in .pre-commit-config.yaml will run.

Bytesin High

NOTE: This is based on a baseline and might need adjusting further down the road.

Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_30m):avg:kafka.net.bytes_in.rate{tag:xxx} by {host} > 5000000

| variable | default | required | description |
| --- | --- | --- | --- |
| bytesin_high_enabled | True | No | |
| bytesin_high_warning | 2500000 | No | |
| bytesin_high_critical | 5000000 | No | |
| bytesin_high_evaluation_period | last_30m | No | |
| bytesin_high_note | "" | No | |
| bytesin_high_docs | NOTE: This is based on a baseline and might need adjusting further down the road. Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| bytesin_high_filter_override | "" | No | |
| bytesin_high_alerting_enabled | True | No | |
| bytesin_high_priority | 3 | No | Number from 1 (high) to 5 (low). |
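
The per-monitor variables above can be set on the module block to tune or disable this monitor. A minimal sketch, following the Example Usage above; the notification channel and threshold values are illustrative, not recommendations:

module "kafka" {
  source = "kabisa/kafka/datadog"

  notification_channel = "@slack-kafka-alerts" # illustrative channel
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"

  # Raise the byte-in thresholds documented above
  bytesin_high_warning  = 4000000
  bytesin_high_critical = 8000000

  # Or silence / disable this monitor entirely
  # bytesin_high_alerting_enabled = false
  # bytesin_high_enabled          = false
}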

Bytesout_high

NOTE: This is based on a baseline and might need adjusting further down the road.

Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_30m):avg:kafka.net.bytes_out.rate{tag:xxx} by {host} > 5000000

| variable | default | required | description |
| --- | --- | --- | --- |
| bytesout_high_enabled | True | No | |
| bytesout_high_warning | 2500000 | No | |
| bytesout_high_critical | 5000000 | No | |
| bytesout_high_evaluation_period | last_30m | No | |
| bytesout_high_note | "" | No | |
| bytesout_high_docs | NOTE: This is based on a baseline and might need adjusting further down the road. Generally, disk throughput tends to be the main bottleneck in Kafka performance. However, that’s not to say that the network is never a bottleneck. Network throughput can affect Kafka’s performance if you are sending messages across data centers, if your topics have a large number of consumers, or if your replicas are catching up to their leaders. Tracking network throughput on your brokers gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| bytesout_high_filter_override | "" | No | |
| bytesout_high_alerting_enabled | True | No | |
| bytesout_high_priority | 3 | No | Number from 1 (high) to 5 (low). |

Fetch_purgatory_size

The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. Fetch requests are added to purgatory if there is not enough data to fulfill the request (fetch.min.bytes on consumers), until the time specified by fetch.wait.max.ms is reached or enough data becomes available.

Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_30m):max:kafka.request.fetch_request_purgatory.size{tag:xxx} by {host} > 100000

| variable | default | required | description |
| --- | --- | --- | --- |
| fetch_purgatory_size_enabled | True | No | |
| fetch_purgatory_size_warning | 70000 | No | |
| fetch_purgatory_size_critical | 100000 | No | |
| fetch_purgatory_size_evaluation_period | last_30m | No | |
| fetch_purgatory_size_note | "" | No | |
| fetch_purgatory_size_docs | The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. Fetch requests are added to purgatory if there is not enough data to fulfill the request (fetch.min.bytes on consumers) until the time specified by fetch.wait.max.ms is reached or enough data becomes available. Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| fetch_purgatory_size_filter_override | "" | No | |
| fetch_purgatory_size_alerting_enabled | True | No | |
| fetch_purgatory_size_priority | 3 | No | Number from 1 (high) to 5 (low). |

In_sync_nodes_dropped

The number of in-sync replicas (ISRs) for a particular partition should remain fairly static, except when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica could be removed from the ISR pool if it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_5m):max:kafka.replication.isr_shrinks.rate{tag:xxx} by {aiven-service} - max:kafka.replication.isr_expands.rate{tag:xxx} by {aiven-service} > 

| variable | default | required | description |
| --- | --- | --- | --- |
| in_sync_nodes_dropped_enabled | True | No | |
| in_sync_nodes_dropped_warning | 0 | No | |
| in_sync_nodes_dropped_critical | 0 | No | |
| in_sync_nodes_dropped_evaluation_period | last_5m | No | |
| in_sync_nodes_dropped_note | "" | No | |
| in_sync_nodes_dropped_docs | The number of in-sync replicas (ISRs) for a particular partition should remain fairly static, except when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica could be removed from the ISR pool if it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). You should investigate any flapping in the values of these metrics, and any increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| in_sync_nodes_dropped_filter_override | "" | No | |
| in_sync_nodes_dropped_alerting_enabled | True | No | |
| in_sync_nodes_dropped_priority | 2 | No | Number from 1 (high) to 5 (low). |

Leader_election_occurring

When a partition leader dies, an election for a new leader is triggered. A partition leader is considered “dead” if it fails to maintain its session with ZooKeeper. Unlike ZooKeeper’s Zab, Kafka does not employ a majority-consensus algorithm for leadership election. Instead, Kafka’s quorum is composed of the set of all in-sync replicas (ISRs) for a particular partition. Replicas are considered in-sync if they are caught-up to the leader, which means that any replica in the ISR can be promoted to the leader.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

max(last_5m):avg:kafka.replication.leader_elections.rate{tag:xxx} by {aiven-service} > 

| variable | default | required | description |
| --- | --- | --- | --- |
| leader_election_occurring_enabled | True | No | |
| leader_election_occurring_warning | 0 | No | |
| leader_election_occurring_critical | 0 | No | |
| leader_election_occurring_evaluation_period | last_5m | No | |
| leader_election_occurring_note | "" | No | |
| leader_election_occurring_docs | When a partition leader dies, an election for a new leader is triggered. A partition leader is considered “dead” if it fails to maintain its session with ZooKeeper. Unlike ZooKeeper’s Zab, Kafka does not employ a majority-consensus algorithm for leadership election. Instead, Kafka’s quorum is composed of the set of all in-sync replicas (ISRs) for a particular partition. Replicas are considered in-sync if they are caught-up to the leader, which means that any replica in the ISR can be promoted to the leader. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| leader_election_occurring_filter_override | "" | No | |
| leader_election_occurring_alerting_enabled | True | No | |
| leader_election_occurring_require_full_window | False | No | |
| leader_election_occurring_priority | 3 | No | Number from 1 (high) to 5 (low). |

Multiple_active_controllers

The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} > 1

| variable | default | required | description |
| --- | --- | --- | --- |
| multiple_active_controllers_enabled | True | No | |
| multiple_active_controllers_warning | 0 | No | |
| multiple_active_controllers_critical | 1 | No | |
| multiple_active_controllers_evaluation_period | last_5m | No | |
| multiple_active_controllers_note | "" | No | |
| multiple_active_controllers_docs | The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| multiple_active_controllers_filter_override | "" | No | |
| multiple_active_controllers_alerting_enabled | True | No | |
| multiple_active_controllers_require_full_window | True | No | |
| multiple_active_controllers_priority | 2 | No | Number from 1 (high) to 5 (low). |

No_active_controllers

The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_5m):max:kafka.replication.active_controller_count{tag:xxx} by {aiven-project} < 1

| variable | default | required | description |
| --- | --- | --- | --- |
| no_active_controllers_enabled | True | No | |
| no_active_controllers_warning | 0 | No | |
| no_active_controllers_critical | 1 | No | |
| no_active_controllers_evaluation_period | last_5m | No | |
| no_active_controllers_note | "" | No | |
| no_active_controllers_docs | The first node to boot in a Kafka cluster automatically becomes the controller, and there can be only one. The controller in a Kafka cluster is responsible for maintaining the list of partition leaders, and coordinating leadership transitions (in the event a partition leader becomes unavailable). If it becomes necessary to replace the controller, ZooKeeper chooses a new controller randomly from the pool of brokers. The sum of ActiveControllerCount across all of your brokers should always equal one, and you should alert on any other value that lasts for longer than one second. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| no_active_controllers_filter_override | "" | No | |
| no_active_controllers_alerting_enabled | True | No | |
| no_active_controllers_require_full_window | True | No | |
| no_active_controllers_priority | 2 | No | Number from 1 (high) to 5 (low). |

Offline_partitions

This metric reports the number of partitions without an active leader. Because all read and write operations are only performed on partition leaders, you should alert on a non-zero value for this metric to prevent service interruptions. Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_5m):max:kafka.replication.offline_partitions_count{tag:xxx} > 

| variable | default | required | description |
| --- | --- | --- | --- |
| offline_partitions_enabled | True | No | |
| offline_partitions_warning | 0 | No | |
| offline_partitions_critical | 0 | No | |
| offline_partitions_evaluation_period | last_5m | No | |
| offline_partitions_note | "" | No | |
| offline_partitions_docs | This metric reports the number of partitions without an active leader. Because all read and write operations are only performed on partition leaders, you should alert on a non-zero value for this metric to prevent service interruptions. Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| offline_partitions_filter_override | "" | No | |
| offline_partitions_alerting_enabled | True | No | |
| offline_partitions_require_full_window | True | No | |
| offline_partitions_priority | 2 | No | Number from 1 (high) to 5 (low). |

Produce_purgatory_size

The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. If request.required.acks=-1, all produce requests will end up in purgatory until the partition leader receives an acknowledgment from all followers.

Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_15m):max:kafka.request.producer_request_purgatory.size{tag:xxx} > 200

| variable | default | required | description |
| --- | --- | --- | --- |
| produce_purgatory_size_enabled | True | No | |
| produce_purgatory_size_warning | 100 | No | |
| produce_purgatory_size_critical | 200 | No | |
| produce_purgatory_size_evaluation_period | last_15m | No | |
| produce_purgatory_size_note | "" | No | |
| produce_purgatory_size_docs | The request purgatory serves as a temporary holding pen for produce and fetch requests waiting to be satisfied. If request.required.acks=-1, all produce requests will end up in purgatory until the partition leader receives an acknowledgment from all followers. Keeping an eye on the size of purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| produce_purgatory_size_filter_override | "" | No | |
| produce_purgatory_size_alerting_enabled | True | No | |
| produce_purgatory_size_require_full_window | True | No | |
| produce_purgatory_size_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unclean_leader_election

Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. Normally, when a broker that is the leader for a partition goes offline, a new leader is elected from the set of ISRs for the partition. Unclean leader election is disabled by default in Kafka version 0.11 and newer, meaning that a partition is taken offline if it does not have any ISRs to elect as the new leader. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. You should alert on this metric, as it signals data loss.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_5m):max:kafka.replication.unclean_leader_elections.rate{tag:xxx} by {aiven-project} > 

| variable | default | required | description |
| --- | --- | --- | --- |
| unclean_leader_election_enabled | True | No | |
| unclean_leader_election_warning | 100 | No | |
| unclean_leader_election_critical | 0 | No | |
| unclean_leader_election_evaluation_period | last_5m | No | |
| unclean_leader_election_note | "" | No | |
| unclean_leader_election_docs | Unclean leader elections occur when there is no qualified partition leader among Kafka brokers. Normally, when a broker that is the leader for a partition goes offline, a new leader is elected from the set of ISRs for the partition. Unclean leader election is disabled by default in Kafka version 0.11 and newer, meaning that a partition is taken offline if it does not have any ISRs to elect as the new leader. If Kafka is configured to allow an unclean leader election, a leader is chosen from the out-of-sync replicas, and any messages that were not synced prior to the loss of the former leader are lost forever. Essentially, unclean leader elections sacrifice consistency for availability. You should alert on this metric, as it signals data loss. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| unclean_leader_election_filter_override | "" | No | |
| unclean_leader_election_alerting_enabled | True | No | |
| unclean_leader_election_require_full_window | False | No | |
| unclean_leader_election_priority | 2 | No | Number from 1 (high) to 5 (low). |
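
Each monitor also exposes a filter override, a priority, and a free-form note. A sketch for this monitor, again following the Example Usage above; the filter value and note text are illustrative:

module "kafka" {
  source = "kabisa/kafka/datadog"

  notification_channel = "@slack-kafka-alerts" # illustrative channel
  service              = "ServiceX"
  env                  = "prd"
  filter_str           = "service:kafka"

  # Scope only this monitor to a single Aiven project and treat it as top priority
  unclean_leader_election_filter_override = "service:kafka,aiven-project:example-project" # illustrative scope
  unclean_leader_election_priority        = 1
  unclean_leader_election_note            = "Unclean leader elections imply data loss; investigate immediately."
}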

Under_replicated_partitions

If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/

Query:

avg(last_15m):avg:kafka.replication.under_replicated_partitions{tag:xxx} by {aiven-service} > 

| variable | default | required | description |
| --- | --- | --- | --- |
| under_replicated_partitions_enabled | True | No | |
| under_replicated_partitions_warning | 100 | No | |
| under_replicated_partitions_critical | 0 | No | |
| under_replicated_partitions_evaluation_period | last_15m | No | |
| under_replicated_partitions_note | "" | No | |
| under_replicated_partitions_docs | If a broker becomes unavailable, the value of UnderReplicatedPartitions will increase sharply. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/ | No | |
| under_replicated_partitions_filter_override | "" | No | |
| under_replicated_partitions_alerting_enabled | True | No | |
| under_replicated_partitions_require_full_window | True | No | |
| under_replicated_partitions_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unusual_consumer_fetch_time

The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):

  • produce: requests from producers to send data
  • fetch-consumer: requests from consumers to get new data
  • fetch-follower: requests from brokers that are the followers of a partition to get new data

Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems

Query:

avg(last_30m):avg:kafka.request.fetch_consumer.time.avg{tag:xxx} > 1000

| variable | default | required | description |
| --- | --- | --- | --- |
| unusual_consumer_fetch_time_enabled | True | No | |
| unusual_consumer_fetch_time_warning | 800 | No | |
| unusual_consumer_fetch_time_critical | 1000 | No | |
| unusual_consumer_fetch_time_evaluation_period | last_30m | No | |
| unusual_consumer_fetch_time_note | "" | No | |
| unusual_consumer_fetch_time_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems | No | |
| unusual_consumer_fetch_time_filter_override | "" | No | |
| unusual_consumer_fetch_time_alerting_enabled | True | No | |
| unusual_consumer_fetch_time_require_full_window | True | No | |
| unusual_consumer_fetch_time_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unusual_fetch_failures

The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):

  • produce: requests from producers to send data
  • fetch-consumer: requests from consumers to get new data
  • fetch-follower: requests from brokers that are the followers of a partition to get new data

Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems

This monitor checks whether there is an unusually high number of failed fetch requests, which might indicate that the application is unable to perform its task.

Query:

avg(last_30m):avg:kafka.request.fetch.failed.rate{tag:xxx} > 50

| variable | default | required | description |
| --- | --- | --- | --- |
| unusual_fetch_failures_enabled | True | No | |
| unusual_fetch_failures_warning | 40 | No | |
| unusual_fetch_failures_critical | 50 | No | |
| unusual_fetch_failures_evaluation_period | last_30m | No | |
| unusual_fetch_failures_note | "" | No | |
| unusual_fetch_failures_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
| unusual_fetch_failures_filter_override | "" | No | |
| unusual_fetch_failures_alerting_enabled | True | No | |
| unusual_fetch_failures_require_full_window | True | No | |
| unusual_fetch_failures_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unusual_follower_fetch_time

The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):

  • produce: requests from producers to send data
  • fetch-consumer: requests from consumers to get new data
  • fetch-follower: requests from brokers that are the followers of a partition to get new data

Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems

Query:

avg(last_30m):avg:kafka.request.fetch_follower.time.avg{tag:xxx} > 1000

| variable | default | required | description |
| --- | --- | --- | --- |
| unusual_follower_fetch_time_enabled | True | No | |
| unusual_follower_fetch_time_warning | 800 | No | |
| unusual_follower_fetch_time_critical | 1000 | No | |
| unusual_follower_fetch_time_evaluation_period | last_30m | No | |
| unusual_follower_fetch_time_note | "" | No | |
| unusual_follower_fetch_time_docs | produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems | No | |
| unusual_follower_fetch_time_filter_override | "" | No | |
| unusual_follower_fetch_time_alerting_enabled | True | No | |
| unusual_follower_fetch_time_require_full_window | True | No | |
| unusual_follower_fetch_time_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unusual_produce_failures

The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):

  • produce: requests from producers to send data
  • fetch-consumer: requests from consumers to get new data
  • fetch-follower: requests from brokers that are the followers of a partition to get new data

Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems

This monitor checks whether there is an unusually high number of failed produce requests, which might indicate that the application is unable to perform its task.

Query:

avg(last_30m):avg:kafka.request.produce.failed.rate{tag:xxx} > 50

| variable | default | required | description |
| --- | --- | --- | --- |
| unusual_produce_failures_enabled | True | No | |
| unusual_produce_failures_warning | 40 | No | |
| unusual_produce_failures_critical | 50 | No | |
| unusual_produce_failures_evaluation_period | last_30m | No | |
| unusual_produce_failures_note | "" | No | |
| unusual_produce_failures_docs | The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request): produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
| unusual_produce_failures_filter_override | "" | No | |
| unusual_produce_failures_alerting_enabled | True | No | |
| unusual_produce_failures_require_full_window | True | No | |
| unusual_produce_failures_priority | 3 | No | Number from 1 (high) to 5 (low). |

Unusual_produce_time

The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request):

  • produce: requests from producers to send data
  • fetch-consumer: requests from consumers to get new data
  • fetch-follower: requests from brokers that are the followers of a partition to get new data

Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems

This monitor checks whether produce times are unusually long, which might indicate that the application is unable to perform its task.

Query:

avg(last_30m):avg:kafka.request.produce.time.avg{tag:xxx} > 20

| variable | default | required | description |
| --- | --- | --- | --- |
| unusual_produce_time_enabled | True | No | |
| unusual_produce_time_warning | 10 | No | |
| unusual_produce_time_critical | 20 | No | |
| unusual_produce_time_evaluation_period | last_30m | No | |
| unusual_produce_time_note | "" | No | |
| unusual_produce_time_docs | The TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request): produce: requests from producers to send data; fetch-consumer: requests from consumers to get new data; fetch-follower: requests from brokers that are the followers of a partition to get new data. Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown. https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-totaltimems This monitor checks if there's an unusual high amount of failures. Which might be indicative of the application not being able to perform its task | No | |
| unusual_produce_time_filter_override | "" | No | |
| unusual_produce_time_alerting_enabled | True | No | |
| unusual_produce_time_require_full_window | True | No | |
| unusual_produce_time_priority | 3 | No | Number from 1 (high) to 5 (low). |

Module Variables

| variable | default | required | description |
| --- | --- | --- | --- |
| env | | Yes | |
| service | Kafka | No | |
| notification_channel | | Yes | |
| additional_tags | [] | No | |
| filter_str | | Yes | |
| is_hosted_service | False | No | |
| locked | True | No | |
| system_notification_channel_override | None | No | |
| name_prefix | "" | No | Can be used to add a prefix to the monitor name. |
| name_suffix | "" | No | Can be used to add a suffix to the monitor name. |
| priority_offset | 0 | No | For non-production workloads we can add +1 to the priorities. |
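
Combining these shared variables, a non-production deployment of the module might look like the following sketch (the channel, tag, and name values are illustrative):

module "kafka_staging" {
  source = "kabisa/kafka/datadog"

  notification_channel = "@slack-kafka-staging" # illustrative channel
  service              = "ServiceX"
  env                  = "stg"
  filter_str           = "service:kafka,env:stg"

  name_prefix     = "stg-"            # prefixes every monitor name
  additional_tags = ["team:platform"] # illustrative extra tag on every monitor
  priority_offset = 1                 # e.g. priority 2 monitors become priority 3
}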