Skip to content

kabisa/terraform-datadog-kong

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Datadog

Terraform module for Datadog Kong

This module can be used to monitor the Kong system, modules have been build based on advice we've got from Kong. Not in scope (yet?) are monitors for the applications hosted behind Kong. It also has the option to monitor the docker containers with the embedded docker module

This module is part of a larger suite of modules that provide alerts in Datadog. Other modules can be found on the Terraform Registry

We have two base modules we use to standardise development of our Monitor Modules:

Modules are generated with this tool: https://github.com/kabisa/datadog-terraform-generator

Example Usage

module "kong" {
  source = "kabisa/kong/datadog"

  notification_channel = "@[email protected]"
  service              = "Kong"
  env                  = "prd"
  filter_str           = "cluster_name:kong"
  service_check_include_tags = [
    "cluster_name:kong",
  ]

  runs_in_k8s                         = true
  docker_container_monitoring_enabled = true
}

Module Variables

Monitors:

Monitor name Default enabled Priority Query
Active Connections High True 3 avg(last_10m):sum:kong.connections_active{tag:xxx} by {cluster_name} > 2000
Active Connections Low True 3 avg(last_10m):sum:kong.connections_active{tag:xxx} by {cluster_name} < 2
Connections Waiting High True 3 avg(last_10m):sum:kong.connections_waiting{tag:xxx} by {cluster_name} > 1000
Db Connectivity True 3 avg(last_10m):avg:network.tcp.can_connect{tag:xxx} by {host,cluster_name} <= 0.5
Disk Space True 2 avg(${var.disk_free_percent_evaluation_period}):100 * min:system.disk.free{${local.disk_free_percent_filter}} by {host,device} / min:system.disk.total{${local.disk_free_percent_filter}} by {host,device} < ${var.disk_free_percent_critical}
Dns Response Time Google True 3 avg(last_5m):avg:dns.response_time{tag:xxx} by {host,cluster_name} > 2
Dns Response Time True 3 avg(last_5m):avg:dns.response_time{tag:xxx} by {host} > 1
Error Response Percentage True 3 sum(last_30m):${local.error_count} * 100 / (${local.success_count} + ${local.error_count} + ${var.error_response_percentage_offset}) > 3
Ntp Drift True 2 avg(last_15m):abs(avg:ntp.offset{tag:xxx} by {host,cluster_name}) > 30
Swap Percent Free True 3 avg(last_5m):min:system.swap.pct_free{tag:xxx} by {host,cluster_name} * 100 < 10
System Entropy True 3 avg(last_15m):min:system.entropy.available{tag:xxx} by {host,cluster_name} < 250

Getting started developing

pre-commit was used to do Terraform linting and validating.

Steps:

  • Install pre-commit. E.g. brew install pre-commit.
  • Run pre-commit install in this repo. (Every time you clone a repo with pre-commit enabled you will need to run the pre-commit install command)
  • That’s it! Now every time you commit a code change (.tf file), the hooks in the hooks: config .pre-commit-config.yaml will execute.

Active Connections High

Query:

avg(last_10m):sum:kong.connections_active{tag:xxx} by {cluster_name} > 2000
variable default required description
active_connections_high_enabled True No
active_connections_high_warning 1500 No
active_connections_high_critical 2000 No
active_connections_high_evaluation_period last_10m No
active_connections_high_note "" No
active_connections_high_docs "" No
active_connections_high_filter_override "" No
active_connections_high_alerting_enabled True No
active_connections_high_no_data_timeframe None No
active_connections_high_notify_no_data False No
active_connections_high_ok_threshold None No
active_connections_high_name_prefix "" No
active_connections_high_name_suffix "" No
active_connections_high_priority 3 No Number from 1 (high) to 5 (low).

Active Connections Low

Query:

avg(last_10m):sum:kong.connections_active{tag:xxx} by {cluster_name} < 2
variable default required description
active_connections_low_enabled True No
active_connections_low_warning 5 No
active_connections_low_critical 2 No
active_connections_low_evaluation_period last_10m No
active_connections_low_note "" No
active_connections_low_docs "" No
active_connections_low_filter_override "" No
active_connections_low_alerting_enabled True No
active_connections_low_no_data_timeframe None No
active_connections_low_notify_no_data False No
active_connections_low_ok_threshold None No
active_connections_low_name_prefix "" No
active_connections_low_name_suffix "" No
active_connections_low_priority 3 No Number from 1 (high) to 5 (low).

Connections Waiting High

Query:

avg(last_10m):sum:kong.connections_waiting{tag:xxx} by {cluster_name} > 1000
variable default required description
connections_waiting_high_enabled True No
connections_waiting_high_warning 750 No
connections_waiting_high_critical 1000 No
connections_waiting_high_evaluation_period last_10m No
connections_waiting_high_note "" No
connections_waiting_high_docs "" No
connections_waiting_high_filter_override "" No
connections_waiting_high_alerting_enabled True No
connections_waiting_high_no_data_timeframe None No
connections_waiting_high_notify_no_data False No
connections_waiting_high_ok_threshold None No
connections_waiting_high_name_prefix "" No
connections_waiting_high_name_suffix "" No
connections_waiting_high_priority 3 No Number from 1 (high) to 5 (low).

Datadog Agent

Returns CRITICAL if Datadog is agent is not running or reporting data for %d minutes

variable default required description
datadog_agent_enabled True No
datadog_agent_critical 1 No
datadog_agent_evaluation_period last(2) No
datadog_agent_priority 2 No
datadog_agent_note "" No
datadog_agent_docs Returns CRITICAL if Datadog is agent is not running or reporting data for %d minutes No
datadog_agent_filter_override "" No
datadog_agent_alerting_enabled True No
datadog_agent_no_data_timeframe 2 No Timeframe in minutes after which we send alerts if a host is unavailable

Db Connectivity

Query:

avg(last_10m):avg:network.tcp.can_connect{tag:xxx} by {host,cluster_name} <= 0.5
variable default required description
db_connectivity_enabled True No
db_connectivity_warning None No
db_connectivity_critical 0.5 No
db_connectivity_evaluation_period last_10m No
db_connectivity_note "" No
db_connectivity_docs "" No
db_connectivity_filter_override "" No
db_connectivity_alerting_enabled True No
db_connectivity_no_data_timeframe None No
db_connectivity_notify_no_data False No
db_connectivity_ok_threshold None No
db_connectivity_name_prefix "" No
db_connectivity_name_suffix "" No
db_connectivity_priority 3 No Number from 1 (high) to 5 (low).

Disk Space

Query:

avg(${var.disk_free_percent_evaluation_period}):100 * min:system.disk.free{${local.disk_free_percent_filter}} by {host,device} / min:system.disk.total{${local.disk_free_percent_filter}} by {host,device} < ${var.disk_free_percent_critical}
variable default required description
disk_free_percent_enabled True No
disk_free_percent_warning 20 No
disk_free_percent_critical 15 No
disk_free_percent_evaluation_period last_15m No
disk_free_percent_note "" No
disk_free_percent_docs "" No
disk_free_percent_filter_override "" No
disk_free_percent_alerting_enabled True No
disk_free_percent_no_data_timeframe None No
disk_free_percent_notify_no_data False No
disk_free_percent_ok_threshold None No
disk_free_percent_name_prefix "" No
disk_free_percent_name_suffix "" No
disk_free_percent_priority 2 No Number from 1 (high) to 5 (low).

Dns Response Time Google

Query:

avg(last_5m):avg:dns.response_time{tag:xxx} by {host,cluster_name} > 2
variable default required description
dns_response_time_google_enabled True No
dns_response_time_google_warning 1.5 No
dns_response_time_google_critical 2 No
dns_response_time_google_evaluation_period last_5m No
dns_response_time_google_note "" No
dns_response_time_google_docs "" No
dns_response_time_google_filter_override "" No
dns_response_time_google_alerting_enabled True No
dns_response_time_google_no_data_timeframe None No
dns_response_time_google_notify_no_data False No
dns_response_time_google_ok_threshold None No
dns_response_time_google_name_prefix "" No
dns_response_time_google_name_suffix "" No
dns_response_time_google_priority 3 No Number from 1 (high) to 5 (low).

Dns Response Time

Query:

avg(last_5m):avg:dns.response_time{tag:xxx} by {host} > 1
variable default required description
dns_response_time_enabled True No
dns_response_time_warning 0.75 No
dns_response_time_critical 1 No
dns_response_time_evaluation_period last_5m No
dns_response_time_note "" No
dns_response_time_docs "" No
dns_response_time_filter_override "" No
dns_response_time_alerting_enabled True No
dns_response_time_no_data_timeframe None No
dns_response_time_notify_no_data False No
dns_response_time_ok_threshold None No
dns_response_time_name_prefix "" No
dns_response_time_name_suffix "" No
dns_response_time_priority 3 No Number from 1 (high) to 5 (low).

Error Response Percentage

Query:

sum(last_30m):${local.error_count} * 100 / (${local.success_count} + ${local.error_count} + ${var.error_response_percentage_offset}) > 3
variable default required description
error_response_percentage_offset 100 No An off set to make sure if only a single requests gets in we don't immediately compute error rate to 100%. The formula is: error_count * 100 / (error_count + success_count + offset)
error_response_percentage_enabled True No
error_response_percentage_warning 2 No
error_response_percentage_critical 3 No
error_response_percentage_evaluation_period last_30m No
error_response_percentage_note "" No
error_response_percentage_docs "" No
error_response_percentage_filter_override "" No
error_response_percentage_alerting_enabled True No
error_response_percentage_no_data_timeframe None No
error_response_percentage_notify_no_data False No
error_response_percentage_ok_threshold None No
error_response_percentage_name_prefix "" No
error_response_percentage_name_suffix "" No
error_response_percentage_priority 3 No Number from 1 (high) to 5 (low).

Integration

Returns CRITICAL if Datadog is unable to connect to the monitored Kong instance(s). Returns OK otherwise.

variable default required description
integration_enabled True No
integration_critical 1 No
integration_evaluation_period last(2) No
integration_priority 2 No
integration_note "" No
integration_docs Returns CRITICAL if Datadog is unable to connect to the monitored Kong instance(s). Returns OK otherwise. No
integration_include_tags_override None No
integration_exclude_tags_override None No
integration_alerting_enabled True No

Ntp Drift

Query:

avg(last_15m):abs(avg:ntp.offset{tag:xxx} by {host,cluster_name}) > 30
variable default required description
ntp_drift_enabled True No
ntp_drift_warning 20 No
ntp_drift_critical 30 No
ntp_drift_evaluation_period last_15m No
ntp_drift_note "" No
ntp_drift_docs "" No
ntp_drift_filter_override "" No
ntp_drift_alerting_enabled True No
ntp_drift_no_data_timeframe None No
ntp_drift_notify_no_data False No
ntp_drift_ok_threshold None No
ntp_drift_name_prefix "" No
ntp_drift_name_suffix "" No
ntp_drift_priority 2 No Number from 1 (high) to 5 (low).

Swap Percent Free

Query:

avg(last_5m):min:system.swap.pct_free{tag:xxx} by {host,cluster_name} * 100 < 10
variable default required description
swap_percent_free_enabled True No
swap_percent_free_warning 25 No
swap_percent_free_critical 10 No
swap_percent_free_evaluation_period last_5m No
swap_percent_free_note "" No
swap_percent_free_docs "" No
swap_percent_free_filter_override "" No
swap_percent_free_alerting_enabled True No
swap_percent_free_no_data_timeframe None No
swap_percent_free_notify_no_data False No
swap_percent_free_ok_threshold None No
swap_percent_free_name_prefix "" No
swap_percent_free_name_suffix "" No
swap_percent_free_priority 3 No Number from 1 (high) to 5 (low).

System Entropy

Entropy is the randomness collected by an operating system or application for use in cryptography or other uses that require random data. This randomness is often collected from hardware sources (variance in fan noise or HDD), either pre-existing ones such as mouse movements or specially provided randomness generators. A lack of entropy can have a negative impact on performance and security.

Query:

avg(last_15m):min:system.entropy.available{tag:xxx} by {host,cluster_name} < 250
variable default required description
system_entropy_enabled True No
system_entropy_warning 500 No
system_entropy_critical 250 No
system_entropy_evaluation_period last_15m No
system_entropy_note "" No
system_entropy_docs Entropy is the randomness collected by an operating system or application for use in cryptography or other uses that require random data. This randomness is often collected from hardware sources (variance in fan noise or HDD), either pre-existing ones such as mouse movements or specially provided randomness generators. A lack of entropy can have a negative impact on performance and security. No
system_entropy_filter_override "" No
system_entropy_alerting_enabled True No
system_entropy_no_data_timeframe None No
system_entropy_notify_no_data False No
system_entropy_ok_threshold None No
system_entropy_name_prefix "" No
system_entropy_name_suffix "" No
system_entropy_priority 3 No Number from 1 (high) to 5 (low).

Module Variables

variable default required description
filter_str Yes
env Yes
service Kong No
notification_channel Yes
additional_tags [] No
locked False No
name_prefix "" No
name_suffix "" No
service_check_include_tags Yes Tags to be included in the "over" section of a service check query
service_check_exclude_tags [] No Tags to be included in the "exclude" section of a service check query
docker_container_monitoring_enabled Yes
docker_filter_str None No
runs_in_k8s Yes Determines whether or not we also do group by on cluster_name
priority_offset 0 No For non production workloads we can +1 on the priorities