Monitoring

Joe Kralicky edited this page Jan 26, 2023 · 11 revisions

Monitoring Plugin

Plugin Modes

The monitoring plugin has both gateway and agent modes.

  • The gateway mode is responsible for creating/updating/deleting the MonitoringCluster custom resource, which the Opni Manager reconciles to deploy the various Cortex components in the upstream cluster.
  • The agent mode is responsible for creating/updating/deleting a Prometheus custom resource, which the (currently external) Prometheus Operator reconciles to deploy a Prometheus agent-mode instance in the downstream cluster.

Metrics Capability

The gateway-side plugin (or "capability backend") and the agent-side plugin (or "capability node") together form the logical metrics capability, which can be "installed" onto clusters by adding the capability name to a list in the cluster's metadata. Adding the capability to a cluster changes the desired state for that node and then re-syncs the node (see Desired State Synchronization below).
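As a rough illustration of "installing" a capability by adding its name to a list in the cluster's metadata, here is a minimal Go sketch. The Cluster type, its Capabilities field, and the InstallCapability helper are hypothetical names chosen for this example and are not taken from the Opni codebase.

```go
package main

import "fmt"

// Cluster is a hypothetical stand-in for a cluster record. The real Opni
// type and its field names may differ.
type Cluster struct {
	ID           string
	Capabilities []string // capability names listed in the cluster's metadata
}

// HasCapability reports whether name is already listed on the cluster.
func (c *Cluster) HasCapability(name string) bool {
	for _, existing := range c.Capabilities {
		if existing == name {
			return true
		}
	}
	return false
}

// InstallCapability adds name to the cluster's capability list and reports
// whether the list changed. In this sketch, a change is the event that would
// alter the node's desired state and prompt a re-sync.
func InstallCapability(c *Cluster, name string) bool {
	if c.HasCapability(name) {
		return false
	}
	c.Capabilities = append(c.Capabilities, name)
	return true
}

func main() {
	c := &Cluster{ID: "downstream-1"}
	fmt.Println(InstallCapability(c, "metrics")) // true: list changed
	fmt.Println(InstallCapability(c, "metrics")) // false: already installed
}
```

Making the helper idempotent means that repeating an install request does not trigger a redundant re-sync.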

APIs

CortexAdmin

The CortexAdmin API is a management API extension that allows programmatic access to a selection of useful Cortex APIs. These APIs are mostly used for obtaining status and diagnostics, debugging and troubleshooting, or performing uncommon administrative tasks.

CortexOps

The CortexOps API is a management API extension that is used by the CLI and admin dashboard to create, update, or delete the upstream Cortex cluster by reconciling a MonitoringCluster custom resource, which is controlled by the Opni Manager.

Note: the mechanism for deploying the gateway-side components is currently left as an implementation detail for each backend. However, a suitable generic set of APIs will likely be added to the management API in the future to control the upstream capability lifecycle.

Desired State Synchronization

Changing the upstream configuration using the CortexOps API or installing/uninstalling the metrics capability on a cluster using the management API changes the desired state of the associated nodes. When the metrics backend detects that a node (or nodes) no longer matches the current desired state, it will send a SyncNow message to the affected node(s). This triggers each node to send a Sync request to the backend, which then responds with a configuration describing the desired state for that node's resources. The node then reconciles its resources to match the desired state.

On startup, the node will send a Sync request to the backend to obtain the initial desired state as soon as it connects to the stream.
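The exchange described above (a desired-state change, a SyncNow message, a Sync request, then reconciliation) can be sketched in Go as follows. This is a simplified in-memory model under stated assumptions: the Backend, Node, and NodeConfig types and all of their methods are hypothetical, and plain method calls stand in for the real stream messages.

```go
package main

import "fmt"

// NodeConfig is a hypothetical desired-state payload returned by Sync.
type NodeConfig struct {
	Revision int
	Enabled  bool
}

// Node is a hypothetical capability node that tracks what it last applied.
type Node struct {
	ID      string
	Applied NodeConfig
}

// Backend is a hypothetical capability backend holding desired state per node.
type Backend struct {
	desired map[string]NodeConfig
	nodes   map[string]*Node // connected nodes
}

func NewBackend() *Backend {
	return &Backend{
		desired: map[string]NodeConfig{},
		nodes:   map[string]*Node{},
	}
}

// Sync answers a node's Sync request with the desired state for that node.
func (b *Backend) Sync(nodeID string) NodeConfig {
	return b.desired[nodeID]
}

// SetDesired records a new desired state. If the node is connected and its
// applied state no longer matches, the backend sends it SyncNow.
func (b *Backend) SetDesired(nodeID string, enabled bool) {
	cfg := b.desired[nodeID]
	cfg.Revision++
	cfg.Enabled = enabled
	b.desired[nodeID] = cfg
	if n, ok := b.nodes[nodeID]; ok && n.Applied != cfg {
		n.SyncNow(b)
	}
}

// Connect models startup: the node registers, then syncs immediately.
func (n *Node) Connect(b *Backend) {
	b.nodes[n.ID] = n
	n.SyncNow(b)
}

// SyncNow makes the node send a Sync request and reconcile to the response.
// Reconciliation is reduced to a single assignment in this sketch.
func (n *Node) SyncNow(b *Backend) {
	n.Applied = b.Sync(n.ID)
}

func main() {
	b := NewBackend()
	n := &Node{ID: "agent-1"}
	n.Connect(b)
	fmt.Printf("%+v\n", n.Applied) // {Revision:0 Enabled:false}
	b.SetDesired("agent-1", true)
	fmt.Printf("%+v\n", n.Applied) // {Revision:1 Enabled:true}
}
```

Note that the backend never pushes configuration directly in this model. It only notifies the node, and the node pulls its desired state, so the startup path and the change-driven path share the same Sync handler.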

The backend may also send periodic sync requests at a low frequency (on the order of minutes) to help recover from issues such as accidental modification of managed resources.
