Monitoring

Joe Kralicky edited this page Jan 31, 2023 · 11 revisions

Metrics Plugin

The metrics plugin implements metrics observability features for Opni. It provides a fully-managed Cortex cluster deployed in the central Opni cluster and Prometheus agents deployed in the downstream clusters.

Architecture

Plugin Modes

The metrics plugin has both gateway and agent modes.

  • The gateway mode is responsible for creating/updating/deleting the MonitoringCluster custom resource, which the Opni Manager reconciles to deploy the various Cortex components in the upstream cluster.
  • The agent mode is responsible for creating/updating/deleting a Prometheus custom resource, which the (currently external) Prometheus Operator reconciles to deploy a Prometheus agent-mode instance in the downstream cluster.

Metrics Capability

The gateway-side plugin (or "capability backend") and the agent-side plugin (or "capability node") form the logical metrics capability, which can be "installed" onto clusters by adding the capability name to a list in the cluster's metadata. Adding the capability to a cluster will change the desired state for that node, then re-sync the node (see below).
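As a rough illustration of "installing" a capability by adding its name to a list in the cluster's metadata, the following sketch models that list and reports whether a change occurred. The `Cluster` type and both function names are hypothetical and are not taken from the Opni codebase. A caller would treat a `true` return value as the cue to update the desired state and re-sync the node.

```go
package main

// Cluster is a hypothetical stand-in for a cluster record. Only the fields
// needed for this sketch are modeled.
type Cluster struct {
	ID           string
	Capabilities []string // capability names listed in the cluster's metadata
}

// InstallCapability adds name to the cluster's capability list if it is not
// already present. It reports whether the list changed.
func InstallCapability(c *Cluster, name string) bool {
	for _, existing := range c.Capabilities {
		if existing == name {
			return false // already installed, nothing to do
		}
	}
	c.Capabilities = append(c.Capabilities, name)
	return true
}

// UninstallCapability removes name from the cluster's capability list and
// reports whether it was present.
func UninstallCapability(c *Cluster, name string) bool {
	for i, existing := range c.Capabilities {
		if existing == name {
			c.Capabilities = append(c.Capabilities[:i], c.Capabilities[i+1:]...)
			return true
		}
	}
	return false
}
```

Both operations are idempotent in this sketch, so repeating a request does not trigger a redundant re-sync.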

APIs

  • cortexadmin.CortexAdmin

    The CortexAdmin API is a management API extension that allows programmatic access to a selection of useful Cortex APIs. These APIs are mostly used for obtaining status and diagnostics, debugging and troubleshooting, or performing uncommon administrative tasks.

  • cortexops.CortexOps

    The CortexOps API is a management API extension used by the CLI and admin dashboard to create, update, or delete the upstream Cortex cluster by reconciling a MonitoringCluster custom resource, which Opni Manager controls.

    Note: the mechanism for deploying the gateway-side components is currently left as an implementation detail for each backend. However, a suitable generic set of APIs will likely be added to the management API in the future to control the upstream capability lifecycle.

Desired State Synchronization

Changing the upstream configuration using the CortexOps API or installing/uninstalling the metrics capability on a cluster using the management API changes the desired state of the associated nodes. When the metrics backend detects that a node (or nodes) no longer matches the current desired state, it will send a SyncNow message to the affected node(s). This triggers each node to send a Sync request to the backend, which then responds with a configuration describing the desired state for that node's resources. The node then reconciles its resources to match the desired state.
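The exchange described above can be sketched in a few lines of Go. This is a simplified in-process model, not the plugin's actual implementation: the `Backend`, `Node`, and `NodeConfig` types, the revision counter used to detect drift, and all method names are assumptions made for illustration. The real plugin exchanges these messages over a stream between gateway and agent.

```go
package main

import "sort"

// NodeConfig is a hypothetical description of the desired state for one node.
type NodeConfig struct {
	Enabled  bool // whether the metrics capability is installed on the node
	Revision int  // bumped whenever the desired state changes (assumption)
}

// Backend models the gateway-side capability backend.
type Backend struct {
	desired  map[string]NodeConfig // desired state keyed by node ID
	reported map[string]int        // last revision handed to each node
}

func NewBackend() *Backend {
	return &Backend{
		desired:  map[string]NodeConfig{},
		reported: map[string]int{},
	}
}

// SetDesired records a new desired state for a node. A call that does not
// change anything leaves the revision untouched.
func (b *Backend) SetDesired(nodeID string, enabled bool) {
	cfg, ok := b.desired[nodeID]
	if ok && cfg.Enabled == enabled {
		return
	}
	cfg.Enabled = enabled
	cfg.Revision++
	b.desired[nodeID] = cfg
}

// OutOfSync returns the IDs of nodes that no longer match the desired state.
// In the flow described above, these nodes would each receive SyncNow.
func (b *Backend) OutOfSync() []string {
	var ids []string
	for id, cfg := range b.desired {
		if b.reported[id] != cfg.Revision {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids) // map iteration order is random, so sort for stable output
	return ids
}

// Sync handles a node's Sync request and responds with its desired state.
func (b *Backend) Sync(nodeID string) NodeConfig {
	cfg := b.desired[nodeID]
	b.reported[nodeID] = cfg.Revision
	return cfg
}

// Node models the agent-side capability node.
type Node struct {
	ID      string
	Applied NodeConfig // state the node has reconciled its resources to
}

// HandleSyncNow reacts to a SyncNow message: the node sends a Sync request
// and reconciles to the configuration it receives.
func (n *Node) HandleSyncNow(b *Backend) {
	n.Applied = b.Sync(n.ID)
}
```

Note that the backend only tells the node to sync, and the node pulls its configuration in response. The same `Sync` call therefore also serves a node requesting its configuration on its own initiative.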

On startup, the node will send a Sync request to the backend to obtain the initial desired state as soon as it connects to the stream.

The backend may also send periodic sync requests at a low frequency (on the order of minutes) to help recover from issues such as accidental modification of managed resources.

Custom Resources

See Managed Resources: Monitoring

Scale and performance

The components of the gateway-mode plugin that operate in the hot path (forwarding remote-write requests) are stateless and should scale evenly with the gateway, as these components mainly forward requests as-is to Cortex. The gateway plugin currently doesn't perform any processing of its own on incoming remote-write requests.

The agent plugin can experience periods of high memory usage depending on the configuration of the Prometheus agent's remote write queue. The agent doesn't buffer requests from remote write clients; instead, it round-trips each request serially to the gateway. This means that in the event of a network outage, the remote write clients are responsible for queueing and retrying failed requests, because the agent doesn't cache them itself.
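The no-buffering behavior can be pictured with a minimal sketch. Everything here is hypothetical: `Forwarder`, `SendFunc`, `ErrGatewayUnavailable`, and `ForwardWithRetry` are illustrative names, and the mutex is only one simple way to model serial round-trips. The point is that a failed send is returned to the caller immediately rather than queued, so retry logic lives on the client side.

```go
package main

import (
	"errors"
	"sync"
)

// ErrGatewayUnavailable is a hypothetical error standing in for a failed
// round trip to the gateway, for example during a network outage.
var ErrGatewayUnavailable = errors.New("gateway unavailable")

// SendFunc is a hypothetical stand-in for the call that delivers one
// remote-write payload to the gateway and waits for the response.
type SendFunc func(payload []byte) error

// Forwarder round-trips requests one at a time and holds no queue.
type Forwarder struct {
	mu   sync.Mutex // serializes round trips
	send SendFunc
}

// Forward sends a single payload and returns the result directly to the
// caller. Nothing is stored on failure.
func (f *Forwarder) Forward(payload []byte) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.send(payload)
}

// ForwardWithRetry models the remote write client's side of the contract:
// because the forwarder does not cache failed requests, the client retries.
func ForwardWithRetry(f *Forwarder, payload []byte, attempts int) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = f.Forward(payload); err == nil {
			return nil
		}
	}
	return err
}
```

In a real deployment the retrying is done by the Prometheus agent's remote write queue, which is why that queue's configuration drives memory usage during an outage.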

Security

The metrics plugin is responsible for securing connections between itself and the components it manages in the upstream cluster, such as Cortex services. Requests to all Cortex services are secured and authenticated using mTLS, with certificates generated by cert-manager and mounted into pods from Kubernetes secrets.
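For readers unfamiliar with mTLS on the client side, the sketch below shows one way to build a Go `tls.Config` from a client certificate, its key, and a CA bundle, using only the standard library. The function name `NewMTLSClientConfig` is hypothetical, and how the plugin actually loads its certificates is not described here. The PEM bytes are assumed to have been read from files mounted from a Kubernetes secret.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// NewMTLSClientConfig builds a TLS configuration that presents a client
// certificate to the server and verifies the server against the given CA
// bundle. All three inputs are PEM-encoded.
func NewMTLSClientConfig(certPEM, keyPEM, caPEM []byte) (*tls.Config, error) {
	// Parse the client certificate and private key presented to the server.
	clientCert, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return nil, err
	}

	// Trust only the supplied CA when verifying the server certificate.
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates found in bundle")
	}

	return &tls.Config{
		Certificates: []tls.Certificate{clientCert},
		RootCAs:      pool,
		MinVersion:   tls.VersionTLS12,
	}, nil
}
```

The resulting config could be passed to an `http.Transparent`-style client setup via `http.Transport.TLSClientConfig` or to gRPC transport credentials; the server side would enforce the "mutual" half by requiring and verifying client certificates.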

High availability

Note: an HA gateway is not yet supported, but support is planned for the near future.

Each instance of the gateway will have its own copy of the plugin running. Depending on the HA strategy used in the gateway, some components of the plugin may be replicated, such as the remote-write APIs. Others, such as the management API extensions, may instead use a leader election strategy to route requests to a single instance.

Testing

Tracked in https://github.com/rancher/opni/issues/813