Monitoring
The `metrics` plugin implements metrics observability features for Opni. It provides a fully managed Cortex cluster deployed in the central Opni cluster and Prometheus agents deployed in the downstream clusters.
The monitoring plugin has both gateway and agent modes.
- The gateway mode is responsible for creating/updating/deleting the `MonitoringCluster` custom resource, which the Opni Manager reconciles to deploy the various Cortex components in the upstream cluster.
- The agent mode is responsible for creating/updating/deleting a `Prometheus` custom resource, which the (currently external) Prometheus Operator reconciles to deploy a Prometheus agent-mode instance in the downstream cluster.
The gateway-side plugin (or "capability backend") and the agent-side plugin (or "capability node") form the logical `metrics` capability, which can be "installed" onto clusters by adding the capability name to a list in the cluster's metadata. Adding the capability to a cluster will change the desired state for that node, then re-sync the node (see below).
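As a rough illustration of installing a capability by adding its name to a list in the cluster's metadata, here is a minimal Python sketch. The `Cluster` class, the `capabilities` field, and both helper functions are hypothetical names chosen for this example. They are not Opni's actual types or API.

```python
from dataclasses import dataclass, field
from typing import List

METRICS = "metrics"  # capability name used in this sketch


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster record and its metadata."""
    id: str
    capabilities: List[str] = field(default_factory=list)


def install_capability(cluster: Cluster, name: str) -> bool:
    """Add `name` to the cluster's capability list.

    Returns True if the list changed, meaning the desired state for the
    node changed and a re-sync should follow.
    """
    if name in cluster.capabilities:
        return False
    cluster.capabilities.append(name)
    return True


def uninstall_capability(cluster: Cluster, name: str) -> bool:
    """Remove `name` from the capability list; True if the list changed."""
    if name not in cluster.capabilities:
        return False
    cluster.capabilities.remove(name)
    return True
```

Returning a "changed" flag keeps the operation idempotent: installing a capability that is already present does not need to trigger another sync.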
- The `CortexAdmin` API is a management API extension that allows programmatic access to a selection of useful Cortex APIs. These APIs are mostly used for obtaining status and diagnostics, debugging and troubleshooting, or performing uncommon administrative tasks.
- The `CortexOps` API is a management API extension used in the CLI and admin dashboard to create, update, or delete the upstream Cortex cluster by reconciling a `MonitoringCluster` custom resource, which Opni Manager controls.

Note: the mechanism for deploying the gateway-side components is currently left as an implementation detail for each backend. However, a suitable generic set of APIs will likely be added to the management API to control upstream capability lifecycle in the future.
Changing the upstream configuration using the `CortexOps` API or installing/uninstalling the metrics capability on a cluster using the management API changes the desired state of the associated nodes. When the metrics backend detects that a node (or nodes) no longer matches the current desired state, it will send a `SyncNow` message to the affected node(s). This triggers each node to send a `Sync` request to the backend, which then responds with a configuration describing the desired state for that node's resources. The node then reconciles its resources to match the desired state.
On startup, the node will send a `Sync` request to the backend to obtain the initial desired state as soon as it connects to the stream.
The backend may also send periodic sync requests at a low frequency (on the order of minutes) to help recover from issues such as accidental modification of managed resources.
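The sync flow described above can be sketched as a small in-memory simulation. This is an illustration only: the `Backend` and `Node` classes, their method names, and the dictionary-based configuration are assumptions made for the example and do not mirror Opni's real message types or transport.

```python
class Backend:
    """Hypothetical metrics backend that stores the desired state per node."""

    def __init__(self):
        self.desired = {}  # node id -> desired configuration
        self.nodes = {}    # node id -> connected Node

    def set_desired(self, node_id, config):
        """Record a new desired state and notify the node if it is out of date."""
        self.desired[node_id] = config
        node = self.nodes.get(node_id)
        if node is not None and node.actual != config:
            node.on_sync_now()  # plays the role of the SyncNow message

    def sync(self, node_id):
        """Answer a Sync request with a copy of the node's desired state."""
        return dict(self.desired.get(node_id, {}))


class Node:
    """Hypothetical capability node that reconciles toward the desired state."""

    def __init__(self, node_id, backend):
        self.id = node_id
        self.backend = backend
        self.actual = {}
        self.log = []

    def connect(self):
        """On startup, register with the backend and sync immediately."""
        self.backend.nodes[self.id] = self
        self.log.append("connect")
        self.do_sync()

    def on_sync_now(self):
        """A SyncNow notification only triggers a regular Sync request."""
        self.log.append("sync_now")
        self.do_sync()

    def do_sync(self):
        """Fetch the desired state and reconcile if it differs from actual."""
        desired = self.backend.sync(self.id)
        self.log.append("sync")
        if desired != self.actual:
            self.actual = desired
            self.log.append("reconcile")
```

In this sketch the node always pulls its configuration, and the backend only signals that a pull is needed. A periodic sync would simply call `on_sync_now()` on a timer, which is a no-op when the node already matches the desired state.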
Deployed on the upstream cluster by the metrics backend plugin.
Responsibilities:
- Deploy and configure Cortex in the upstream cluster. Cortex can be deployed in either standalone mode (all components run in a single pod) or HA mode (components run in their own pods and scale individually).
- Deploy and configure a `Grafana` custom resource, which the Grafana Operator reconciles to deploy Grafana in the upstream cluster. The Grafana instance is automatically configured to connect to Opni, install plugins, datasources, and dashboards, and configure authentication.
Deployed on the downstream cluster by the metrics capability node plugin.
Responsibilities:
- Deploy and configure a Prometheus instance in the downstream cluster. The metrics plugin configures a Prometheus agent-mode instance to send remote-write requests to the Opni agent.
The components of the gateway-mode plugin that operate in the hot path (forwarding remote-write requests) are stateless and should scale evenly with the gateway, as these components mainly forward requests as-is to Cortex. The gateway plugin doesn't perform any processing of its own on incoming remote-write requests at the moment.
The agent plugin can experience periods of high memory usage depending on the configuration of the Prometheus agent's remote write queue. The agent doesn't buffer requests from remote write clients; instead, it round-trips each request serially to the gateway. This means that in the event of a network outage, the remote write clients are responsible for queueing and retrying failed requests, and the agent doesn't cache them itself.
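To make the division of responsibility concrete, the following Python sketch models an agent that forwards each request without buffering and a client that owns the retry queue. `ForwardingAgent`, `RemoteWriteClient`, and the `send_to_gateway` callable are hypothetical names for illustration. The real agent and Prometheus remote write queue are considerably more involved.

```python
from collections import deque
from typing import Callable, Deque


class ForwardingAgent:
    """Hypothetical agent that forwards each request and keeps no buffer."""

    def __init__(self, send_to_gateway: Callable[[bytes], None]) -> None:
        self._send = send_to_gateway

    def handle_remote_write(self, payload: bytes) -> bool:
        """Round-trip one request; report failure instead of queueing it."""
        try:
            self._send(payload)
            return True
        except ConnectionError:
            return False


class RemoteWriteClient:
    """Hypothetical remote write client that queues and retries on its own."""

    def __init__(self, agent: ForwardingAgent) -> None:
        self.agent = agent
        self.queue: Deque[bytes] = deque()

    def enqueue(self, payload: bytes) -> None:
        self.queue.append(payload)

    def flush(self) -> int:
        """Send queued payloads in order, stopping at the first failure.

        Returns the number of payloads delivered. Undelivered payloads
        stay in the client's queue for a later retry.
        """
        sent = 0
        while self.queue:
            if not self.agent.handle_remote_write(self.queue[0]):
                break
            self.queue.popleft()
            sent += 1
        return sent
```

Because the agent holds no state between requests, any memory growth during an outage happens in the client's queue, which matches the observation that memory usage depends on the remote write queue configuration.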
The metrics plugin is responsible for securing connections between itself and other components in the upstream cluster managed by it, such as Cortex services. Requests to all Cortex services are secured and authenticated using mTLS, with certificates generated by cert-manager and mounted into pods from Kubernetes secrets.
(note: HA gateway is not yet supported, but will be in the near future)
Each instance of the gateway will have its own copy of the plugin running. Depending on the HA strategy used in the gateway, some components of the plugin may be replicated, such as the remote-write APIs. Others, such as the management API extensions, may instead use a leader election strategy to route requests to a single instance.
Tracked in https://github.com/rancher/opni/issues/813