Metric Alerts

Sysdig Monitor offers an easy way to define metrics-based alerts.

You can create metric alerts for scenarios such as:

  • Number of processes running on a host
  • Root volume disk usage in a container
  • CPU or memory usage of a host or workload

Defining a Metric Alert

Guidelines

  • Set a unique name and description: Use a meaningful name and description that help recipients easily identify the alert.

  • Specify multiple segments: A single segment might not supply enough information to troubleshoot. Enrich the selected entity by adding related segments. Enter hierarchical entities so that you get a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as namespace, deployment, and so on.

Specify Metrics

Select a metric that this alert will monitor. You can also define how data is aggregated, such as average, maximum, minimum, or sum.

Configure Scope

Team scope is automatically applied to alerts. You can further filter the environment by overriding the scope.

For example, an alert scoped to the us-east-1a cloud availability zone fires when the CPU usage of any host in that zone rises above the defined threshold.

You can also create alerts directly from Explore and Dashboards to populate this scope automatically.
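The scoping behaviour described above can be pictured as label filtering. The following is only a minimal sketch, not Sysdig's implementation: it assumes that the alert scope narrows whatever the team scope already allows, and the label keys (`team`, `availability_zone`) and host names are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]


def in_scope(labels: Labels, scope: Labels) -> bool:
    """Return True when every key/value pair in the scope matches the labels."""
    return all(labels.get(key) == value for key, value in scope.items())


def hosts_evaluated(hosts: List[Labels], team_scope: Labels, alert_scope: Labels) -> List[str]:
    """Keep only hosts that pass both the team scope and the alert scope."""
    return [
        host["name"]
        for host in hosts
        if in_scope(host, team_scope) and in_scope(host, alert_scope)
    ]


# Hypothetical inventory; the label names are assumptions for illustration.
hosts = [
    {"name": "host-1", "team": "payments", "availability_zone": "us-east-1a"},
    {"name": "host-2", "team": "payments", "availability_zone": "us-east-1b"},
    {"name": "host-3", "team": "search", "availability_zone": "us-east-1a"},
]

selected = hosts_evaluated(
    hosts,
    team_scope={"team": "payments"},
    alert_scope={"availability_zone": "us-east-1a"},
)
print(selected)
```

Only hosts that satisfy both filters are evaluated against the alert threshold in this sketch.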

Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires one alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

Metric alerts can be triggered to notify you of different aggregations:

  • on average: The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if new data is retrieved in the 7th minute of a 10-minute sample and the alert is defined as "on average", the alert value is calculated by summing the 3 recorded values and dividing by 3.

  • as a rate: The average value of the metric across the time period evaluated. The expected number of values is used to calculate the rate that triggers the alert. For example, if new data is retrieved in the 7th minute of a 10-minute sample and the alert is defined as "as a rate", the alert value is calculated by summing the 3 recorded values and dividing by 10 (10 x 1-minute samples).

  • in sum: The combined sum of the metric across the time period evaluated.

  • at least once: The trigger value is met for at least one sample in the evaluated period.

  • for the entire time: The trigger value is met for every sample in the evaluated period.

  • as a rate of change: The trigger value is met by the change in value over the evaluated period.
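The arithmetic behind these aggregations can be sketched in a few lines. This is an illustration under stated assumptions, not Sysdig's code: the function names are hypothetical, the "greater than" comparison is assumed, and the sample values are made up. It covers only the aggregations whose calculation is spelled out in the descriptions.

```python
from typing import Sequence


def on_average(samples: Sequence[float]) -> float:
    """Divide by the number of samples actually retrieved."""
    return sum(samples) / len(samples)


def as_a_rate(samples: Sequence[float], expected_samples: int) -> float:
    """Divide by the number of samples expected in the window."""
    return sum(samples) / expected_samples


def in_sum(samples: Sequence[float]) -> float:
    """Combined sum of all retrieved samples."""
    return sum(samples)


def at_least_once(samples: Sequence[float], threshold: float) -> bool:
    """True if any single sample is above the threshold (comparison assumed)."""
    return any(value > threshold for value in samples)


def for_the_entire_time(samples: Sequence[float], threshold: float) -> bool:
    """True if every sample is above the threshold; an empty window never fires."""
    return bool(samples) and all(value > threshold for value in samples)


# A 10-minute window with 1-minute samples, of which only 3 were recorded.
recorded = [30.0, 60.0, 90.0]

print(on_average(recorded))                 # 180 / 3  -> 60.0
print(as_a_rate(recorded, 10))              # 180 / 10 -> 18.0
print(in_sum(recorded))                     # 180.0
print(at_least_once(recorded, 80.0))        # True, because of the 90.0 sample
print(for_the_entire_time(recorded, 80.0))  # False, 30.0 and 60.0 are below
```

With sparse data, the same three samples give very different values for "on average" and "as a rate", so the choice of aggregation affects whether a threshold is crossed.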

For example, an alert segmented by host_hostname and kube_cluster_name fires for each unique segment that uses more than 75% of the filesystem on average over the last 5 minutes.
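Per-segment evaluation of this kind can be sketched as a group-and-average step. The code below is a simplified illustration, not the product's logic: it assumes the samples already belong to the last 5 minutes, and the host names, cluster names, and usage values are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (host_hostname, kube_cluster_name, filesystem used in percent)
Sample = Tuple[str, str, float]


def breaching_segments(samples: Iterable[Sample], threshold: float = 75.0) -> List[Tuple[str, str]]:
    """Return every (host_hostname, kube_cluster_name) whose average usage exceeds the threshold."""
    by_segment: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for host, cluster, used_percent in samples:
        by_segment[(host, cluster)].append(used_percent)
    return sorted(
        segment
        for segment, values in by_segment.items()
        if sum(values) / len(values) > threshold
    )


# Hypothetical samples from the evaluation window.
window = [
    ("host-a", "prod", 80.0),
    ("host-a", "prod", 90.0),
    ("host-b", "prod", 70.0),
    ("host-b", "prod", 78.0),
    ("host-a", "staging", 60.0),
]

for host, cluster in breaching_segments(window):
    print(f"alert: {host} in {cluster} is above 75% filesystem usage on average")
```

Here only host-a in the prod cluster averages above 75% (85%), so it is the only segment that produces an alert.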

Example: Alert When Data Transfer Exceeds the Threshold

This example alert triggers when the average rate of data transferred by a container is over 20 KiB/s for a period of 1 minute.
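The threshold check in this example comes down to a unit conversion and an average. The sketch below assumes the metric arrives as bytes-per-second samples at 10-second intervals (six per minute); the sample interval, function name, and values are assumptions for illustration.

```python
from typing import Sequence

KIB = 1024
THRESHOLD_BYTES_PER_SEC = 20 * KIB  # 20 KiB/s expressed in bytes per second


def average_exceeds_threshold(
    bytes_per_sec: Sequence[float],
    threshold: float = THRESHOLD_BYTES_PER_SEC,
) -> bool:
    """Return True when the average of the window's samples is above the threshold."""
    if not bytes_per_sec:
        return False  # no data in the window, nothing to compare
    return sum(bytes_per_sec) / len(bytes_per_sec) > threshold


# One minute of hypothetical 10-second samples for a single container.
busy_minute = [18000, 25000, 22000, 19000, 24000, 21000]
quiet_minute = [15000] * 6

print(average_exceeds_threshold(busy_minute))   # average 21500 B/s is above 20480 B/s
print(average_exceeds_threshold(quiet_minute))  # average 15000 B/s is below 20480 B/s
```

Note that individual samples in the busy minute dip below 20 KiB/s; the alert condition is met because the average over the whole window is above it.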

In the alert Settings, you can configure a link to a Runbook and to a Dashboard to speed up troubleshooting when the alert fires.

When viewing the triggered alert, you can quickly access the Runbook and Dashboard you defined.