Metric Alerts

Sysdig Monitor lets you define metric-based alerts by combining the simplicity of form-based configuration with the flexibility of PromQL.

You can create metric alerts for scenarios such as:

  • Number of processes running on a host
  • Root volume disk usage in a container
  • CPU / memory usage of a host or workload

Defining a Metric Alert

Guidelines

  • Set a unique name and description: Choose a meaningful name and description that help recipients easily identify the alert.

  • Specify multiple labels: A single label might not supply enough information to troubleshoot. Enrich the selected entity by adding related labels. Enter hierarchical entities so that you get a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context needed to troubleshoot. To narrow down the issue, add further contextual information, such as namespace, deployment, and so on.

Specify Metrics

Select a metric that this alert will monitor. You can also define how data is aggregated, such as average, maximum, minimum, or sum.

Configure Scope

Team scope is automatically applied to alerts. You can further filter the environment by overriding the scope.

For example, an alert scoped to the us-east-1a cloud availability zone fires when the CPU usage of any host in that zone goes above the defined threshold.
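The effect of a scope can be sketched in Python. This is only an illustrative model, not how Sysdig Monitor evaluates alerts: the `firing_hosts` function, the `zone` label name, the host names, and the 80% threshold are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical snapshot: one entry per host with its labels and CPU usage (%).
hosts = [
    {"host": "web-1", "zone": "us-east-1a", "cpu_percent": 91.0},
    {"host": "web-2", "zone": "us-east-1a", "cpu_percent": 35.0},
    {"host": "db-1", "zone": "us-east-1b", "cpu_percent": 97.0},
]


def firing_hosts(snapshot: List[dict], scope: Dict[str, str],
                 threshold: float) -> List[str]:
    """Return hosts that match every scope label and exceed the threshold."""
    in_scope = [h for h in snapshot
                if all(h.get(key) == value for key, value in scope.items())]
    return [h["host"] for h in in_scope if h["cpu_percent"] > threshold]


# db-1 is above the threshold but outside the scope, so it is ignored.
print(firing_hosts(hosts, {"zone": "us-east-1a"}, threshold=80.0))  # ['web-1']
```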

Alerting on No Data

When a metric stops reporting, Sysdig Monitor shows no data where you would normally expect data points. To detect incidents that fail silently, you can configure alerts to notify you when a metric ceases to report data.

Use the No Data option in the Settings section to determine how a metric alert behaves when the metric reports no data.

By default, alerts configured for metrics that stop reporting data are not evaluated. To change this behavior, enable Notify on missing data. An alert is then sent when the metric stops reporting data.
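The two behaviors can be modeled with a short Python sketch. It is an assumption-laden illustration, not Sysdig's implementation: the `evaluate` function, its return strings, and the use of `None` for a missing data point are all hypothetical.

```python
from typing import Optional, Sequence


def evaluate(samples: Sequence[Optional[float]], threshold: float,
             notify_on_missing: bool = False) -> str:
    """Return the outcome of one evaluation window for a "greater than" alert."""
    recorded = [s for s in samples if s is not None]
    if not recorded:
        # The metric stopped reporting during this window.
        return "no_data_alert" if notify_on_missing else "not_evaluated"
    average = sum(recorded) / len(recorded)
    return "triggered" if average > threshold else "ok"


silent = [None, None, None]
print(evaluate(silent, threshold=80.0))                          # not_evaluated
print(evaluate(silent, threshold=80.0, notify_on_missing=True))  # no_data_alert
```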

This feature is currently available only for Metric Alerts.

Configure Threshold

Define the threshold and time window for assessing the alert condition.

Metric alerts can be triggered on the following aggregations:

  • average: The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if new data is retrieved in the 7th minute of a 10-minute window and the alert is defined as on average, the value is calculated by summing the 3 recorded values and dividing by 3.

  • sum: The combined sum of the metric values across the evaluated time period.

  • maximum: The maximum of the retrieved metric values across the time period.

  • minimum: The minimum of the retrieved metric values across the time period.
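The four aggregations, and the way the average uses only the samples actually retrieved, can be sketched in Python. This is an illustrative model rather than Sysdig's implementation: the `aggregate` function, the use of `None` for minutes without data, and the sample values are assumptions.

```python
from typing import Optional, Sequence


def aggregate(samples: Sequence[Optional[float]], how: str) -> Optional[float]:
    """Aggregate the samples recorded in one evaluation window.

    `None` marks a minute with no data. Such entries are skipped, so
    "average" divides by the count of recorded values, not the window length.
    """
    recorded = [s for s in samples if s is not None]
    if not recorded:
        return None  # nothing was reported in this window
    if how == "average":
        return sum(recorded) / len(recorded)
    if how == "sum":
        return sum(recorded)
    if how == "maximum":
        return max(recorded)
    if how == "minimum":
        return min(recorded)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical 10-minute window in which only 3 values were recorded.
window = [None] * 7 + [40.0, 50.0, 60.0]
print(aggregate(window, "average"))  # (40 + 50 + 60) / 3 = 50.0
```

Dividing by 10 instead of 3 would understate the average while the metric is only partially reported, which is why the recorded sample count is used.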

For more information on thresholds, see Multiple Thresholds.

Translate to PromQL

You can automatically translate an alert from the form to PromQL to take advantage of the flexibility and power of PromQL. The Translate to PromQL option lets you run more complex queries. For example:

sysdig_host_memory_available_bytes / sysdig_host_memory_total_bytes * 100

This query calculates the percentage of memory available on a host.

Thresholds are configured separately from the query, allowing the user to specify both an alert threshold and a warning threshold.
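The separation between the query and its thresholds can be sketched in Python. The sketch mirrors the arithmetic of the query above. The `classify` function, the 20% warning and 10% alert thresholds, and the memory sizes are hypothetical values chosen for illustration, not Sysdig defaults.

```python
def available_memory_percent(available_bytes: float, total_bytes: float) -> float:
    """Mirror the query: available / total * 100."""
    return available_bytes / total_bytes * 100


def classify(value: float, warning_below: float, alert_below: float) -> str:
    """Compare a query result against two thresholds kept outside the query."""
    if value < alert_below:
        return "alert"
    if value < warning_below:
        return "warning"
    return "ok"


GIB = 1024 ** 3
percent = available_memory_percent(2 * GIB, 16 * GIB)  # 12.5
print(classify(percent, warning_below=20.0, alert_below=10.0))  # warning
```

Because the thresholds live outside the query, the same expression can serve both severity levels without being rewritten.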

Metric alerts translated from form to PromQL do not currently support configuring a duration.

Example: Alert When Data Transfer Exceeds the Threshold

This example describes an alert that triggers when the average rate of data transferred by a container is over 20 KiB/s for a period of 1 minute.
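The condition in this example can be sketched in Python. The sketch only models the arithmetic and is not Sysdig's evaluation logic. The `average_rate_exceeds` function, the 10-second sampling interval, and the sample values are assumptions.

```python
from typing import Sequence

KIB = 1024  # bytes in one KiB


def average_rate_exceeds(bytes_per_second: Sequence[float],
                         limit_kib_per_s: float) -> bool:
    """Return True when the mean rate over the window is above the limit."""
    if not bytes_per_second:
        return False  # no samples, so there is nothing to compare
    mean = sum(bytes_per_second) / len(bytes_per_second)
    return mean > limit_kib_per_s * KIB


# Hypothetical 1-minute window sampled every 10 seconds (bytes per second).
one_minute = [18_000, 22_000, 25_000, 21_000, 19_500, 23_500]

# The mean is 21,500 B/s, which is above 20 KiB/s (20,480 B/s).
print(average_rate_exceeds(one_minute, limit_kib_per_s=20))  # True
```

Note that individual samples below the limit (18,000 and 19,500 B/s here) do not prevent the alert from triggering, because the comparison is made on the average.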

In the alert Settings, you can configure a link to a Runbook and to a Dashboard to speed up troubleshooting when the alert fires.

When viewing the triggered alert, you can quickly access the Runbook and Dashboard you defined.