Threshold Alerts

Monitor your infrastructure by comparing any metric against user-defined thresholds.

Threshold Alerts were formerly known as Metric Alerts.

Create a Threshold Alert

To create a Threshold Alert:

  1. Log in to Sysdig Monitor and open Alerts from the left navigation bar.
  2. Click Add Alert and choose Threshold to begin defining a Threshold Alert.

Define a Threshold Alert

  • Scope: The alert is set to apply to the Entire Infrastructure of your Team Scope by default. However, you can restrict the alert scope by filtering with specific labels, such as cloud_provider_region or kube_namespace_name.

  • Threshold: Select the metric you want to monitor, and configure how you want the data to be aggregated. For instance, if you want to monitor the read latency of a Cassandra Cluster, you can set the metric to cassandra_read_latency. From there, you can choose the aggregation method that best suits your needs. For example, if you want to understand the mean latency across the entire cluster, you can use the average aggregation. Alternatively, if you want to identify nodes with the highest latency, you can use the maximum aggregation.

  • Group By: By grouping metrics by labels such as cloud_provider_availability_zone, a unique segment is generated for each availability zone. This allows you to quickly detect if a particular availability zone is responsible for increased cassandra_read_latency or other performance degradation.

  • Time Aggregation: Also known as the Range, the Time Aggregation of an alert rule determines the time window over which the selected metric is aggregated. For example, if you select the avg time aggregation for cassandra_read_latency with a value of 10m, the alert calculates the average value of the metric over a rolling 10-minute window. In other words, the Time Aggregation defines how far back in time metric values are considered at each evaluation.

  • Duration: Duration defines the time an alert condition must continuously be satisfied before triggering an alert. For instance, a duration of 10m means the condition must be met for a continuous 10 minutes. If the condition stops being met at any point within this period, the 10-minute timer resets and the condition must then hold for a full, uninterrupted 10 minutes again. Setting a longer duration reduces false positives by preventing alerts from being triggered by short-lived threshold violations. A PromQL sketch that combines these settings is shown after this list.
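Taken together, these settings map naturally onto a PromQL expression. The sketch below is for illustration only: the namespace filter and the threshold value of 100 are assumptions, and the Duration is configured in the alert rule settings rather than in the query. It scopes the query to a namespace, averages cassandra_read_latency over a rolling 10-minute window, groups the result by availability zone, and compares it against a threshold:

avg by (cloud_provider_availability_zone) (
  avg_over_time(cassandra_read_latency{kube_namespace_name="cassandra"}[10m])
) > 100

To surface the node with the highest latency instead of the cluster-wide mean, replace the outer avg with max.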

Time Aggregation and Duration

The Time Aggregation of an alert query defines the time period over which the relevant metric data is evaluated. It should not be confused with the Duration of an alert rule, which refers to the length of time an alert condition must be met before it triggers an alert.

Frequency of Alert Rule Evaluation

The Alert Editor automatically displays the time window that works best with your alert rule. Every data point in the alert preview corresponds with an evaluation of an alert rule.

The frequency at which an alert rule is evaluated depends on the Time Aggregation specified in its query. For example:

  • If you set up an alert query with a Time Aggregation of 40 minutes, the rule is evaluated every 1 minute.
  • If you set up a query with a Time Aggregation of 4 hours, the rule is evaluated every 10 minutes.

Re-notifications for an alert cannot be sent more frequently than the alert rule’s evaluation interval and must be a multiple of this interval. For example, if an alert rule is evaluated every 10 minutes, re-notifications can only occur at multiples of the evaluation frequency, such as 20 minutes, 30 minutes, and so forth.

Time Aggregation of Alert Query    Frequency of Alert Rule Evaluation
up to 3h                           1m
up to 1d                           10m
up to 7d                           1h
up to 60d                          1d
60d+                               Not Supported

To view time series data older than the recommended window, click Explore Historical Data in the top right corner of the Alert Editor. This populates a PromQL query in the Explore module with your current settings.

Snapshots in Alert Notifications

Threshold Alert notifications include a snapshot of the triggering time series data when forwarded to Slack, Email, PagerDuty, or Microsoft Teams. When the notification channel is configured to Notify when Resolved, a snapshot of the time series data that resolves the alert is also provided in the notification. For Slack notification channels, you can choose whether to include a snapshot in the notification channel settings. See Customize Notifications.

Notification Channels that Support Snapshots in Alert Notifications

Notification Channel       Snapshot Support
Email                      ✅
Slack                      ✅
PagerDuty                  ✅
Microsoft Teams            ✅
Google Chat                🚫
Custom Webhook             🚫
OpsGenie                   🚫
VictorOps                  🚫
Prometheus Alertmanager    🚫

Enriched Labels in Threshold Alert Notifications

All Threshold Alert notifications are enriched by default with contextual labels, which aid in faster issue identification and resolution. When an alert rule triggers, Sysdig automatically appends contextual labels to the alert notification, such as host_hostname, cloud_provider_region, and kube_cluster_name.

Multiple Thresholds

In addition to an alert threshold, a warning threshold can be configured. Warning thresholds and alert thresholds can be associated with different notification channels. For example, you might send both warning and alert notifications to Slack, but also page the on-call team through PagerDuty when the alert threshold is met.

If both warning and alert thresholds are associated with the same notification channel and a metric immediately exceeds the alert threshold, the warning threshold is skipped and only the alert is triggered.

Create an Alert on No Data

With the No Data alert configuration, you can choose how to handle situations when there is no incoming data for a metric across all its time series. In the Settings section, select from the two options for No Data:

Ignore: Select this option if you prefer not to receive notifications when all time series of a metric stop sending data.

Notify: Choose this if you want to be alerted when data stops coming in for all time series of a metric.

A No Data alert will not be triggered by an individual time series ceasing to report data; it activates only when all time series for a metric stop reporting.

Threshold Alerts in Sysdig Monitor do not auto-resolve when a time series that triggered an alert rule stops reporting data, unlike Prometheus alerts which can auto-resolve under similar conditions. This means you must manually resolve an alert occurrence if the time series that triggered the alert rule ceases to report data.

Translate to PromQL

You can automatically translate from Form to PromQL in order to leverage the flexibility and power of PromQL. Use the Translate to PromQL option to create more complex queries.

This query, for example, looks at the percentage of available memory on a host:

sysdig_host_memory_available_bytes / sysdig_host_memory_total_bytes * 100

Thresholds are configured separately from the query. This means you can specify both an alert threshold and a warning threshold.
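For instance, with the memory query above you might set an alert threshold of 10 and a warning threshold of 25 (both values are illustrative). Conceptually, the alert condition then corresponds to the expression below, although in the Alert Editor the threshold is entered separately rather than written into the query:

sysdig_host_memory_available_bytes / sysdig_host_memory_total_bytes * 100 < 10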

Example: Alert When Data Transfer Exceeds the Threshold

The example below shows an alert that triggers when the average rate of data transferred by a container exceeds 500 KiB/s for a period of 1 minute.
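A rough PromQL equivalent of this configuration is sketched below. The metric name sysdig_container_net_total_bytes, the assumption that it reports bytes transferred per second, and the container_name grouping label are illustrative and may differ in your environment; 500 KiB/s is written as 512000 bytes per second:

avg by (container_name) (
  avg_over_time(sysdig_container_net_total_bytes[1m])
) > 512000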