Metric Alerts
Define a Metric Alert
Scope: By default, the alert applies to the Entire Infrastructure of your team scope. You can restrict the alert scope by filtering on specific labels, such as cloud_provider_region or kube_namespace_name.
Metric: Select the metric you want to monitor and configure how the data is aggregated. For instance, to monitor the read latency of a Cassandra cluster, set the metric to cassandra_read_latency, then choose the aggregation that suits your needs. To understand the mean latency across the entire cluster, use the average aggregation. To identify the nodes with the highest latency, use the maximum aggregation.
Group By Segment: Grouping metrics by a label such as cloud_provider_availability_zone generates a unique segment for each availability zone. This lets you quickly detect whether a particular availability zone is responsible for increased cassandra_read_latency or other performance degradation.
Metric Alerts do not support configuring a duration.
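As a rough illustration of segmentation, the sketch below groups hypothetical latency samples by the availability-zone label and aggregates each group separately. The sample values, the zone names, and the group_by_label helper are assumptions made for illustration; this is not how Sysdig implements grouping.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (labels, cassandra_read_latency value in ms).
samples = [
    ({"cloud_provider_availability_zone": "us-east-1a"}, 4.0),
    ({"cloud_provider_availability_zone": "us-east-1a"}, 6.0),
    ({"cloud_provider_availability_zone": "us-east-1b"}, 20.0),
    ({"cloud_provider_availability_zone": "us-east-1b"}, 30.0),
]


def group_by_label(samples, label, aggregate):
    """Aggregate sample values separately for each distinct value of `label`."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[label]].append(value)
    return {segment: aggregate(values) for segment, values in groups.items()}


# One segment per availability zone, each with its own aggregated value.
average_by_zone = group_by_label(samples, "cloud_provider_availability_zone", mean)
maximum_by_zone = group_by_label(samples, "cloud_provider_availability_zone", max)
```

In this toy data, the second zone stands out in both the average and the maximum, which is the kind of signal segmentation is meant to surface.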
Range and Duration
The range of an alert defines the time period over which the relevant metric data is evaluated. It should not be confused with the duration of an alert, which can only be configured for PromQL Alerts and refers to the length of time an alert condition must persist before the alert triggers. Metric Alerts, even when translated to PromQL, trigger as soon as the alert condition is satisfied.
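The difference between the two behaviors can be sketched with a toy evaluator. This is a minimal illustration under assumed semantics (one evaluation per tick, a greater-than comparison, and hypothetical function names); it is not the actual Sysdig or Prometheus evaluation engine.

```python
def fires_immediately(aggregated_values, threshold):
    """No duration: fire at the first evaluation that breaches the threshold."""
    for tick, value in enumerate(aggregated_values):
        if value > threshold:
            return tick
    return None


def fires_after_duration(aggregated_values, threshold, duration_ticks):
    """With a duration: fire only after `duration_ticks` consecutive breaches."""
    consecutive = 0
    for tick, value in enumerate(aggregated_values):
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= duration_ticks:
            return tick
    return None


# Each value is the metric already aggregated over the alert's range.
evaluations = [10, 95, 20, 96, 97, 98]

first_breach = fires_immediately(evaluations, threshold=90)
sustained_breach = fires_after_duration(evaluations, threshold=90, duration_ticks=3)
```

With these hypothetical values, the evaluator without a duration fires on the short spike at the second evaluation, while the one with a duration waits until the breach has persisted for three evaluations in a row.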
Configure Threshold
Define the threshold and time range for assessing the alert condition.
Metric alerts can aggregate data over the time range in various ways:
Aggregation | Description |
---|---|
average | The average of the retrieved metric values across the time period. |
sum | The sum of the metric across the time period evaluated. |
maximum | The maximum of the retrieved metric values across the time period. |
minimum | The minimum of the retrieved metric values across the time period. |
For more information on thresholds, see Multiple Thresholds.
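To make the four aggregations concrete, here is a small sketch that applies each one to the data points falling inside the range. The point format, the aggregate_over_range helper, and the window boundaries (exclusive start, inclusive end) are assumptions for illustration only.

```python
from statistics import mean

# The four time aggregations listed in the table above.
AGGREGATIONS = {"average": mean, "sum": sum, "maximum": max, "minimum": min}


def aggregate_over_range(points, now, range_seconds, aggregation):
    """Aggregate the values of (timestamp, value) points within the last `range_seconds`."""
    window = [value for ts, value in points if now - range_seconds < ts <= now]
    if not window:
        return None  # no data in the range, nothing to aggregate
    return AGGREGATIONS[aggregation](window)


# Hypothetical points; the first one is older than the 60-second range.
points = [(0, 100.0), (30, 10.0), (40, 20.0), (50, 30.0), (60, 40.0)]

results = {
    name: aggregate_over_range(points, now=60, range_seconds=60, aggregation=name)
    for name in AGGREGATIONS
}
```

The resulting value is what gets compared against the configured threshold, so the same data can trigger or not trigger depending on the aggregation chosen.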
Images in Metric Alert Notifications
Metric Alert notifications forwarded to Slack or Email include a snapshot of the triggering time series data. For Slack notification channels, the snapshot can be toggled within the notification channel settings. When the channel is configured to Notify when Resolved, a snapshot of the time series data that resolves the alert is also provided in the notification.
Multiple Thresholds
In addition to an alert threshold, a warning threshold can be configured. Warning thresholds and alert thresholds can be associated with different notification channels. For example, you may want to send both warning and alert notifications to Slack, but also page the on-call team on PagerDuty when the alert threshold is met.
If both the warning and alert thresholds are associated with the same notification channel, a metric that immediately exceeds the alert threshold skips the warning threshold and triggers only the alert threshold.
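The routing described above can be sketched as follows. The threshold values, the channel names, the greater-than comparison, and both helper functions are hypothetical; the sketch only models the case where every warning channel is also an alert channel.

```python
# Hypothetical routing: Slack receives both severities, PagerDuty only alerts.
ROUTING = {"warning": ["slack"], "alert": ["slack", "pagerduty"]}


def severity(value, warning_threshold, alert_threshold):
    """Return the highest severity the value reaches, or None."""
    if value > alert_threshold:
        return "alert"  # the alert threshold takes precedence over the warning
    if value > warning_threshold:
        return "warning"
    return None


def channels_to_notify(value, warning_threshold, alert_threshold, routing=ROUTING):
    """Return (severity, channels) for a single evaluation."""
    level = severity(value, warning_threshold, alert_threshold)
    return (level, routing[level]) if level else (None, [])
```

For instance, with a warning threshold of 70 and an alert threshold of 90, channels_to_notify(95, 70, 90) reports only the alert severity, so Slack receives a single alert notification instead of a warning followed by an alert.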
Create an Alert on No Data
When a metric stops reporting data, alerts that use that metric cannot be evaluated. To be notified when this happens, set the No Data option in the Settings section to notify. The option can be set to either ignore or notify.
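A minimal sketch of the two settings, assuming a single evaluation over the values found in the range. The evaluate_alert function, its return values, and the use of an average with a greater-than comparison are illustrative assumptions, not the product's API.

```python
def evaluate_alert(window_values, threshold, no_data="ignore"):
    """Return 'alert', 'no_data', or None for one evaluation.

    `no_data` mirrors the two settings named above: 'ignore' or 'notify'.
    """
    if no_data not in ("ignore", "notify"):
        raise ValueError("no_data must be 'ignore' or 'notify'")
    if not window_values:
        # Nothing was reported in the range, so the condition cannot be evaluated.
        return "no_data" if no_data == "notify" else None
    average = sum(window_values) / len(window_values)
    return "alert" if average > threshold else None
```

With ignore, an empty range is indistinguishable from a healthy metric; with notify, the gap itself produces a notification.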
Translate to PromQL
You can automatically translate an alert from the form to PromQL to take advantage of the flexibility and power of PromQL. The Translate to PromQL option lets you run more complex queries, for example:
sysdig_host_memory_available_bytes / sysdig_host_memory_total_bytes * 100
This query looks at the percentage of available memory on a host.
Thresholds are configured separately from the query, allowing the user to specify both an alert threshold and a warning threshold.
Metric alerts translated from form to PromQL do not currently support configuring a duration.
Example: Alert When Data Transfer Is Over the Threshold
The example below shows an alert that triggers when the average rate of data transferred by a container is over 20 KiB/s across a 1-minute range.
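As a worked version of this condition, the sketch below converts the 20 KiB/s threshold to bytes per second and compares it with the average of the samples in a 1-minute range. The sample values and the exceeds_threshold helper are hypothetical.

```python
from statistics import mean

KIB = 1024  # bytes in one kibibyte


def exceeds_threshold(bytes_per_second_samples, threshold_kib_per_s=20):
    """True when the average of the samples in the range is above the threshold."""
    threshold_bytes_per_s = threshold_kib_per_s * KIB  # 20 KiB/s = 20480 B/s
    return mean(bytes_per_second_samples) > threshold_bytes_per_s


# Hypothetical samples collected over a 1-minute range, in bytes per second.
quiet_minute = [15000, 18000, 19000, 20000, 17000, 16000]  # average 17500 B/s
busy_minute = [15000, 30000, 26000, 24000, 19000, 21000]   # average 22500 B/s

quiet_triggers = exceeds_threshold(quiet_minute)
busy_triggers = exceeds_threshold(busy_minute)
```

Because the aggregation is an average, the busy minute triggers even though some of its individual samples are below 20480 B/s.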
In the alert Settings, you can configure a link to a Runbook and to a Dashboard to speed up troubleshooting when the alert fires.
When viewing the triggered alert you will be able to quickly access your defined Runbook and Dashboard.