Prometheus Alerts
Prometheus Alerts were formerly known as PromQL Alerts.
To create a Prometheus Alert:
- Log in to Sysdig Monitor and open Alerts from the left navigation bar.
- Click Add Alert and choose Prometheus.
Define a Prometheus Alert
- Condition: Enter a valid PromQL expression. Unlike Threshold Alerts, PromQL queries only return time series that meet the specified condition. For example, if you enter `sysdig_host_cpu_used_percent > 80`, only the hosts with CPU usage above 80% are included in the query results. If all hosts are below this threshold, the query returns a resolved state, which is identical to a No Data state.
Because the No Data state in Prometheus is indistinguishable from a resolved state, it can be hard to tell whether an alert has truly resolved or whether the metric has disappeared and failed silently. Strategies for dealing with this are described in No Data and Multiple Thresholds below.
- Duration: The duration is how long an alert condition must remain true before triggering an alert. Prometheus Alerts have three states: Resolved, Pending, and Firing. If a duration of `10m` is set, the alert condition must be satisfied continuously for 10 minutes before the alert transitions into the Firing state. Alerts whose expression is satisfied but that have not yet met the required duration are in the Pending state.
- Keep Firing For: This setting keeps an alert occurrence firing for a user-defined duration even after the alert condition is no longer satisfied. Alerts that tend to flap, or whose underlying metric goes through brief gaps of No Data, can be configured with Keep Firing For to prevent unnecessary noise and false resolutions. If the alert condition is met again before the Keep Firing For period has elapsed, the alert continues to fire without needing to satisfy the duration once more. The sketch after this list shows how these settings map onto Prometheus rule syntax.
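As a minimal sketch, the three settings above can be expressed in Prometheus rule syntax roughly as follows. The group and alert names are illustrative only, the example reuses the `sysdig_host_cpu_used_percent` condition from above, and the native `keep_firing_for` field is assumed to correspond to the Keep Firing For setting:

```yaml
groups:
  - name: example
    rules:
      # Illustrative rule combining Condition, Duration, and Keep Firing For
      - alert: HostHighCpuUsage
        expr: sysdig_host_cpu_used_percent > 80   # Condition: returns only hosts above 80% CPU
        for: 10m                                  # Duration: condition must hold for 10 continuous minutes
        keep_firing_for: 5m                       # Keep Firing For: keeps firing 5 minutes after the condition clears
        labels:
          severity: warning
```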
Frequency of Alert Rule Evaluation
The Alert Editor automatically chooses the time window that works best with your alert rule. Every data point in the alert preview corresponds with an evaluation of an alert rule.
The frequency at which an alert rule is evaluated depends on the range specified in its query, defined in PromQL by the time window inside the brackets. Using a larger time window for data aggregation leads to less frequent evaluation of the alert rule. For instance, consider monitoring a service's error rate: occasional errors might be tolerable, but a steady stream of errors over a certain period could indicate a problem. With an alert query such as `min_over_time(service_error_rate[4h])`, the alert rule is evaluated every 10 minutes, and each evaluation analyzes the error rate over the past 4 hours to determine whether the alert should be triggered.
Re-notifications for an alert cannot be sent more frequently than the alert rule’s evaluation interval and must be a multiple of this interval. For example, if an alert rule is evaluated every 10 minutes, re-notifications can only occur at multiples of the evaluation frequency, such as 20 minutes, 30 minutes, etc.
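As a sketch of the error-rate example above, the rule could be written as follows. The metric name `service_error_rate` is taken from the example, while the rule name and the `> 0` threshold (fire only if errors never dropped to zero across the window) are assumptions for illustration:

```yaml
groups:
  - name: error-rate
    rules:
      # Fires when the error rate never dropped to zero at any point in the past 4 hours
      - alert: ServiceSteadyErrorStream
        expr: min_over_time(service_error_rate[4h]) > 0   # illustrative threshold
        for: 0m
```

Because the 4h range falls into the "up to 1d" row of the table below, this rule is evaluated every 10 minutes, and re-notifications can only be configured at multiples of that interval.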
| Range of Alert Query | Frequency of Alert Rule Evaluation |
|---|---|
| up to 3h | 1m |
| up to 1d | 10m |
| up to 7d | 1h |
| up to 60d | 1d |
| 60d+ | Not Supported |
Users seeking to explore metric data outside the window recommended by the Alert Editor can navigate to Explore Historical Data to view time series data older than the recommended window.
Range and Duration
The duration of an alert refers to the length of time a specific condition must persist before triggering the alert. It should not be confused with the range of an alert, which defines the time period over which the relevant metric data is evaluated.
In the example below, the `highNetworkTrafficFoo` alert examines the average network transmittance over the previous hour. If this hourly average exceeds 10MB/s for a continuous duration of 5 minutes, the alert is triggered.
In contrast, `highNetworkTrafficBar` examines the average network transmittance over the past 5 minutes. If this 5-minute average stays above 10MB/s for a full hour, the alert is triggered.
```yaml
rules:
  - alert: highNetworkTrafficFoo
    expr: avg(rate(network_bytes_total[1h])) > 10000000
    for: 5m
  - alert: highNetworkTrafficBar
    expr: avg(rate(network_bytes_total[5m])) > 10000000
    for: 1h
```
While using a longer duration can help reduce noisy alerts, it also means that some alerts may meet the threshold momentarily without triggering. Therefore, there is a trade-off between suppressing noisy alerts and potentially delaying notifications for certain conditions.
Label Enrichment in Alert Notifications
Contextual labels are automatically appended to alert notifications when Prometheus Alert rules trigger. Examples of such labels are `host_hostname`, `cloud_provider_region`, and `kube_cluster_name`. This additional context aids in faster issue identification, troubleshooting, and resolution.
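As a hedged sketch, such labels could also be referenced from an alert's annotations using the same `{{ $labels.<name> }}` template syntax that appears in the import examples below. Whether each enriched label is resolvable at template time, rather than only appearing in the delivered notification, is an assumption here:

```yaml
annotations:
  # Assumption: enriched labels such as host_hostname and kube_cluster_name are available to templates
  summary: High CPU on {{ $labels.host_hostname }} in cluster {{ $labels.kube_cluster_name }} ({{ $labels.cloud_provider_region }})
```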
Compare Prometheus Alerts and Threshold Alerts
Prometheus Alerts query the same metrics as Threshold Alerts. The difference between the two alert types lies in the evaluation algorithm.
| Threshold Alerts | Prometheus Alerts |
|---|---|
| No Data state does not cause alert resolution | No Data state is the same as the Resolved state |
| Multi-threshold support | Must create two alerts for multiple thresholds |
| Duration supported | Duration supported |
| Queries return all time series by default | Queries only return time series that satisfy the alert query |
No Data and Multiple Thresholds
To avoid an alert resolution in No Data scenarios or to configure multiple thresholds, you can switch from a Prometheus Alert to a Threshold Alert.
To create a Threshold Alert with these advanced capabilities, follow these steps:
- Create a new Threshold Alert.
- Switch the alert creation mode from Form to PromQL.
- Continue with configuring a Threshold Alert using PromQL.
- Optionally, add a warning threshold and notify on No Data.
By transitioning to a Threshold Alert with PromQL, you can fully utilize the multiple-threshold and No Data capabilities provided by Sysdig’s monitoring solution.
Import Prometheus Alert Rules
Sysdig Alert allows you to import Prometheus rules. Click the Upload Prometheus Rules option and enter the rules as YAML in the Upload Prometheus Rules YAML editor. Importing your Prometheus alert rules converts them to Prometheus Alerts.
Each alert rule must include the following mandatory fields; a minimal sketch follows this list:
- `alert`
- `expr`
- `for`
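For instance, a minimal importable rule carrying only the mandatory fields could look like the sketch below. The group name, alert name, and expression are illustrative only:

```yaml
groups:
  - name: minimal-example
    rules:
      # Illustrative rule: fires after a scrape target has been down for 5 minutes
      - alert: TargetDown
        expr: up == 0
        for: 5m
```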
Example: Alert Prometheus Crash Looping
This alert detects potential restart loops in the `prometheus`, `pushgateway`, or `alertmanager` jobs.
```yaml
groups:
  - name: crashlooping
    rules:
      - alert: PrometheusTooManyRestarts
        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Prometheus too many restarts (instance {{ $labels.instance }})
          description: "Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n"
```
Example: Alert HTTP Error Rate
`NginxHighHttp5xxErrorRate` detects an HTTP 5xx error rate of 5% or higher. `NginxLatencyHigh` fires when p99 latency, calculated per host and node, is higher than 3 seconds.
To alert on HTTP requests with status 5xx (> 5%) or on high latency:
```yaml
groups:
  - name: default
    rules:
      # Paste your rules here
      - alert: NginxHighHttp5xxErrorRate
        expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
          description: Too many HTTP requests with status 5xx
      - alert: NginxLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Nginx latency high (instance {{ $labels.instance }})
          description: Nginx p99 latency is higher than 3 seconds
```