PromQL Alerts

Sysdig Monitor enables you to use PromQL to monitor and alert on changes in your infrastructure.

Define a PromQL Alert

  • Condition: Enter a valid PromQL expression. Unlike Metric Alerts, PromQL queries return only the time series that meet the specified condition. For example, if you enter sysdig_host_cpu_used_percent > 80, only hosts with CPU usage above 80% are included in the query results. If all hosts are below this threshold, the query returns no time series and the alert is in a Resolved state, which is indistinguishable from a No Data state.

It can be hard to tell whether an alert has truly resolved or whether the metric has disappeared and the alert has failed silently, because the No Data state in Prometheus is indistinguishable from a Resolved state. Strategies for dealing with this are described in the No Data and Multiple Thresholds section.

  • Duration: The duration is how long an alert condition must remain true before the alert triggers. Prometheus alerts have three states: Resolved, Pending, and Firing. If a duration of 10m is set, the alert condition must be satisfied continuously for 10 minutes before the alert transitions into the Firing state. Alerts whose expression is satisfied but that have not yet met the required duration are in a Pending state.

  • Alert Resolution Delay: The Alert Resolution Delay is identical to keep_firing_for in Prometheus. This setting keeps an alert occurrence firing for a user-defined duration after the alert condition is no longer met. Alerts that tend to flap, or whose underlying metric has brief gaps of No Data, can be configured with an Alert Resolution Delay to prevent unnecessary noise and false resolutions. If the alert condition is met again before the Alert Resolution Delay has elapsed, the alert continues to fire without needing to satisfy the duration again.
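As a rough sketch, the three settings above map onto a Prometheus-style rule as shown below. The group name, the alert name HostCpuHigh, and the 5m delay are hypothetical values chosen for illustration. keep_firing_for is the Prometheus field that Alert Resolution Delay corresponds to; whether the Sysdig rule import accepts it is an assumption, so the delay may need to be set in the alert editor instead.

```yaml
groups:
- name: example            # hypothetical group name
  rules:
  - alert: HostCpuHigh     # hypothetical alert name
    # Condition: the query returns only hosts above 80% CPU usage.
    expr: sysdig_host_cpu_used_percent > 80
    # Duration: the condition must hold for 10 continuous minutes
    # (Pending state) before the alert moves to Firing.
    for: 10m
    # Alert Resolution Delay: keep firing for 5 minutes after the
    # condition stops matching. Assumption: only honored where
    # keep_firing_for is supported (Prometheus 2.42 or later).
    keep_firing_for: 5m
```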

Range and Duration

The duration of an alert refers to the length of time a specific condition must persist before triggering the alert. It should not be confused with the range of an alert, which defines the time period over which the relevant metric data is evaluated.

In the example below, the highNetworkTrafficFoo alert examines the average network transmission rate over the previous hour. If this average stays above 10 MB per second (10,000,000 bytes per second) for a continuous duration of 5 minutes, the alert is triggered.

On the other hand, highNetworkTrafficBar examines the average network transmission rate over the past 5 minutes. If this average stays above 10 MB per second for a continuous duration of 1 hour, the alert is triggered.

  rules:
  - alert: highNetworkTrafficFoo
    expr: avg(rate(network_bytes_total[1h])) > 10000000
    for: 5m
  - alert: highNetworkTrafficBar
    expr: avg(rate(network_bytes_total[5m])) > 10000000
    for: 1h

While using a longer duration can help reduce noisy alerts, it also means that some alerts may meet the threshold momentarily without triggering. Therefore, there is a trade-off between suppressing noisy alerts and potentially delaying notifications for certain conditions.

Label Enrichment in Alert Notifications

Contextual labels are automatically appended to alert notifications when PromQL alert rules trigger. Examples of such labels are host_hostname, cloud_provider_region, and kube_cluster_name. This additional context aids in faster issue identification, troubleshooting, and resolution.

Compare PromQL Alerts and Metric Alerts

PromQL Alerts query the same metrics as Metric Alerts. The difference between the two alert types lies in the evaluation algorithm.

Metric Alerts                              | PromQL Alerts
-------------------------------------------|----------------------------------------------------------
Native No Data support                     | No Data state is the same as Resolved state
Multi-threshold support                    | Must create two alerts for multiple thresholds
No Duration                                | Duration supported
Queries return all time series by default  | Queries return only time series that satisfy the alert query
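
Because a PromQL Alert carries a single condition, a warning level and a critical level are typically written as two separate rules. The sketch below uses hypothetical alert names and hypothetical thresholds of 80 and 95. The third rule uses the standard PromQL absent() function, a common Prometheus workaround for noticing that a metric has disappeared; treating it as a No Data substitute in your environment is an assumption, not documented Sysdig behavior.

```yaml
groups:
- name: cpu-thresholds             # hypothetical group name
  rules:
  # Warning level: one rule per threshold.
  - alert: HostCpuWarning          # hypothetical alert name
    expr: sysdig_host_cpu_used_percent > 80
    for: 10m
    labels:
      severity: warning
  # Critical level: a second rule with a higher threshold.
  - alert: HostCpuCritical         # hypothetical alert name
    expr: sysdig_host_cpu_used_percent > 95
    for: 10m
    labels:
      severity: critical
  # Approximate No Data detection: absent() returns a value only
  # when no series exist for the metric at all.
  - alert: HostCpuMetricMissing    # hypothetical alert name
    expr: absent(sysdig_host_cpu_used_percent)
    for: 10m
    labels:
      severity: warning
```

Note that absent() only fires when every series for the metric is gone; it does not detect a single host that stops reporting while others continue.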

No Data and Multiple Thresholds

To identify No Data scenarios and configure multiple thresholds, you can switch from a PromQL Alert to a Sysdig Metric Alert.

To create a Metric Alert with these capabilities, follow these steps:

  1. Create a new Metric Alert.
  2. Switch the alert creation mode from Form to PromQL.
  3. Continue with configuring a Metric Alert using PromQL.
  4. Optionally, add a warning threshold and notify on No Data.

By transitioning to a Sysdig Metric Alert with PromQL, you can use the multiple threshold and No Data capabilities provided by Sysdig’s monitoring solution.

Import Prometheus Alert Rules

Sysdig Alert allows you to import Prometheus rules. Click the Upload Prometheus Rules option and enter the rules as YAML in the Upload Prometheus Rules YAML editor. Importing your Prometheus alert rules will convert them to PromQL alerts.

Each alert rule must include the following fields:

  • alert
  • expr
  • for
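
A minimal rule containing only these three fields might look like the sketch below. The group name and alert name are hypothetical, the expression reuses the CPU condition from earlier in this page, and the groups wrapper is assumed to follow the same layout as the examples in this section.

```yaml
groups:
- name: minimal              # hypothetical group name
  rules:
  - alert: HostCpuHigh       # alert: name of the alert (hypothetical)
    expr: sysdig_host_cpu_used_percent > 80   # expr: PromQL condition
    for: 5m                  # for: how long the condition must hold
```

Optional fields such as labels and annotations, shown in the examples that follow, can be added alongside these.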

Example: Alert Prometheus Crash Looping

This alert detects potential restart loops on the prometheus, pushgateway or alertmanager jobs.

groups:
- name: crashlooping
  rules:
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n

Example: Alert HTTP Error Rate

NginxHighHttp5xxErrorRate detects an HTTP 5xx error rate above 5%.

NginxLatencyHigh detects a p99 latency above 3 seconds, grouped by host and node.

To alert on HTTP requests with status 5xx (> 5%) or on high latency:

groups:
- name: default
  rules:
  # Paste your rules here
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: Too many HTTP requests with status 5xx
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: Nginx p99 latency is higher than 3 seconds

Learn More