Alerts

Alerts are the responsive component of Sysdig Monitor. They notify you when an event or issue that requires attention occurs. Events and issues are identified based on changes in the metric values collected by Sysdig Monitor. The Alerts module provides out-of-the-box alerts and a wizard for creating and editing alerts as needed.

Alert Types

The types of alerts available in Sysdig Monitor:

  • Downtime: Monitor any type of entity, such as a host, a container, or a process, and alert when the entity goes down.

  • Metric: Monitor time-series metrics, and alert if they violate user-defined thresholds.

  • PromQL: Monitor metrics through a PromQL query.

  • Event: Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and unauthorized access.

Alert Tools

The following tools help with alert creation:

  • Alert Library: Sysdig Monitor provides a set of alerts by default. Use them as they are or as templates to create your own.

  • Sysdig API: Use Sysdig’s Python client to create, list, delete, update and restore alerts. See examples.

  • Import Prometheus Rules: Sysdig Monitor allows you to import Prometheus rules or create new rules on the fly and add them to the existing list of alerts.

Create Alerts for CloudWatch Metrics

CloudWatch metric queries are displayed as no data in the Alerts Editor because the Sysdig metric store does not currently store CloudWatch metrics. However, you can still successfully create alerts using these metrics.

Guidelines for Creating Alerts

  • Decide what to monitor: Determine what type of problem you want to be alerted on. See Alert Types to choose a type of problem.

  • Define how it will be monitored: Specify exactly what behavior triggers a violation. For example, the Marathon App is down on the Kubernetes cluster named Production for ten minutes.

  • Decide where to monitor: Narrow down your environment to receive fine-tuned results. Use Scope to choose an entity that you want to keep a close watch on. Specify additional segments (entities) to give context to the problem. For example, in addition to specifying a Kubernetes cluster, add a namespace and deployment to refine your scope.

  • Define when to notify: Define the threshold and time window for assessing the alert condition. Setting up a warning threshold lets you notify recipients of incidents earlier. For example, a database using 60% disk may trigger a warning to Slack, while the same database using 80% disk may page the on-call team (see the sketch after this list).

  • Decide how notifications are sent: Alerts support customizable notification channels, including email, mobile push notifications, OpsGenie, Slack, and more. To see supported services, see Set Up Notification Channels.
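
For illustration, the warning and alert thresholds in the disk example above could be expressed as two PromQL-style conditions. This is a sketch only; the metric name sysdig_fs_used_percent is an assumption, not necessarily the metric you would pick in the editor.

# warning threshold, routed to Slack
sysdig_fs_used_percent > 60
# alert threshold, routed to the on-call pager
sysdig_fs_used_percent > 80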

To create alerts, simply:

  1. Choose an alert type.

  2. Configure alert parameters.

  3. Configure the notification channels you want to use for alert notification.

Sysdig sometimes deprecates outdated metrics. Alerts that use these metrics will not be modified or disabled, but will no longer be updated. See Deprecated Metrics and Labels.

1 - Configure Alerts

Use the Alerts Editor to create or edit alerts.

Different Ways To Create An Alert

In addition to using the Alert Editor, you can create alerts from other modules:

  • From Metrics Explorer, select Create Alert.
  • From an existing Dashboard, select the More Options (three dots) icon for a panel, and select Create Alert.
  • From any Event panel, select Create Alert from Event.

Create An Alert from the Editor

Configure notification channels before you begin, so the channels are available to assign to the alert. Optionally, you can add a custom subject and body information into individual alert notifications.

Enter Basic Alert Information

Configuration differs slightly for each alert type. See the respective pages to learn more. This section covers general instructions to help you get acquainted with and navigate the Alerts user interface.

To configure an alert, open the Alert Editor and set the following parameters:

Alert Types

Select the desired Alert Type:

  • Downtime: Select the entity to monitor.
  • Metric: Select a time-series metric to be alerted on if it violates user-defined thresholds.
  • PromQL: Enter the PromQL query and duration to define an alert condition.
  • Event: Filter the custom event to be alerted on by using the name, tag, description, and a source tag.

Metric and Condition

  • Scope: Select Entire Infrastructure, or one or more labels to apply a limited scope and filter a specific metric.

  • Metric: Select a metric for this alert to monitor. Selecting a metric from the list automatically adds its name to the threshold expression being edited. Define how the data is aggregated over time (time aggregation), such as average, maximum, minimum, or sum; time aggregation is the historical data rolled up over a selected period. For a PromQL comparison, see the sketch after this list.

  • Group By: Metrics are applied to a group of items (Group Aggregation). If no group aggregation type is selected, the appropriate default for the metric will be applied (either sum or average). Group aggregation functions must be applied outside of time aggregation functions.

  • Segment by: Select one or more labels for segmentation. This allows for the creation of multi-series comparisons and multiple alerts. Multiple alerts will be triggered for each segment you specify. For more information, see Metric Alerts.
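
For comparison, the combination of time aggregation, group aggregation, and segmentation described above maps naturally onto a PromQL expression. The sketch below assumes the metric name sysdig_container_cpu_used_percent, a 5-minute window, and an 80% threshold purely for illustration.

# time aggregation (average over 5 minutes), then group aggregation per segment
avg by (kube_cluster_name, container_name) (
  avg_over_time(sysdig_container_cpu_used_percent[5m])
) > 80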

Multiple Thresholds

In addition to an alert threshold, a warning threshold can be configured for Metric Alerts and Event Alerts. Warning thresholds and alert thresholds can be associated with different notification channels. In the following example, a user may want to send a warning and alert notification to Slack, but also page the on-call team on Pagerduty if an alert threshold is met.

  • Notify when resolved: To prevent a Pagerduty incident from automatically resolving once the alert threshold is no longer met, toggle Notify when Resolved off so that the on-call team can triage the incident. This setting allows an alert to override the notification channel’s default notification settings. If an override is not configured, the alert will inherit the default settings from the notification channel.

If both warning and alert thresholds are associated with the same notification channel, a metric that immediately exceeds the alert threshold will skip the warning threshold and trigger only the alert.

Notification

  • Notification Channel: Select from the configured notification channels in the list. Supported channels are:

    • Email

    • Slack

    • Amazon SNS Topic

    • Opsgenie

    • Pagerduty

    • VictorOps

    • Webhook

    You can view the list of notification channels configured for each alert on the Alerts page.

  • Configure Notification Template: If applicable, add the following message format details and click Apply Template.

    • Notification Subject & Event Title: Customize using variables, such as {{__alert_name__}} is {{__alert_status__}} for {{agent_id}}
    • Notification Body: Add the text for the notification you are creating. See Customize Notifications.

Settings

  • Alert Severity: Select a priority: High, Medium, Low, or Info.
  • Alert Name: Specify a meaningful name that can uniquely represent the Alert you are creating. For example, the entity that an alert targets, such as Production Cluster Failed Scheduling pods.
  • Description (optional): Briefly expand on the alert name or alert condition to give additional context for the recipient.
  • Group (optional): Specify a meaningful group name for the alert you are creating. Alerts that have no group name will be added to the Default Group.
  • Link to Dashboard: Select a dashboard that you might want to include in the alert notification. You can view the specified dashboard link in the event feed associated with the alert.
  • Link to Runbook: Specify the URL of a runbook. The link to the runbook appears in the event feed.

Captures

Optionally, configure a Sysdig capture. Specify the following:

  • Capture Enabled: Click the slider to enable Capture.
  • Capture Duration: The period of time captured. The default time is 15 seconds. The capture time starts from the time the alert threshold was breached.
  • Capture Storage: The storage location for the capture files.
  • Capture Name: The name of the capture file.
  • Capture Filter: Restricts the amount of trace information collected.

Sysdig capture files are not available for Event and PromQL Alerts. See Captures for more information.

Optional: Customize Notifications

You can optionally customize individual notifications to provide context for the errors that triggered the alert. All the notification channels support this added contextual information and customization flexibility.

Modify the subject, body, or both of the alert notification with the following:

  • Plaintext: A custom message stating the problem. For example, Stalled Deployment.

  • Hyperlink: For example, URL to a Dashboard.

  • Dynamic Variable: For example, a hostname. Note the conventions:

    • All variables that you insert must be enclosed in double curly braces, such as {{file_mount}}.
    • Variables are case sensitive.
    • The variables should correspond to the segment values you created the alert for. For example, if an alert is segmented by host_hostName and container_name, the corresponding variables will be {{host_hostName}} and {{container_name}} respectively. In addition to these segment variables, __alert_name__  and __alert_status__ are supported. No other segment variables are allowed in the notification subject and body.
    • Notification subjects will not show up on the Event feed.
    • Using a variable that is not a part of the segment will trigger an error.

The body of the notification message contains a Default Alert Template. It is the default alert notification generated by Sysdig Monitor. You may add free text, variables, or hyperlinks before and after the template.

You can send a customized alert notification to the following channels:

  • Email
  • Slack
  • Amazon SNS Topic
  • Opsgenie
  • Pagerduty
  • VictorOps
  • Webhook

The following example shows a notification template created to alert you on Failing Prometheus Jobs. Adding {{kube_cluster_name}}: {{job}} - {{__alert_name__}} is {{__alert_status__}} to the subject line helps you identify the problem area at a glance without having to read the entire notification body.

Supported Aggregation Functions

The table below displays supported time aggregation functions, group aggregation functions, and relational operators:

Time Aggregation Function | Group Aggregation Function | Relational Operator
timeAvg()                 | avg()                      | =
min()                     | min()                      | <
max()                     | max()                      | >
sum()                     | sum()                      | <=
not applicable            | not applicable             | >=
not applicable            | not applicable             | !=

2 - Manage Alerts

Alerts can be managed individually, or as a group, by using the checkboxes on the left side of the Alert UI and the customization bar.

The columns of the table can also be configured to provide you with the necessary data for your use cases.

Select a group of alerts and perform several batch operations, such as filtering, deleting, enabling, disabling, or exporting to a JSON object. Select individual alerts to perform tasks such as creating a copy for a different team.

View Alert Details

The bell button next to an alert indicates that you have not resolved the corresponding events. The Activity Over Last Two Weeks column visually notifies you with an event chart showing the number of events that were triggered over the last two weeks. The color of the event chart indicates the severity level of those events.

To view alert details, click the corresponding alert row. The slider with the alert details will appear. Click an individual event to Take Action. You can do one of the following:

  • Acknowledge: Mark that the event has been acknowledged by the intended recipient.

  • Create Silence from Event: If you no longer want to be notified, use this option. You can choose the scope for alert silence. When silenced, alerts will still be triggered but will not send you any notifications.

  • Explore: Use this option to troubleshoot by using the PromQL Query Explorer.

If no events are reported in the past two weeks, the event feed will be empty and the Activity Over Last Two Weeks column will show no event chart.

Enable/Disable Alerts

Alerts can be enabled or disabled using the slider or the customization bar. You can perform these operations on a single alert or on multiple alerts as a batch operation.

  1. From the Alerts module, check the boxes beside the relevant alerts.

  2. Click Enable Selected or Disable Selected as necessary.

Use the slider beside the alert to disable or enable individual alerts.

Edit an Existing Alert

To edit an existing alert:

  1. Do one of the following:

    • Click the Edit button beside the alert.

    • Click an alert to open the detail view, then click Edit on the top right corner.

  2. Edit the alert, and click Save to confirm the changes.

Copy an Alert

Alerts can be copied within the current team to allow for similar alerts to be created quickly, or copied to a different team to share alerts.

Copy an Alert to the Current Team

To copy an alert within the current team:

  1. Highlight the alert to be copied.

    The detail view is displayed.

  2. Click Copy.

    The Copy Alert screen is displayed.

  3. Select Current from the drop-down.

  4. Click Copy and Open.

    The alert opens in edit mode.

  5. Make necessary changes and save the alert.

Copy an Alert to a Different Team

  1. Highlight the alert to be copied.

    The detail view is displayed.

  2. Click Copy.

    The Copy Alert screen is displayed.

  3. Select the teams that the alert should be copied to.

  4. Click Send Copy.

Search for an Alert

Search Using Strings

The Alerts table can be searched using partial or full strings. For example, the search below displays only the alerts that contain kubernetes:

Filter Alerts

The alert feed can be filtered in multiple ways to drill down into the environment’s history and refine the alerts displayed. The feed can be filtered by severity or status. Examples of each are shown below.

The example below shows only high and medium severity:

The example below shows the alerts that are invalid:

Export Alerts as JSON

A JSON file can be exported to a local machine, containing JSON snippets for each selected alert:

  1. Click the checkboxes beside the relevant alerts to be exported.

  2. Click Export JSON.

Delete Alerts

Open the Alert page and use one of the following methods to delete alerts:

  • Hover on a specific alert and click Delete.

  • Hover on one or more alerts, click the checkbox, then click Delete on the bulk-action toolbar.

  • Click an alert to see the detailed view, then click Delete on the top right corner.

3 - Alert Types

Sysdig Monitor can generate notifications based on certain conditions or events you configure. Using the alert feature, you can keep tabs on your infrastructure and find out about problems as they happen, or even before they happen, with the alert conditions you define. In Sysdig Monitor, metrics serve as the central configuration artifact for alerts. An alert ties one or more conditions or events to the measures to take when a condition is met or an event happens. Alerts work across Sysdig modules including Explore, Dashboard, Events, and Overview.

The types of alerts available in Sysdig Monitor:

  • Downtime: Monitor any type of entity, such as a host, a container, or a process, and alert when the entity goes down.

  • Metric: Monitor time-series metrics, and alert if they violate user-defined thresholds.

  • PromQL: Monitor metrics through a PromQL query.

  • Event: Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and unauthorized access.

3.1 - Downtime Alert

Sysdig Monitor continuously surveils different types of entities in your infrastructure, such as a host, a container, a process, and sends notifications when the monitored entity is not available or responding. Downtime alert focuses mainly on unscheduled downtime of programs, containers, and hosts in your infrastructure.

In this example, the downtime of containers is monitored. When one or more containers in the given scope go down in the 1-minute time window, notifications are sent with the necessary information on both the containers and the agents.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend showing which segment (or combination of segments if there is more than one) each line represents. You can also deselect segment lines to prevent them from showing in the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart. For downtime alerts, the segments are what you select for the Alert if any of option.

About Up Metrics

To monitor the downtime of entities, the following up metrics are used: sysdig_host_up, sysdig_container_up, and sysdig_program_up. They indicate whether the agent is able to communicate with the collector about the entity. The value 1 indicates that the entity is up and the agent is sending this information to the collector. The value 0 indicates that the entity is down, which implies no communication from the agent to the collector about the entity.

When an alert is configured based on an up metric, two data API queries are performed during the alert check: one retrieves the current values and the other retrieves the values from the previous alert check interval. For any entity that was present in the previous interval but is not present in the current interval, the metric is marked as 0.

An aggregated value of the up metric is displayed on the dashboard in the Alert Editor, and therefore you might see a value between 0 and 1.
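
For example, averaging sysdig_container_up across a segment yields the fraction of containers that are up, which is why the chart can show values between 0 and 1. The following is a sketch; the kube_cluster_name label follows the labels used elsewhere in this guide.

# 1 = every container in the cluster is up, 0 = every container is down
avg by (kube_cluster_name) (sysdig_container_up)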

Define a Downtime Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort by severity. You can use severity as a criterion when creating alerts, for example: notify if there are more than 10 high-severity events.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

Configure Condition

Scope

Filter the environment on which this alert will apply. For example, an alert will fire when a container associated with the agent 197288 goes down. The alert will be triggered for each container name and agent ID.

Use in or contain operators to match multiple different possible values to apply scope.

The contain and not contain operators help you retrieve values if you know part of the value. For example, us retrieves values that contain the string “us”, such as “us-east-1b”, “us-west-2b”, and so on.

The in and not in operators help you filter multiple values.

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

Metric

Select the up metric associated with the entity whose downtime you want to monitor. You can select one of the following entities: host, container, or program.

Entity

Specify additional segments by using the Alert if any of option.

The specified entities are segmented on and appear in the default notification template as well as in the Preview. In this example, data is segmented on container name and agent ID. When a container is affected, the notification will include not only the affected container details but also the associated agent IDs.

Trigger

Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If the monitored program is not available or not responding for the last 1 minute, recipients will be notified.

You can set any value for the percentage and a value greater than 1 for the time window. For example, if you choose 50% instead of 100%, a notification is triggered when the entity is down for 5 minutes within the selected time window of 10 minutes.
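
Expressed as a PromQL-style condition, the 50% over 10 minutes example above roughly corresponds to the following sketch; it is for illustration only, since the Downtime alert type builds this logic for you.

# fires when a container has been down for more than 50% of the last 10 minutes
avg_over_time(sysdig_container_up[10m]) < 0.5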

Use Cases

  • Your e-commerce website is down during the peak hours of Black Friday, Christmas, or New Year season.

  • Production servers of your data center experience a critical outage.

  • MySQL database is unreachable.

  • File upload does not work on your marketing website.

3.2 - PromQL Alerts

Sysdig Monitor enables you to use PromQL to define metric expressions that you can alert on.

You define the alert conditions using the PromQL-based metric expression. This way, you can combine different metrics and alert on cases like service-level agreement breach, running out of disk space in a day, and so on.

Examples

For PromQL alerts, you can use any metric that is available in PromQL, including Sysdig native metrics. For more details see the various integrations available on promcat.io.

Low Disk Space Alert

Warn if disk space is predicted to fall below a specified quantity. For example, alert if disk space is predicted to drop below 10 GB within the next 24 hours:

predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000

Slow Etcd Requests

Notify if etcd requests are slow. This example uses the promcat.io integration.

histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15

High Heap Usage

Warn when the heap usage in ElasticSearch is more than 80%. This example uses the promcat.io integration.

(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80

Guidelines

Sysdig Monitor does not currently support the following:

  • Interacting with the Prometheus alert manager or importing alert manager configuration.

  • Using, copying, pasting, or importing predefined alert rules.

  • Converting alert rules to map to the Sysdig alert editor.

Create a PromQL Alert

Set a meaningful name and description that help recipients easily identify the alert.

Set a Priority

Select a priority for the alert that you are creating. The supported priorities are High, Medium, Low, and Info. You can also view events in the Dashboard and Explore UI and sort them by severity.

Define a PromQL Alert

PromQL: Enter a valid PromQL expression. The query will be executed every minute. However, the alert will be triggered only if the query returns data for the specified duration.

In this example, you will be alerted when disk space is predicted to fall below 10 GB within the next 24 hours.

Duration: Specify the time window for evaluating the alert condition in minutes, hour, or day. The alert will be triggered if the query returns data for the specified duration.
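
For readers familiar with Prometheus alerting rules, the Duration field plays a role similar to a rule's for clause: the expression must keep returning data for that long before the alert fires. Reusing the disk-space expression from the example above:

# with a Duration of, say, 10 minutes, this expression must return data
# continuously for 10 minutes before the alert triggers
predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000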

Define Notification

Notification Channels: Select from the configured notification channels in the list.

Re-notification Options: Set the time interval at which multiple alerts should be sent if the problem remains unresolved.

Notification Message & Events: Enter a subject and body. Optionally, you can choose an existing template for the body. Modify the subject, body, or both for the alert notification with a hyperlink, plain text, or dynamic variables.

Import Prometheus Alert Rules

Sysdig Alert allows you to import Prometheus rules or create new rules on the fly and add them to the existing list of alerts. Click the Upload Prometheus Rules option and enter the rules as YAML in the Upload Prometheus Rules YAML editor. Importing your Prometheus alert rules will convert them to PromQL-based Sysdig alerts. Ensure that the alert rules are valid YAML.

You can upload one or more alert rules in a single YAML and create multiple alerts simultaneously.

Once the rules are imported to Sysdig Monitor, the alert list will be automatically sorted by last modified date.

Besides the pre-populated template, each rule specified in the Upload Prometheus Rules YAML editor requires the following fields:

  • alert

  • expr

  • for

See the following examples to understand the format of Prometheus Rules YAML. Ensure that the alert rules are valid YAML to pass validation.

Example: Alert Prometheus Crash Looping

To alert on potential Prometheus crash looping, create a rule that triggers when Prometheus restarts more than twice in the last 10 minutes.

groups:
- name: crashlooping
  rules:
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n

Example: Alert HTTP Error Rate

To alert on HTTP requests with status 5xx (> 5%) or on high latency:

groups:
- name: default
  rules:
  # Paste your rules here
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: Too many HTTP requests with status 5xx
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: Nginx p99 latency is higher than 3 seconds


3.3 - Metric Alerts

Sysdig Monitor offers an easy way to define metrics-based alerts.

You can create metric alerts for scenarios such as:

  • Number of processes running on a host
  • Root volume disk usage in a container
  • CPU / memory usage of a host or workload

Defining a Metric Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as namespace, deployment, and so on.

Specify Metrics

Select a metric that this alert will monitor. You can also define how data is aggregated, such as average, maximum, minimum, or sum.

Configure Scope

Team scope is automatically applied to alerts. You can further filter the environment by overriding the scope.

For example, the alert below will fire when any host’s CPU usage goes above the defined threshold within the us-east-1a cloud availability zone.

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.
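
For reference, the scope and condition described above could be approximated in PromQL along these lines. This is a sketch only; the metric name sysdig_host_cpu_used_percent, the availability-zone label name, and the 80% threshold are assumptions for illustration.

# any host in us-east-1a whose CPU usage averages above 80% over the last 5 minutes
avg by (host_hostname) (
  avg_over_time(sysdig_host_cpu_used_percent{cloud_provider_availability_zone="us-east-1a"}[5m])
) > 80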

Alerting on No Data

When a metric stops reporting, Sysdig Monitor shows no data where you would normally expect data points. To detect such incidents that fail silently, you can configure alerts to notify you when a metric ceases to report data.

You can use the No Data option in the Settings section to determine how a metric alert should behave upon discovering the metric reports no data.

By default, alerts configured for metrics that stop reporting data will not be evaluated. You can change this behavior by enabling Notify on missing data, in which case, an alert will be sent when the metric stops reporting data.

This feature is currently available only for Metric Alerts.
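
For PromQL alerts, which do not offer this setting, a comparable effect can often be achieved with the absent() function. The following is a sketch; the metric name and namespace label are assumptions for illustration.

# returns 1 (and can therefore fire) when no series for the metric exists in the scope
absent(sysdig_container_cpu_used_percent{kube_namespace_name="production"})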

Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires an alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

Metric alerts can be triggered based on different aggregations of the metric:

  • on average: The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if new data is retrieved in the 7th minute of a 10-minute sample and the alert is defined as on average, the alert is calculated by summing the 3 recorded values and dividing by 3.

  • as a rate: The average value of the metric across the time period evaluated. The expected number of values is used to calculate the rate that triggers the alert. For example, if new data is retrieved in the 7th minute of a 10-minute sample and the alert is defined as as a rate, the alert is calculated by summing the 3 recorded values and dividing by 10 (10 x 1-minute samples).

  • in sum: The combined sum of the metric across the time period evaluated.

  • at least once: The trigger value is met for at least one sample in the evaluated period.

  • for the entire time: The trigger value is met for every sample in the evaluated period.

  • as a rate of change: The trigger value is met by the change in value over the evaluated period.

For example, the alert below will fire for each unique segment, identified by host_hostname and kube_cluster_name, that uses more than 75% of the filesystem on average over the last 5 minutes.
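
Expressed as a PromQL-style condition for comparison (a sketch only; the metric name sysdig_fs_used_percent is an assumption for illustration):

# one series, and potentially one alert, per host_hostname and kube_cluster_name segment
avg by (host_hostname, kube_cluster_name) (
  avg_over_time(sysdig_fs_used_percent[5m])
) > 75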

Example: Alert When Data Transfer Over the Threshold

The example below shows an alert that triggers when the average rate of data transferred by a container is over 20 KiB/s for a period of 1 minute.

In the alert Settings, you can configure a link to a Runbook and to a Dashboard to speed up troubleshooting when the alert fires.

When viewing the triggered alert you will be able to quickly access your defined Runbook and Dashboard.

3.4 - Event Alerts

Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and deployments.

Alerts on events support one or more segmentation labels. An alert is generated for each segment.


Defining an Event Alert

Guidelines

  • Count Events That Match: Specify a meaningful filter text to count the number of related events.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: notify if there are more than 10 high-severity events.

  • Event Source: Filter by one or more event sources that should be considered by the alert. Predefined options are included for infrastructure event sources (kubernetes, docker, and containerd), but you can freely specify other values to match custom event sources.

  • Alert if: Specify the trigger condition in terms of the number of events for a given duration.

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

Configure Scope

Filter the environment on which this alert will apply. Use advanced operators to include, exclude, or pattern-match groups, tags, and entities. You can also create alerts directly from Explore and Dashboards for automatically populating this scope.


In this example, failing to schedule a pod in the default namespace triggers an alert.

Configure Trigger

Define the threshold and time window for assessing the alert condition. Single alert fires an alert for your entire scope, while multiple alert fires if any or every segment breaches the threshold at once.

If the number of events triggered in the monitored entity is greater than 5 for the last 10 minutes, recipients will be notified through the selected channel.

3.5 - Advanced Metric Alerts

Advanced metric alerts (multi-condition alerts) are alerts whose thresholds are built from complex conditions. They are created by defining alert thresholds as custom boolean expressions that can involve multiple conditions.

The new Alert Editor does not support creating advanced alerts. However, it gives you an option to open existing advanced alerts and save them as PromQL alerts.

To save an advanced metric alert as a PromQL alert:

  1. Open the advanced alert and click Edit.


The Advanced Metric Alert page will display an option to copy the alert to a PromQL alert.

  2. Adjust the time window as necessary, select one or more notification channels, and configure alert settings. Alternatively, you can do the configuration after copying to the Prometheus alert page.

  3. Click Copy to Prometheus alert.


    The PromQL alert editor page will be displayed.

  4. Click Save.
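
For example, a legacy multi-condition expression such as timeAvg(cpu.used.percent) > 50 AND timeAvg(memory.used.percent) > 75 would typically end up as a PromQL alert along these lines. This is a sketch only; the metric names sysdig_host_cpu_used_percent and sysdig_host_memory_used_percent and the 5-minute window are assumptions, and the actual output of Copy to Prometheus alert may differ.

# both conditions must hold for the same host for the alert to fire
(avg_over_time(sysdig_host_cpu_used_percent[5m]) > 50)
and on (host_hostname)
(avg_over_time(sysdig_host_memory_used_percent[5m]) > 75)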

4 - Alerts Library

To help you get started quickly, Sysdig provides a set of curated alert templates called Alerts Library.

Powered by Monitoring Integrations, Sysdig automatically detects the applications and services running in your environment and recommends alerts that you can enable.

Two types of alert templates are included in Alerts Library:

  • Recommended: Alert suggestions based on the services that are detected running in your infrastructure.

  • All templates: You can browse templates for all the services. For some templates, you might need to configure Monitoring Integrations.

Access Alerts Library

  1. Log in to Sysdig Monitor.

  2. Click Alerts from the left navigation pane.

  3. On the Alerts tab, click Library.

Import an Alert

  1. Locate the service that you want to configure an alert for.

    To do so, either use the text search or pick the service from the list.

  2. For example, click Redis.

    Eight template suggestions are displayed for 14 Redis services running in the environment.

  3. From a list of template suggestions, choose the desired template.

    The Redis page shows the alerts that are already in use and that you can enable.

  4. Enable one or more alert templates. To do so, you can do one of the following:

    • Click Enable Alert.

    • Bulk enable templates. Select the check boxes corresponding to the alert templates and click Enable Alert in the top-right corner.

    • Click the alert template to display the slider, then click Enable Alert on the slider.

  5. On the Configure Redis Alert page, specify the Scope and select the Notification channels.

  6. Click Enable Alert.

    You will see a message stating that the Redis Alert has been successfully created.

Use Alerts Library

In addition to importing an alert, you can also do the following with the Alerts Library:

  • Identify Alert templates associated with the services running in your infrastructure.

  • Bulk import Alert templates. See Import an Alert.

  • View alerts that are already configured.

  • Filter Alert templates. Enter the search string to display the matching results.

  • Discover the workloads where a service is running. To do so, click on the Alert template to display the slider. On the slider, click Workloads.

  • View the alerts in use. To do so, click on an Alert template to display the slider. On the slider, click Alerts in use.

  • Configure an alert.

    Additional alert configuration, such as changing the alert name, description, and severity can be done after the import.

5 - Silence Alert Notifications

Sysdig Monitor allows you to silence alerts for a given scope for a predefined amount of time. When silenced, alerts will still be triggered but will not send any notifications. You can schedule silencing in advance. This helps administrators to temporarily mute notifications during planned downtime or maintenance and send downtime notifications to selected channels.

With an active silence, the only notifications you will receive are those indicating the start time and the end time of the silence. All other notifications for events from that scope will be silenced. When a silence is active, creating an alert triggers the alert but no notification will be sent. Additionally, a triggering event will be generated stating that the alert is silenced.

See Working with Alert APIs for programmatically silencing alert notifications.

Configure a Silence

When you create a new silence, it is by default enabled and scheduled. When the start time arrives for a scheduled silence, it becomes active and the list shows the time remaining. When the end time arrives, the silence becomes completed and cannot be enabled again.

To configure a silence:

  1. Click Alerts on the left navigation on the Monitor UI.

  2. Click the Silence tab.

    The page shows the list of all the existing silences.

  3. Click Set a Silence.

    The Silence for Scope window is displayed.

  4. Specify the following:

    • Scope: Specify the entity to which the silence applies, for example, a particular workload or namespace from environments that may include thousands of entities.

    • Begins: Specify one of the following: Today, Tomorrow, Pick Another Day. Select the time from the drop-down.

    • Duration: Specify how long notifications should be suppressed.

    • Name: Specify a name to identify the silence.

    • Notify: Select a channel you want to notify about the silence.

  5. Click Save.

Silence Alert Notifications from Event Feed

You can also create and edit silences and view silenced alert events on the Events feeds across the Monitor UI. When you create a silence, the alert will still be triggered and posted on the Events feed and in the graph overlays but will indicate that the alert has been silenced.

If you have an alert with no notification channel configured, events generated from that alert won’t be marked as silenced. They also won’t be visually represented in the events feed with the crossed-bell icon and the option to silence events.

To do so,

  1. On the event feed, select the alert event that you want to silence.

  2. On the event details slider, click Take Action.

  3. Click Create Silence from Event.

    The Silence for Scope window is displayed.

  4. Continue configuring the silence as described in step 4 of Configure a Silence.

Manage Silences

Silences can be managed individually, or as a group, by using the checkboxes on the left side of the Silence UI and the customization bar. Select a group of silences and perform batch delete operations. Select individual silences to perform tasks such as enabling, disabling, duplicating, and editing.

Change States

You can enable or disable a silence by sliding the state bar next to it. Two kinds of silences show as enabled: an active silence (a running silence whose start date is in the past and whose end date is yet to come) and a scheduled silence (which will start in the future). A clock icon visually represents an active silence.

A completed silence cannot be re-enabled once its silenced period is finished. However, you can duplicate it with all of its data; you only need to set a new silencing period.

A silence can be disabled only when:

  • The silence has not yet started.

  • The silence is in progress.

Filter Silences

Use the search bar to filter silences. You can either perform a simple auto-complete text search or use the categories. The feed can be filtered by the following categories: Active, Scheduled, Completed.

For example, the following shows the completed silences that start with “cl”.

Duplicate a Silence

Do one of the following to duplicate a silence:

  • Hover over the row and click the Duplicate button on the menu.

  • Click the row to open the Silence for Scope window. In the window, make any necessary changes and click Duplicate.

Edit Silence

You can edit scheduled silences. For the active ones, you can only extend the time. You cannot edit completed silences.

To edit a silence, do one of the following:

  • Click the row to open the Silence for Scope window. Make the necessary changes and click Update.

  • Hover over the row and click the Edit button on the menu. The Silence for Scope window will be displayed.

    Make the necessary changes and click Update.

Extend the Time Duration

For the active silences, you can extend the duration to one of the following:

  • 1 Hour

  • 2 Hours

  • 6 Hours

  • 12 Hours

  • 24 Hours

To do so, click the extend time duration button on the menu and choose the duration. You can also extend the time of an active silence from the Silence for Scope window.

Extending the time duration will notify the configured notification channels that the downtime is extended. You can also extend the time from a Slack notification of a silence by clicking the given link. It opens the Silence for Scope window of the running silence where you can make necessary adjustments.

You cannot extend the duration of completed silences.

6 - Legacy Alerts Editor

If you do not have the new Sysdig metric store enabled, you will not be able to use the latest Alert Editor features. You will continue to use the legacy Alerts Editor to create and edit alert notifications.

Alert Types

The types of alerts available in Sysdig Monitor:

  • Downtime: Monitor any type of entity, such as a host, a container, or a process, and alert when the entity goes down.

  • Metric: Monitor time-series metrics, and alert if they violate user-defined thresholds.

  • PromQL: Monitor metrics through a PromQL query.

  • Event: Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and unauthorized access.

  • Anomaly Detection: Monitor hosts based on their historical behaviors, and alert when they deviate from the expected pattern.

  • Group Outlier: Monitor a group of hosts and be notified when one acts differently from the rest. Group Outlier Alert is supported only on hosts.

Alert Tools

The following tools help with alert creation:

  • Alert Library: Sysdig Monitor provides a set of alerts by default. Use them as they are or as templates to create your own.

  • Sysdig API: Use Sysdig’s Python client to create, list, delete, update and restore alerts. See examples.

Guidelines for Creating Alerts

  • Decide what to monitor: Determine what type of problem you want to be alerted on. See Alert Types to choose a type of problem.

  • Define how it will be monitored: Specify exactly what behavior triggers a violation. For example, the Marathon App is down on the Kubernetes cluster named Production for ten minutes.

  • Decide where to monitor: Narrow down your environment to receive fine-tuned results. Use Scope to choose an entity that you want to keep a close watch on. Specify additional segments (entities) to give context to the problem. For example, in addition to specifying a Kubernetes cluster, add a namespace and deployment to refine your scope.

  • Define when to notify: Define the threshold and time window for assessing the alert condition. Single Alert fires an alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once. Multiple alerts include all the segments you specified to uniquely identify the location, and thus provide a full qualification of where the problem occurred; the higher the number of segments, the easier it is to uniquely identify the affected entities. A good analogy for multiple alerts is alerting on cities: for example, a multiple alert on San Francisco would include contextual information such as the country it belongs to (USA) and the continent (North America). Trigger gives you control over how notifications are created; for example, you may want a notification for every violation, or only a single notification for a series of consecutive violations.

  • Decide how notifications are sent: Alerts support customizable notification channels, including email, mobile push notifications, OpsGenie, Slack, and more. To see supported services, see Set Up Notification Channels.

To create alerts, simply:

  1. Choose an alert type.

  2. Configure alert parameters.

  3. Configure the notification channels you want to use for alert notification.

Sysdig sometimes deprecates outdated metrics. Alerts that use these metrics will not be modified or disabled, but will no longer be updated. See Deprecated Metrics and Labels.

Configure Alerts

Use the Alert wizard to create or edit alerts.

Open the Alert Wizard

There are multiple ways to access the Alert wizard:

From Explore

Do one of the following:

  • Select New Alert next to an entity.

  • Click More Options (three dots), and select Create a new alert.

From Dashboards

Click the More Options (three dots) icon for a panel, and select Create Alert.

From Alerts

Do one of the following:

  • Click Add Alerts.

  • Select an existing alert and click Edit.

From Overview

From the Events panel on the Overview screen, select a custom or an Infrastructure type event. From the event description screen, click Create Alert from Event.

Create an Alert

Configure notification channels before you begin, so the channels are available to assign to the alert. Optionally, you can add a custom subject and body information into individual alert notifications.

Enter Basic Alert Information

Configuration differs slightly for each alert type. See the respective pages to learn more. This section covers general instructions to help you get acquainted with and navigate the Alerts user interface.

To configure an alert, open the Alert wizard and set the following parameters:

  • Create the alert:

    • Type: Select the desired Alert Types.

      Each type has different parameters, but they follow the same pattern:

      • Name: Specify a meaningful name that can uniquely represent the Alert that you are creating. For example, the entity that an alert targets, such as Production Cluster Failed Scheduling pods.

      • Group (optional): Specify a meaningful group name for the alert you are creating. The group name helps you narrow down the problem area and focus on the infrastructure view that needs your attention. For example, you can enter Redis for alerts related to Redis services. When the alert triggers, you will know which service in your workload requires inspection. Alerts that have no group name will be added to the Default Group. The group name is editable; edit the alert to change it.

        An alert can belong to only one group. An alert created from an alert template will have the group already configured by the Monitor Integrations. You can see the existing alert groups on the Alerts details page.

        See Groupings for more information on how Sysdig handles infrastructure views.

      • Description (optional): Briefly expand on the alert name or alert condition to give additional context for the recipient.

      • Priority: Select a priority: High, Medium, Low, or Info. You can later sort by severity using the top navigation pane.

      • Specify the parameters in the Define, Notify, and Act sections.

  • (1) Define:

    Based on the alert type, define the parameters.

    • Downtime: Select the entity to monitor. For more information, see Downtime Alert.

    • Metric: Select a metric that this alert will monitor. You also define how the data is aggregated, such as average, maximum, minimum, or sum. Metrics are applied to a group of items (group aggregation). For more information, see Metric Alerts.

    • PromQL: Enter the PromQL query and duration. For more information, see PromQL Alerts.

    • Event: Filter the custom event to be alerted on by using the name, tag, description, and one or more event sources. For more information, see Event Alerts.

    • Anomaly Detection: Specify the metrics to be monitored for anomalies. For more information, see Anomaly Detection Alerts.

    • Group Outlier: Specify the metrics to be monitored for outliers. For more information, see Group Outlier Alerts.

To alert on multiple metrics using boolean logic, click Create multi-condition alerts. See Multi-Condition Alerts.

  • Scope: Everywhere, or a more limited scope to filter a specific component of the infrastructure monitored, such as a Kubernetes deployment, a Sysdig Agent, or a specific service.

  • Trigger: Boundaries for assessing the alert condition, and whether to send a single alert or multiple alerts. Supported time scales are minute, hour, or day.

    • Single alert: Single Alert fires an alert for your entire scope.

    • Multiple alerts: Multiple Alert fires if any or every segment breaches the threshold at once.

      Multiple alerts are triggered for each segment you specify. The specified segments will be represented in the alerts. The higher the number of segments, the easier it is to uniquely identify the affected entities.

For a detailed description, see the respective sections under Alert Types.

  • (2) Notify

    • Notification Channel: Select from the configured notification channels in the list. Supported channels are:

      • Email

      • Slack

      • Amazon SNS Topic

      • Opsgenie

      • Pagerduty

      • VictorOps

      • Webhook

      You can view the list of notification channels configured for each alert on the Alerts page.

    • Notification Options: Set the time interval at which multiple alerts should be sent.

    • Format Message: If applicable, add message format details. See Customize Notifications.

  • (3) Act

    • (Optional): Configure a Sysdig capture. See also Captures.

      Sysdig capture files are not available for Event Alerts.

  • Click Create.

Optional: Customize Notifications

You can optionally customize individual notifications to provide context for the errors that triggered the alert. All the notification channels support this added contextual information and customization flexibility.

Modify the subject, body, or both of the alert notification with the following:

  • Plaintext: A custom message stating the problem. For example, Stalled Deployment.

  • Hyperlink: For example, URL to a Dashboard.

  • Dynamic Variable: For example, a hostname. Note the conventions:

    • All variables that you insert must be enclosed in double curly braces, such as {{file_mount}}.

    • Variables are case sensitive.

    • The variables should correspond to the segment values you created the alert for. For example, if an alert is segmented by host.hostName and container.name, the corresponding variables will be {{host.hostName}} and {{container.name}} respectively. In addition to these segment variables, __alert_name__ and __alert_status__ are supported. No other segment variables are allowed in the notification subject and body.

    • Notification subjects will not show up on the Event feed.

    • Using a variable that is not a part of the segment will trigger an error.

    • The segment variables used in an alert are resolved to the current system values when the alert notification is sent.

The body of the notification message contains a Default Alert Template. It is the default alert notification generated by Sysdig Monitor. You may add free text, variables, or hyperlinks before and after the template.

You can send a customized alert notification to the following channels:

  • Email

  • Slack

  • Amazon SNS Topic

  • Opsgenie

  • Pagerduty

  • VictorOps

  • Webhook

Multi-Condition Alerts

Multi-condition alerts are advanced alerts whose thresholds are built from complex conditions. To create one, you define alert thresholds as custom boolean expressions that can involve multiple conditions. Click Create multi-condition alerts to enable adding conditions as boolean expressions.

These advanced alerts require specific syntax, as described in the examples below.

Format and Operations

Each condition has five parts:

  • Metric Name: Use the exact metric names. To avoid typos, click the HELP link to access the drop-down list of available metrics. Selecting a metric from the list will automatically add the name to the threshold expression being edited.

  • Group Aggregation (optional): If no group aggregation type is selected, the appropriate default for the metric will be applied (either sum or average). Group aggregation functions must be applied outside of time aggregation functions.

  • Time aggregation: The historical data rolled up over a selected period of time.

  • Operator: Both logical and relational operators are supported.

  • Value: A static numerical value against which a condition is evaluated.

The table below displays supported time aggregation functions, group aggregation functions, and relational operators:

Time Aggregation Function | Group Aggregation Function | Relational Operator
timeAvg()                 | avg()                      | =
min()                     | min()                      | <
max()                     | max()                      | >
sum()                     | sum()                      | <=
                          |                            | >=
                          |                            | !=

The format is:

condition1 AND condition2
condition1 OR condition2
NOT condition1

The order of operations can also be altered via parenthesis:

NOT (condition1 AND (condition2 OR condition3))

Conditions take the following form:

groupAggregation(timeAggregation(metric.name)) operator value

Example Expressions

Several examples of advanced alerts are given below:

timeAvg(cpu.used.percent) > 50 AND timeAvg(memory.used.percent) > 75
timeAvg(cpu.used.percent) > 50 OR timeAvg(memory.used.percent) > 75
timeAvg(container.count) != 10
min(min(cpu.used.percent)) <= 30 OR max(max(cpu.used.percent)) >= 60
sum(file.bytes.total) > 0 OR sum(net.bytes.total) > 0
timeAvg(cpu.used.percent) > 50 AND (timeAvg(mysql.net.connections) > 20 OR timeAvg(memory.used.percent) > 75)

6.1 - Legacy Downtime Alert

Sysdig Monitor continuously surveils any type of entity in your infrastructure, such as a host, a container, a process, or a service, and sends notifications when the monitored entity is not available or responding. Downtime alert focuses mainly on unscheduled downtime of your infrastructure.

In this example, a Kubernetes cluster is monitored and the alert is segmented on both cluster and namespace. When a Kubernetes cluster in the selected availability zone goes down, notifications will be sent with necessary information on both cluster and affected namespace.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend showing which segment (or combination of segments if there is more than one) each line represents. You can also deselect segment lines to prevent them from showing in the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart. For downtime alerts, the segments are what you select for the “Select entity to monitor” option.

Define a Downtime Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort by severity. You can use severity as a criterion when creating alerts, for example: notify if there are more than 10 high-severity events.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

Specify Entity

  1. Select an entity whose downtime you want to monitor for.

    In this example, you are monitoring the unscheduled downtime of a host.

  2. Specify additional segments:

    The specified entities are segmented on and notified with the default notification template as well as on the Preview. In this example, data is segmented on Kubernetes cluster name and namespace name. When a cluster is affected, the notification will not only include the affected cluster details but also the associated namespaces.

Configure Scope

Filter the environment on which this alert will apply. An alert will fire when a host goes down in the availability zone, us-east-1b.

Use the in or contain operators to match multiple possible values when applying scope.

The contain and not contain operators help you retrieve values when you know only part of the value. For example, us retrieves values that contain the string “us”, such as “us-east-1b”, “us-west-2b”, and so on.

The in and not in operators help you filter multiple values.
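
In PromQL terms, these operators behave roughly like label matchers. The following sketch is only an analogy to illustrate the matching behavior; the label name availability_zone is illustrative and is not an actual Sysdig scope label:

{availability_zone=~".*us.*"}                   # contains "us"
{availability_zone!~".*us.*"}                   # does not contain "us"
{availability_zone=~"us-east-1b|us-west-2b"}    # in ("us-east-1b", "us-west-2b")
{availability_zone!~"us-east-1b|us-west-2b"}    # not in ("us-east-1b", "us-west-2b")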

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

Configure Trigger

Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not responding for the last 10 minutes, recipients will be notified.

You can set any value for % and a value greater than 1 for the time window. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity is down for 5 minutes within the selected time window of 10 minutes.
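
To make the percentage arithmetic concrete, here is a minimal PromQL-style sketch of the same idea, assuming a Prometheus-style up metric that is 1 while the entity responds and 0 while it is down; it is an illustration only, not the expression the Downtime alert uses internally:

avg_over_time(up[10m]) < 0.5    # down for more than 5 of the last 10 minutes (50% threshold)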

Use Cases

  • Your e-commerce website is down during the peak hours of Black Friday, Christmas, or New Year season.

  • Production servers of your data center experience a critical outage.

  • MySQL database is unreachable.

  • File upload does not work on your marketing website.

6.2 - Legacy PromQL Alerts

Sysdig Monitor enables you to use PromQL to define metric expressions that you can alert on. You define the alert conditions using the PromQL-based metric expression. This way, you can combine different metrics and warn on cases like service-level agreement breach, running out of disk space in a day, and so on.

Examples

For PromQL alerts, you can use any metric that is available in PromQL, including Sysdig native metrics. For more details see the various integrations available on promcat.io.

Low Disk Space Alert

Warn if disk space is predicted to fall below a specified quantity. For example, alert when free disk space is predicted to drop below 10 GB within the next 24 hours:

predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000

Slow Etcd Requests

Notify if etcd requests are slow. This example uses the promcat.io integration.

histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15

High Heap Usage

Warn when the heap usage in ElasticSearch is more than 80%. This example uses the promcat.io integration.

(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80

Guidelines

Sysdig Monitor does not currently support the following:

  • Interacting with the Prometheus Alertmanager or importing Alertmanager configuration.

  • Using, copying, pasting, or importing predefined alert rules.

  • Converting alert rules to map to the Sysdig alert editor.

Create a PromQL Alert

Set a meaningful name and description that help recipients easily identify the alert.

Set a Priority

Select a priority for the alert that you are creating. The supported priorities are High, Medium, Low, and Info. You can also view events in the Dashboards and Explore UI and sort them by severity.

Define a PromQL Alert

PromQL: Enter a valid PromQL expression. The query will be executed every minute. However, the alert will be triggered only if the query returns data for the specified duration.

In this example, you will be alerted when the rate of HTTP requests has doubled over the last 5 minutes.
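
The exact query is not shown here, but an expression of that shape could look like the following sketch, where http_requests_total is a placeholder metric name rather than one guaranteed to exist in your environment:

sum(rate(http_requests_total[5m])) > 2 * sum(rate(http_requests_total[5m] offset 5m))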

Duration: Specify the time window for evaluating the alert condition, in minutes, hours, or days. The alert will be triggered only if the query returns data for the specified duration.

Define Notification

Notification Channels: Select from the configured notification channels in the list.

Re-notification Options: Set the time interval at which multiple alerts should be sent if the problem remains unresolved.

Notification Message & Events: Enter a subject and body. Optionally, you can choose an existing template for the body. Modify the subject, body, or both for the alert notification with a hyperlink, plain text, or dynamic variables.

Import Prometheus Alert Rules

Sysdig Alert allows you to import Prometheus rules or create new rules on the fly and add them to the existing list of alerts. Click the Upload Prometheus Rules option and enter the rules as YAML in the Upload Prometheus Rules YAML editor. Importing your Prometheus alert rules will convert them to PromQL-based Sysdig alerts. Ensure that the alert rules are valid YAML.

You can upload one or more alert rules in a single YAML and create multiple alerts simultaneously.

Once the rules are imported to Sysdig Monitor, the alert list will be automatically sorted by last modified date.

Besides the pre-populated template, each rule specified in the Upload Prometheus Rules YAML editor requires the following fields:

  • alert

  • expr

  • for

See the following examples to understand the format of Prometheus Rules YAML. Ensure that the alert rules are valid YAML to pass validation.

Example: Alert Prometheus Crash Looping

To alert on potential Prometheus crash looping, create a rule that fires when Prometheus restarts more than twice in the last 10 minutes.

groups:
- name: crashlooping
  rules:
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n

Example: Alert HTTP Error Rate

To alert on HTTP requests with status 5xx (> 5%) or on high latency:

groups:
- name: default
  rules:
  # Paste your rules here
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: Too many HTTP requests with status 5xx
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: Nginx p99 latency is higher than 3 seconds

6.3 - Legacy Metric Alerts

Sysdig Monitor keeps a watch on time-series metrics and alerts if they violate user-defined thresholds.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend showing which segment (or combination of segments, if there is more than one) each line represents. You can deselect segment lines to prevent them from showing in the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart.

Defining a Metric Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes Cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

Specify Metrics

Select a metric that this alert will monitor. You can also define how data is aggregated, such as avg, max, min, or sum. To alert on multiple metrics using boolean logic, switch to a multi-condition alert.
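
For example, a multi-condition expression follows the groupAggregation(timeAggregation(metric.name)) operator value format described earlier; the thresholds below are illustrative:

avg(timeAvg(cpu.used.percent)) > 80 AND avg(timeAvg(memory.used.percent)) > 75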

Configure Scope

Filter the environment on which this alert will apply. For example, you can scope the alert to the availability zone us-east-1b.

Use advanced operators to include, exclude, or pattern-match groups, tags, and entities. See Multi-Condition Alerts.

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires a single alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

Metric alerts can be triggered to notify you of different aggregations:

  • on average: The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if only 3 of the 10 one-minute samples in a 10-minute window contain data, the alert value is the sum of the 3 recorded values divided by 3.

  • as a rate: The average value of the metric across the time period evaluated. The expected number of samples is used to calculate the rate that triggers the alert. For example, if only 3 of the 10 one-minute samples in a 10-minute window contain data, the alert value is the sum of the 3 recorded values divided by 10 (10 x 1-minute samples).

  • in sum: The combined sum of the metric across the time period evaluated.

  • at least once: The trigger value is met for at least one sample in the evaluated period.

  • for the entire time: The trigger value is met for every sample in the evaluated period.

  • as a rate of change: The trigger value is met by the change in value over the evaluated period.

For example, if the file system used percentage goes above 75 on average over the last 5 minutes, multiple alerts will be triggered. The MAC address of the host and the mount directory of the file system will be included in the alert notification.
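
For reference, a roughly equivalent condition can be written as a PromQL alert. This is a sketch only; the Prometheus-notation metric name sysdig_fs_used_percent and its labels are assumptions and may differ in your environment:

avg_over_time(sysdig_fs_used_percent{fstype!~"tmpfs"}[5m]) > 75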

Use Cases

  • Number of processes running on a host is not normal

  • Root volume disk usage in a container is high

6.4 - Legacy Event Alerts

Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and deployments.

Alerts on events support only one segmentation label. An alert is generated for each segment.

Defining an Event Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort alerts by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high severity events, notify.

  • Event Source: Filter by one or more event sources that should be considered by the alert. Predefined options are included for infrastructure event sources (kubernetes, docker, and containerd), but you can freely specify other values to match custom event sources.

  • Trigger: Specify the trigger condition in terms of the number of events for a given duration.

    Event alerts support only one segmentation label. If you choose Multiple Alerts, Sysdig generates only one alert for each selected segment.

Specify Event

  1. Specify the name, tag, or description of an event.

  2. Specify one or more Event Sources.

Configure Scope

Filter the environment on which this alert will apply. Use advanced operators to include, exclude, or pattern-match groups, tags, and entities. You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

In this example, failing a liveness probe in the agent-process-whitelist-cluster cluster triggers an alert.

Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires a single alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

If the number of events triggered in the monitored entity is greater than 5 for the last 10 minutes, recipients will be notified through the selected channel.

6.5 - Legacy Anomaly Detection Alerts

An anomaly is an outlier in a given data set polled from an environment: a deviation from an established pattern. Anomaly detection is about identifying these anomalous observations. Anomalies can appear in a set of data points collectively, in a single data instance, or as context-specific abnormalities. Examples include unauthorized copying of a directory from a container, high CPU or memory consumption, and so on.

Define an Anomaly Detection Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort alerts by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high severity events, notify.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes Cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

Specify Entity

Select one or more metrics whose behavior you want to monitor.

Configure Scope

Filter the environment on which this alert will apply. An alert will fire when the value returned by one of the selected metrics does not follow the expected pattern in the availability zone us-east-1b.

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

Configure Trigger

Trigger gives you control over how notifications are created and helps prevent flooding your notification channel. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations.

Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not responding for the last 5 minutes, recipients will be notified.

You can set any value for % and a value greater than 1 for the time window. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity is down for 2.5 minutes within the selected time window of 5 minutes.

6.6 - Legacy Group Outlier Alerts

Sysdig Monitor observes a group of hosts and notifies you when one acts differently from the rest.

Define a Group Outlier Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The priority levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort alerts by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high severity events, notify.

Specify Entity

Select one or more metrics whose behavior you want to monitor.

Configure Scope

Filter the environment on which this alert will apply. An alert will fire when the value returned by one of the selected metrics does not follow the expected pattern in the availability zone us-east-1b.

You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

Configure Trigger

Trigger gives you control over how notifications are created and helps prevent flooding your notification channel. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations.

Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not responding for the last 5 minutes, recipients will be notified.

You can set any value for % and a value greater than 1 for the time window. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity is down for 2.5 minutes within the selected time window of 5 minutes.

Use Cases

  • Load balancer servers have uneven workloads.

  • Changes in applications or instances deployed in different availability zones.

  • Network-hogging hosts in a cluster.