                    Alerts

                     Alerts are the responsive component of Sysdig Monitor. They notify you when an event or issue that requires attention occurs. Events and issues are identified based on changes in the metric values collected by Sysdig Monitor. The Alerts module displays out-of-the-box alerts and provides a wizard for creating and editing alerts as needed.

                    About Sysdig Alert

                     Sysdig Monitor can generate notifications based on conditions or events you configure. Using the alert feature, you can keep tabs on your infrastructure and find out about problems as they happen, or even before they happen, with the alert conditions you define. In Sysdig Monitor, alerts serve as the central configuration artifact for notifications. An alert ties one or more conditions or events to the actions to take when a condition is met or an event happens. Alerts work across Sysdig modules including Explore, Dashboard, Events, and Overview.

                    Alert Types

                    The types of alerts available in Sysdig Monitor:

                    • Downtime: Monitor any type of entity, such as a host, a container, or a process, and alert when the entity goes down.

                    • Metric: Monitor time-series metrics, and alert if they violate user-defined thresholds.

                    • PromQL: Monitor metrics through a PromQL query.

                    • Event: Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and unauthorized access.

                    • Anomaly Detection: Monitor hosts based on their historical behaviors, and alert when they deviate from the expected pattern.

                    • Group Outlier: Monitor a group of hosts and be notified when one acts differently from the rest. Group Outlier Alert is supported only on hosts.

                    • Alert Library: Sysdig Monitor provides a set of alerts by default. Use it as it is or as a template to create your own.

                    • Sysdig API: Use Sysdig’s Python client to create, list, delete, update and restore alerts. See examples.
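                     As an illustration of the API route, the sketch below uses Sysdig's open-source Python client (the `sdcclient` package). The method name and parameters follow the python-sdc-client examples, but treat the exact signature as an assumption rather than a definitive reference:

```python
# Hedged sketch: creating a metric alert with Sysdig's Python client
# (sdcclient). Assumes `pip install sdcclient` and an API token in the
# SDC_TOKEN environment variable.
import os


def create_cpu_alert(client, scope='kubernetes.cluster.name = "production"'):
    """Create an alert that fires when average CPU stays above 80% for 10 min."""
    return client.create_alert(
        name="High CPU",
        description="Average CPU above 80% for 10 minutes",
        severity=3,                      # priority bucket; lower is more severe
        for_atleast_s=600,               # condition must hold for 10 minutes
        condition="avg(cpu.used.percent) > 80",
        segmentby=["host.hostName"],     # fire one alert per host
        user_filter=scope,               # alert scope
    )


if os.environ.get("SDC_TOKEN"):          # only run against a real backend
    from sdcclient import SdMonitorClient
    ok, res = create_cpu_alert(SdMonitorClient(os.environ["SDC_TOKEN"]))
    print("created" if ok else res)
```

The client methods return an `(ok, result)` pair, so callers can check `ok` before relying on the response body.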

                    Guidelines for Creating Alerts

                     Decide what to monitor

                    Determine what type of problem you want to be alerted on. See Alert Types to choose a type of problem.

                    Define how it will be monitored

                    Specify exactly what behavior triggers a violation. For example, Marathon App is down on the Kubernetes Cluster named Production for ten minutes.

                     Decide where to monitor

                    Narrow down your environment to receive fine-tuned results. Use Scope to choose an entity that you want to keep a close watch on. Specify additional segments (entities) to give context to the problem. For example, in addition to specifying a Kubernetes cluster, add a namespace and deployment to refine your scope.

                    Define when to notify

                    Define the threshold and time window for assessing the alert condition.

                     Single Alert fires one alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

                     Multiple Alerts include all the segments you specified to uniquely identify the location, providing a full qualification of where the problem occurred. The more segments you specify, the easier it is to uniquely identify the affected entities.

                     A good analogy for multiple alerts is alerting on cities. For example, a multiple alert triggered on San Francisco would include contextual information such as the country it belongs to (the USA) and the continent (North America).

                    Trigger gives you control over how notifications are created. For example, you may want to receive a notification for every violation, or want only a single notification for a series of consecutive violations.

                    Decide how notifications are sent

                    Alert supports customizable notification channels, including email, mobile push notifications, OpsGenie, Slack, and more. To see supported services, see Set Up Notification Channels.

                     To create an alert:

                    1. Choose an alert type.

                    2. Configure alert parameters.

                    3. Configure the notification channels you want to use for alert notification.

                    Sysdig sometimes deprecates outdated metrics. Alerts that use these metrics will not be modified or disabled, but will no longer be updated. See Heuristic and Deprecated Metrics.

                    Configure Alerts

                    Use the Alert wizard to create or edit alerts.

                    Open the Alert Wizard

                    There are multiple ways to access the Alert wizard:

                    From Explore

                    Do one of the following:

                     • Select New Alert beside an entity.

                    • Click More Options (three dots), and select Create a new alert.

                    From Dashboards

                    Click the More Options (three dots) icon for a panel, and select Create Alert.

                    From Alerts

                    Do one of the following:

                    • Click Add Alerts.

                    • Select an existing alert and click Edit.

                    From Overview

                    From the Events panel on the Overview screen, select a custom or an Infrastructure type event. From the event description screen, click Create Alert from Event.

                    Create an Alert

                    Configure notification channels before you begin, so the channels are available to assign to the alert. Optionally, you can add a custom subject and body information into individual alert notifications.

                    Enter Basic Alert Information

                     Configuration differs slightly for each alert type; see the respective pages to learn more. This section covers general instructions to help you get acquainted with and navigate the Alerts user interface.

                    To configure an alert, open the Alert wizard and set the following parameters:

                    • Create the alert:

                       • Type: Select the desired alert type. See Alert Types.

                        Each type has different parameters, but they follow the same pattern:

                        • Name: Specify a meaningful name that can uniquely represent the Alert that you are creating. For example, the entity that an alert targets, such as Production Cluster Failed Scheduling pods.

                         • Group (optional): Specify a meaningful group name for the alert you are creating. The group name helps you narrow down the problem area and focus on the infrastructure view that needs your attention. For example, you can enter Redis for alerts related to Redis services; when the alert triggers, you will know which service in your workload requires inspection. Alerts that have no group name are added to the Default Group. The group name is editable; edit the alert to change it.

                          An alert can belong to only one group. An alert created from an alert template will have the group already configured by the Monitor Integrations. You can see the existing alert groups on the Alerts details page.

                          See Groupings for more information on how Sysdig handles infrastructure views.

                        • Description (optional): Briefly expand on the alert name or alert condition to give additional context for the recipient.

                         • Priority: Select a priority: High, Medium, Low, or Info. You can later sort alerts by severity using the top navigation pane.

                        • Specify the parameters in the Define, Notify, and Act sections.

                    • Define:

                      Based on the alert type, define the parameters.

                      • Downtime: Select the entity to monitor. For more information, see Downtime Alert.

                      • Metric: Select a metric that this alert will monitor. You also define how the data is aggregated, such as average, maximum, minimum, or sum. Metrics are applied to a group of items (group aggregation). For more information, see Metric Alerts.

                      • PromQL: Enter the PromQL query and duration. For more information, see PromQL Alerts.

                       • Event: Filter the custom event to be alerted on by using the name, tag, description, and a source tag. For more information, see Event Alerts.

                      • Anomaly Detection: Specify the metrics to be monitored for anomalies. For more information, see Anomaly Detection Alerts.

                      • Group Outlier: Specify the metrics to be monitored for outliers. For more information, see Group Outlier Alerts.

                    To alert on multiple metrics using boolean logic, click Create multi-condition alerts. See Multi-Condition Alerts.

                     • Scope: Everywhere, or a more limited scope to filter on a specific component of the monitored infrastructure, such as a Kubernetes deployment, a Sysdig Agent, or a specific service.

                    • Trigger: Boundaries for assessing the alert condition, and whether to send a single alert or multiple alerts. Supported time scales are minute, hour, or day.

                      • Single alert: Single Alert fires an alert for your entire scope.

                      • Multiple alerts: Multiple Alert fires if any or every segment breaches the threshold at once.

                         Multiple alerts are triggered for each segment you specify, and the specified segments are represented in the alerts. The more segments you specify, the easier it is to uniquely identify the affected entities.

                    For detailed description, see respective sections on Alert Types.
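                     The single-versus-multiple distinction above can be sketched with a small illustrative helper (hypothetical code, not part of any Sysdig SDK):

```python
def evaluate_trigger(values_by_segment, threshold):
    """Illustrate Single Alert vs. Multiple Alerts semantics.

    Single Alert: at most one notification for the entire scope if any
    segment breaches. Multiple Alerts: one notification per breaching
    segment, so each notification pinpoints the affected entity.
    """
    breaching = sorted(seg for seg, val in values_by_segment.items()
                       if val > threshold)
    single = ["entire scope"] if breaching else []
    return single, breaching


# Two namespaces, one above an 80% CPU threshold:
single, multiple = evaluate_trigger({"prod/api": 92.0, "prod/cache": 41.5}, 80)
```

With this input, the single-alert mode produces one notification for the whole scope, while the multiple-alert mode produces one notification naming only the breaching segment.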

                     • Notify:

                      • Notification Channel: Select from the configured notification channels in the list. Supported channels are:

                        • Email

                        • Slack

                        • Amazon SNS Topic

                        • Opsgenie

                        • Pagerduty

                        • VictorOps

                        • Webhook

                        You can view the list of notification channels configured for each alert on the Alerts page.

                      • Notification Options: Set the time interval at which multiple alerts should be sent.

                      • Format Message: If applicable, add message format details. See Customize Notifications.

                     • Act:

                      • (Optional): Configure a Sysdig capture. See also Captures.

                        Sysdig capture files are not available for Event Alerts.

                    • Click Create.

                    Optional: Customize Notifications

                    You can optionally customize individual notifications to provide context for the errors that triggered the alert. All the notification channels support this added contextual information and customization flexibility.

                    Modify the subject, body, or both of the alert notification with the following:

                    • Plaintext: A custom message stating the problem. For example, Stalled Deployment.

                    • Hyperlink: For example, URL to a Dashboard.

                    • Dynamic Variable: For example, a hostname. Note the conventions:

                      • All variables that you insert must be enclosed in double curly braces, such as {{file_mount}}.

                      • Variables are case sensitive.

                       • The variables should correspond to the segment values you created the alert for. For example, if an alert is segmented by host.hostName and container.name, the corresponding variables will be {{host.hostName}} and {{container.name}} respectively. In addition to these segment variables, __alert_name__ and __alert_status__ are supported. No other segment variables are allowed in the notification subject and body.

                      • Notification subjects will not show up on the Event feed.

                      • Using a variable that is not a part of the segment will trigger an error.

                       • The segment variables used in an alert are replaced with the current system values when the alert notification is sent.

                    The body of the notification message contains a Default Alert Template. It is the default alert notification generated by Sysdig Monitor. You may add free text, variables, or hyperlinks before and after the template.
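                     The substitution rules above can be illustrated with a short sketch; `render_notification` is a hypothetical helper, not Sysdig code:

```python
import re


def render_notification(template, segment_values):
    """Replace {{variable}} tokens with segment values.

    Mirrors the documented rules: variables are case sensitive, and using a
    variable that is not a segment of the alert is an error.
    """
    def substitute(match):
        name = match.group(1)
        if name not in segment_values:
            raise KeyError(f"{name} is not a segment of this alert")
        return str(segment_values[name])

    return re.sub(r"\{\{([\w.]+)\}\}", substitute, template)


subject = render_notification(
    "{{__alert_name__}} on {{host.hostName}} / {{container.name}}",
    {"__alert_name__": "High CPU",
     "host.hostName": "node-1",
     "container.name": "redis"},
)
# subject == "High CPU on node-1 / redis"
```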

                    You can send a customized alert notification to the following channels:

                    • Email

                    • Slack

                    • Amazon SNS Topic

                    • Opsgenie

                    • Pagerduty

                    • VictorOps

                    • Webhook

                    Multi-Condition Alerts

                     Multi-condition alerts are advanced alerts whose thresholds are built from complex conditions. You define alert thresholds as custom boolean expressions that can involve multiple conditions. Click Create multi-condition alerts to enable adding conditions as boolean expressions.

                    These advanced alerts require specific syntax, as described in the examples below.

                    Format and Operations

                    Each condition has five parts:

                     • Metric Name: Use the exact metric name. To avoid typos, click the HELP link to access the drop-down list of available metrics. Selecting a metric from the list automatically adds the name to the threshold expression being edited.

                    • Group Aggregation (optional): If no group aggregation type is selected, the appropriate default for the metric will be applied (either sum or average). Group aggregation functions must be applied outside of time aggregation functions.

                     • Time Aggregation: The historical data rolled up over a selected period of time.

                    • Operator: Both logical and relational operators are supported.

                    • Value: A static numerical value against which a condition is evaluated.

                    The table below displays supported time aggregation functions, group aggregation functions, and relational operators:

                     Time Aggregation Function   Group Aggregation Function   Relational Operator
                     timeAvg()                   avg()                        =
                     min()                       min()                        <
                     max()                       max()                        >
                     sum()                       sum()                        <=
                                                                              >=
                                                                              !=

                    The format is:

                    condition1 AND condition2
                    condition1 OR condition2
                    NOT condition1
                    

                    The order of operations can also be altered via parenthesis:

                    NOT (condition1 AND (condition2 OR condition3))
                    

                    Conditions take the following form:

                    groupAggregation(timeAggregation(metric.name)) operator value
                    

                    Example Expressions

                    Several examples of advanced alerts are given below:

                    timeAvg(cpu.used.percent) > 50 AND timeAvg(memory.used.percent) > 75
                    timeAvg(cpu.used.percent) > 50 OR timeAvg(memory.used.percent) > 75
                    timeAvg(container.count) != 10
                    min(min(cpu.used.percent)) <= 30 OR max(max(cpu.used.percent)) >= 60
                    sum(file.bytes.total) > 0 OR sum(net.bytes.total) > 0
                    timeAvg(cpu.used.percent) > 50 AND (timeAvg(mysql.net.connections) > 20 OR timeAvg(memory.used.percent) > 75)
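                     The condition grammar above can be checked mechanically. The validator below is a hypothetical helper written against the documented form, not part of Sysdig:

```python
import re

# Documented building blocks of a single condition:
#   groupAggregation(timeAggregation(metric.name)) operator value
_GROUP = r"(?:avg|min|max|sum)"
_TIME = r"(?:timeAvg|min|max|sum)"
_METRIC = r"[A-Za-z_][\w.]*"
_OP = r"(?:<=|>=|!=|=|<|>)"
_VALUE = r"-?\d+(?:\.\d+)?"

# Full form with explicit group aggregation, and the short form where the
# default group aggregation for the metric is applied.
_FULL = re.compile(rf"^{_GROUP}\({_TIME}\({_METRIC}\)\)\s*{_OP}\s*{_VALUE}$")
_SHORT = re.compile(rf"^{_TIME}\({_METRIC}\)\s*{_OP}\s*{_VALUE}$")


def is_valid_condition(expr: str) -> bool:
    """Return True if expr matches the documented single-condition form."""
    expr = expr.strip()
    return bool(_FULL.match(expr) or _SHORT.match(expr))
```

For example, `timeAvg(cpu.used.percent) > 50` and `min(min(cpu.used.percent)) <= 30` are accepted, while a bare `cpu.used.percent > 50` (no aggregation function) is rejected.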
                    
                    

                     1 - Manage Alerts

                    Alerts can be managed individually, or as a group, by using the checkboxes on the left side of the Alert UI and the customization bar. The columns of the table can also be configured, to provide you with the necessary data for your use cases.

                    Select a group of alerts and perform several batch operations, such as filtering, deleting, enabling, disabling, or exporting to a JSON object. Select individual alerts to perform tasks such as creating a copy for a different team.

                    View Alert Details

                     The bell icon next to an alert indicates that you have not resolved the corresponding events. The Activity Over Last Two Weeks column shows an event chart with the number of events triggered over the last two weeks. The color of the event chart indicates the severity level of the events.

                    To view alert details, click the corresponding alert row. The slider with the alert details will appear. Click an individual event to Take Action. You can do one of the following:

                    • Acknowledge: Mark that the event has been acknowledged by the intended recipient.

                    • Create Silence from Event: If you no longer want to be notified, use this option. You can choose the scope for alert silence. When silenced, alerts will still be triggered but will not send you any notifications.

                    • Explore: Use this option to troubleshoot by using the PromQL Query.

                     The event feed will be empty and the Activity Over Last Two Weeks column will show no event chart if no events were reported in the past two weeks.

                    Enable/Disable Alerts

                    Alerts can be enabled or disabled using the slider or the customization bar. You can perform these operations on a single alert or on multiple alerts as a batch operation.

                    1. From the Alerts module, check the boxes beside the relevant alerts.

                    2. Click Enable Selected or Disable Selected as necessary.

                    Use the slider beside the alert to disable or enable individual alerts.

                    Edit an Existing Alert

                    To edit an existing alert:

                     1. Do one of the following:

                       • Click the Edit button beside the alert.

                       • Click an alert to open the detail view, then click Edit on the top right corner.

                    2. Edit the alert, and click Save to confirm the changes.

                    Copy an Alert

                    Alerts can be copied within the current team to allow for similar alerts to be created quickly, or copied to a different team to share alerts.

                    Copy an Alert to the Current Team

                    To copy an alert within the current team:

                    1. Highlight the alert to be copied.

                      The detail view is displayed.

                    2. Click Copy.

                      The Copy Alert screen is displayed.

                    3. Select Current from the drop-down.

                    4. Click Copy and Open.

                       The alert appears in edit mode.

                    5. Make necessary changes and save the alert.

                    Copy an Alert to a Different Team

                    1. Highlight the alert to be copied.

                      The detail view is displayed.

                    2. Click Copy.

                      The Copy Alert screen is displayed.

                    3. Select the teams that the alert should be copied to.

                    4. Click Send Copy.

                    Search for an Alert

                    Search Using Strings

                     The Alerts table can be searched using partial or full strings. For example, searching for kubernetes displays only the alerts that contain that string.

                    Filter Alerts

                     The alert feed can be filtered in multiple ways to drill down into the environment’s history and refine the alerts displayed. The feed can be filtered by severity or status. For example, you can show only high and medium severity alerts, or only the alerts that are invalid.

                    Export Alerts as JSON

                    A JSON file can be exported to a local machine, containing JSON snippets for each selected alert:

                    1. Click the checkboxes beside the relevant alerts to be exported.

                    2. Click Export JSON.
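                     The same export can be done programmatically with the `sdcclient` Python package; this is a hedged sketch in which `get_alerts` follows the python-sdc-client examples:

```python
# Hedged sketch: export all alerts to a local JSON file using Sysdig's
# Python client (sdcclient). Assumes an API token in SDC_TOKEN.
import json
import os


def export_alerts(client, path="alerts.json"):
    """Fetch all alerts and write them to a local JSON file."""
    ok, res = client.get_alerts()        # returns (success, {"alerts": [...]})
    if not ok:
        raise RuntimeError(res)
    with open(path, "w") as f:
        json.dump(res["alerts"], f, indent=2)
    return len(res["alerts"])


if os.environ.get("SDC_TOKEN"):          # only run against a real backend
    from sdcclient import SdMonitorClient
    count = export_alerts(SdMonitorClient(os.environ["SDC_TOKEN"]))
    print(f"exported {count} alerts")
```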

                    Delete Alerts

                     Open the Alerts page and use one of the following methods to delete alerts:

                    • Hover on a specific alert and click Delete.

                    • Hover on one or more alerts, click the checkbox, then click Delete on the bulk-action toolbar.

                    • Click an alert to see the detailed view, then click Delete on the top right corner.

                     2 - Silence Alert Notifications

                    Sysdig Monitor allows you to silence alerts for a given scope for a predefined amount of time. When silenced, alerts will still be triggered but will not send any notifications. You can schedule silencing in advance. This helps administrators to temporarily mute notifications during planned downtime or maintenance and send downtime notifications to selected channels.

                     With an active silence, the only notifications you receive are those indicating the start and end times of the silence. All other notifications for events from that scope are silenced. When a silence is active, an alert will still trigger but no notification is sent; additionally, a triggering event is generated stating that the alert is silenced.

                    See Working with Alert APIs for programmatically silencing alert notifications.

                    Configure a Silence

                    When you create a new silence, it is by default enabled and scheduled. When the start time arrives for a scheduled silence, it becomes active and the list shows the time remaining. When the end time arrives, the silence becomes completed and cannot be enabled again.
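                     The lifecycle described above (scheduled, then active, then completed) can be sketched as a small illustrative function (hypothetical code, not Sysdig's implementation):

```python
from datetime import datetime


def silence_state(start: datetime, end: datetime, now: datetime) -> str:
    """Classify a silence by the documented lifecycle.

    A silence is scheduled before its start time, active between its start
    and end times (the UI shows the time remaining), and completed after its
    end time, at which point it cannot be enabled again.
    """
    if now < start:
        return "scheduled"
    if now < end:
        return "active"
    return "completed"
```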

                    To configure a silence:

                    1. Click Alerts on the left navigation on the Monitor UI.

                    2. Click the Silence tab.

                      The page shows the list of all the existing silences.

                    3. Click Set a Silence.

                      The Silence for Scope window is displayed.

                     4. Specify the following:

                       • Scope: Specify the entity the silence applies to. For example, a particular workload or namespace, from environments that may include thousands of entities.

                       • Begins: Specify one of the following: Today, Tomorrow, Pick Another Day. Select the time from the drop-down.

                       • Duration: Specify how long notifications should be suppressed.

                       • Name: Specify a name to identify the silence.

                       • Notify: Select a channel you want to notify about the silence.

                     5. Click Save.

                    Silence Alert Notifications from Event Feed

                    You can also create and edit silences and view silenced alert events on the Events feeds across the Monitor UI. When you create a silence, the alert will still be triggered and posted on the Events feed and in the graph overlays but will indicate that the alert has been silenced.

                     If you have an alert with no notification channel configured, events generated from that alert won’t be marked as silenced, nor will they be visually represented in the events feed with the crossed bell icon and the option to silence events.

                    To do so,

                    1. On the event feed, select the alert event that you want to silence.

                    2. On the event details slider, click Take Action.

                    3. Click Create Silence from Event.

                      The Silence for Scope window is displayed.

                     4. Continue configuring the silence as described in Configure a Silence.

                    Manage Silences

                    Silences can be managed individually, or as a group, by using the checkboxes on the left side of the Silence UI and the customization bar. Select a group of silences and perform batch delete operations. Select individual silences to perform tasks such as enabling, disabling, duplicating, and editing.

                    Change States

                     You can enable or disable a silence by sliding the state bar next to it. Two kinds of silences show as enabled: active silences (currently running, with a start date in the past and an end date in the future) and scheduled silences (which will start in the future). A clock icon visually represents an active silence.

                     Completed silences cannot be re-enabled once the silenced period is finished. However, you can duplicate a completed silence with all its data and set a new silencing period.

                    A silence can be disabled only when:

                     • The silence has not yet started.

                    • The silence is in progress.

                    Filter Silences

                    Use the search bar to filter silences. You can either perform a simple auto-complete text search or use the categories. The feed can be filtered by the following categories: Active, Scheduled, Completed.

                     For example, you can filter for completed silences whose names start with “ag”.

                    Duplicate a Silence

                    Do one of the following to duplicate a silence:

                     • Click the Duplicate button that appears when you hover over the row.

                     • Click the row to open the Silence for Scope window. Make any necessary changes and click Duplicate.

                     Edit a Silence

                     You can edit scheduled silences. For active silences, you can only extend the time. You cannot edit completed silences.

                    To edit a silence, do one of the following:

                     • Click the row to open the Silence for Scope window. Make the necessary changes and click Update.

                     • Click the Edit button that appears when you hover over the row. The Silence for Scope window is displayed. Make the necessary changes and click Update.

                    Extend the Time Duration

                     For active silences, you can extend the duration to one of the following:

                     • 1 Hour

                     • 2 Hours

                     • 6 Hours

                     • 12 Hours

                     • 24 Hours

                     To do so, click the extend-duration button on the menu and choose the duration. You can also extend the time of an active silence from the Silence for Scope window.

                     Extending the duration notifies the configured notification channels that the downtime is extended. You can also extend the time from a Slack notification of a silence by clicking the given link, which opens the Silence for Scope window of the running silence where you can make the necessary adjustments.

                    You cannot extend the duration of completed silences.

                     3 - Alerts Library

                     To help you get started quickly, Sysdig provides a set of curated alert templates called Alerts Library. Powered by Monitor Integrations, Sysdig automatically detects the applications and services running in your environment and recommends alerts that you can enable.

                    Two types of alert templates are included in Alerts Library:

                    • Recommended: Alert suggestions based on the services that are detected running in your infrastructure.

                    • All templates: You can browse templates for all the services. For some templates, you might need to configure Monitor Integrations.

                    Access Alerts Library

                    1. Log in to Sysdig Monitor.

                    2. Click Alerts from the left navigation pane.

                     3. On the Alerts tab, click Library.

                    Import an Alert

                    1. Locate the service that you want to configure an alert for.

                       To do so, either use the text search or browse the list of services.

                    2. For example, click Redis.

                       Eight template suggestions are displayed for 14 Redis services running in the environment.

                    3. From a list of template suggestions, choose the desired template.

                      The Redis page shows the alerts that are already in use and that you can enable.

                    4. Enable one or more alert templates. To do so, you can do one of the following:

                      • Click Enable Alert.

                      • Bulk enable templates. Select the check box corresponding to the alert templates and click Enable Alert on the top-right corner.

                       • Click an alert template to display the slider, then click Enable Alert on the slider.

                    5. On the Configure Redis Alert page, specify the Scope and select the Notification channels.

                    6. Click Enable Alert.

                      You will see a message stating that the Redis Alert has been successfully created.

                    Use Alerts Library

                    In addition to importing an alert, you can also do the following with the Alerts Library:

                    • Identify Alert templates associated with the services running in your infrastructure.

                    • Bulk import Alert templates. See Import an Alert.

                    • View alerts that are already configured.

                    • Filter Alert templates. Enter the search string to display the matching results.

                    • Discover the workloads where a service is running. To do so, click on the Alert template to display the slider. On the slider, click Workloads.

                    • View the alerts in use. To do so, click on an Alert template to display the slider. On the slider, click Alerts in use.

                    • Configure an alert.

                      Additional alert configuration, such as changing the alert name, description, and severity can be done after the import.

                     4 - Downtime Alert

                     Sysdig Monitor continuously watches any type of entity in your infrastructure, such as a host, a container, a process, or a service, and sends notifications when the monitored entity is unavailable or not responding. Downtime alerts focus mainly on unscheduled downtime of your infrastructure.

                    In this example, a Kubernetes cluster is monitored and the alert is segmented on both cluster and namespace. When a Kubernetes cluster in the selected availability zone goes down, notifications will be sent with necessary information on both cluster and affected namespace.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend showing which segment (or combination of segments, if there is more than one) each line represents. You can also deselect segment lines to hide them from the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart. For downtime alerts, segments are what you select for the "Select entity to monitor" option.

                    Define a Downtime Alert

                    Guidelines

                    • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

• Severity: Set a severity level for your alert. The priority levels, High, Medium, Low, and Info, are reflected in the Alert list, where you can sort alerts by severity. You can use severity as a criterion when creating alerts, for example: if there are more than 10 high-severity events, notify.

• Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes Cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

                    Specify Entity

1. Select the entity whose downtime you want to monitor.

                      In this example, you are monitoring the unscheduled downtime of a host.

                    2. Specify additional segments:

  The specified segments are included in the default notification template and shown in the Preview. In this example, data is segmented on the Kubernetes cluster name and namespace name. When a cluster is affected, the notification includes not only the affected cluster details but also the associated namespaces.

                    Configure Scope

                    Filter the environment on which this alert will apply. An alert will fire when a host goes down in the availability zone, us-east-1b.

Use the in or contain operators to match multiple possible values when applying scope.

The contain and not contain operators help you match values when you know only part of them. For example, us matches values that contain the string "us", such as "us-east-1b", "us-west-2b", and so on.

                    The in and not in operators help you filter multiple values.
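The semantics of these operators can be sketched in plain Python. This is an illustration only, not Sysdig's implementation; the zone values are examples:

```python
# Illustrative semantics of the scope operators; not Sysdig internals.
zones = ["us-east-1b", "us-west-2b", "eu-west-1a"]

# "contain" matches values that include the substring anywhere:
contain_us = [z for z in zones if "us" in z]      # us-east-1b, us-west-2b

# "in" matches values from an explicit list:
allowed = {"us-east-1b", "eu-west-1a"}
in_allowed = [z for z in zones if z in allowed]   # us-east-1b, eu-west-1a
```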

                    You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

                    Configure Trigger

                    Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

                    If the monitored host or Kubernetes cluster is not available or not responding for the last 10 minutes, recipients will be notified.

You can set any value for the percentage and a time window greater than 1 minute. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity is down for 5 minutes within the selected 10-minute window.
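The percentage logic can be sketched as follows. This is an illustrative model only (Sysdig evaluates the condition server-side), treating the window as one sample per minute:

```python
# Hypothetical sketch of the downtime-percentage trigger, one sample per minute.
def downtime_trigger(down_samples, threshold_pct):
    """down_samples: list of booleans, True = entity was down that minute."""
    down_pct = 100 * sum(down_samples) / len(down_samples)
    return down_pct >= threshold_pct

window = [True] * 5 + [False] * 5   # down for 5 of 10 minutes
downtime_trigger(window, 50)        # fires at a 50% threshold
downtime_trigger(window, 100)       # does not fire at a 100% threshold
```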

                    Use Cases

• Your e-commerce website goes down during the peak hours of the Black Friday, Christmas, or New Year season.

• Production servers in your data center experience a critical outage.

• A MySQL database is unreachable.

• File upload does not work on your marketing website.

5 - PromQL Alerts

Sysdig Monitor enables you to use PromQL to define metric expressions that you can alert on. You define the alert conditions using a PromQL-based metric expression. This way, you can combine different metrics and alert on conditions such as a service-level agreement breach or running out of disk space within a day.

                    Examples

                    For PromQL alerts, you can use any metric that is available in PromQL, including Sysdig native metrics. For more details see the various integrations available on promcat.io.

                    Low Disk Space Alert

Warn if disk space is predicted to run out. For example, alert if free disk space, projected from the last hour's trend, will fall below 10 GB within the next 24 hours:

                    predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000
                    

                    Slow Etcd Requests

                    Notify if etcd requests are slow. This example uses the promcat.io integration.

histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
                    

                    High Heap Usage

Warn when heap usage in Elasticsearch is more than 80%. This example uses the promcat.io integration.

                    (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
                    

                    Guidelines

Sysdig Monitor does not currently support the following:

• Interacting with the Prometheus Alertmanager or importing Alertmanager configuration.

• Using, copying, pasting, or importing predefined alert rules.

• Converting alert rules to map to the Sysdig alert editor.

                    Create a PromQL Alert

                    Set a meaningful name and description that help recipients easily identify the alert.

                    Set a Priority

Select a priority for the alert you are creating. The supported priorities are High, Medium, Low, and Info. You can also view events in the Dashboard and Explore UI and sort them by severity.

                    Define a PromQL Alert

                    PromQL: Enter a valid PromQL expression. The query will be executed every minute. However, the alert will be triggered only if the query returns data for the specified duration.

                    In this example, you will be alerted when the rate of HTTP requests has doubled over the last 5 minutes.
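A PromQL expression for this example might look as follows. This is a sketch; the metric name http_requests_total is an assumption, so substitute your own request counter:

```promql
sum(rate(http_requests_total[5m]))
  / sum(rate(http_requests_total[5m] offset 5m)) > 2
```

The expression compares the request rate over the last 5 minutes with the rate in the 5 minutes before that, and returns data only when the ratio exceeds 2.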

                    Duration: Specify the time window for evaluating the alert condition in minutes, hour, or day. The alert will be triggered if the query returns data for the specified duration.

                    Define Notification

                    Notification Channels: Select from the configured notification channels in the list.

                    Re-notification Options: Set the time interval at which multiple alerts should be sent if the problem remains unresolved.

                    Notification Message & Events: Enter a subject and body. Optionally, you can choose an existing template for the body. Modify the subject, body, or both for the alert notification with a hyperlink, plain text, or dynamic variables.

                    Import Prometheus Alert Rules

                    Sysdig Alert allows you to import Prometheus rules or create new rules on the fly and add them to the existing list of alerts. Click the Upload Prometheus Rules option and enter the rules as YAML in the Upload Prometheus Rules YAML editor. Importing your Prometheus alert rules will convert them to PromQL-based Sysdig alerts. Ensure that the alert rules are valid YAML.

                    You can upload one or more alert rules in a single YAML and create multiple alerts simultaneously.

                    Once the rules are imported to Sysdig Monitor, the alert list will be automatically sorted by last modified date.

                    Besides the pre-populated template, each rule specified in the Upload Prometheus Rules YAML editor requires the following fields:

• alert

• expr

• for

                    See the following examples to understand the format of Prometheus Rules YAML. Ensure that the alert rules are valid YAML to pass validation.

                    Example: Alert Prometheus Crash Looping

To alert on potential Prometheus crash looping, create a rule that fires when Prometheus restarts more than twice in the last 10 minutes.

                    groups:
                    - name: crashlooping
                      rules:
                      - alert: PrometheusTooManyRestarts
                        expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
                        for: 0m
                        labels:
                          severity: warning
                        annotations:
                          summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n
                    

                    Example: Alert HTTP Error Rate

To alert on HTTP requests with status 5xx (more than 5%) or high latency:

                    groups:
                    - name: default
                      rules:
                      # Paste your rules here
                      - alert: NginxHighHttp5xxErrorRate
                        expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
                        for: 1m
                        labels:
                          severity: critical
                        annotations:
                          summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
                          description: Too many HTTP requests with status 5xx
                      - alert: NginxLatencyHigh
                        expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
                        for: 2m
                        labels:
                          severity: warning
                        annotations:
                          summary: Nginx latency high (instance {{ $labels.instance }})
                          description: Nginx p99 latency is higher than 3 seconds
                    


6 - Metric Alerts

Sysdig Monitor watches time-series metrics and alerts you if they violate user-defined thresholds.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend showing which segment (or combination of segments, if there is more than one) each line represents. You can also deselect segment lines to hide them from the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart.

                    Defining a Metric Alert

                    Guidelines

• Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

• Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes Cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

                    Specify Metrics

Select the metric that this alert will monitor. You can also define how data is aggregated, such as avg, max, min, or sum. To alert on multiple metrics using boolean logic, switch to a multi-condition alert.

                    Configure Scope

Filter the environment on which this alert will apply.

                    Use advanced operators to include, exclude, or pattern-match groups, tags, and entities. See Multi-Condition Alerts.

                    You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

                    Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires one alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

                    Metric alerts can be triggered to notify you of different aggregations:

• on average: The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if only 3 values are recorded during a 10-minute window and the alert is defined as on average, the alert value is calculated by summing the 3 recorded values and dividing by 3.

• as a rate: The average value of the metric across the time period evaluated. The expected number of samples is used to calculate the value. For example, if only 3 values are recorded during a 10-minute window and the alert is defined as as a rate, the alert value is calculated by summing the 3 recorded values and dividing by 10 (10 x 1-minute samples).

• in sum: The combined sum of the metric across the time period evaluated.

• at least once: The trigger value is met for at least one sample in the evaluated period.

• for the entire time: The trigger value is met for every sample in the evaluated period.

• as a rate of change: The trigger value is met by the change in value over the evaluated period.

For example, if the file system usage percentage goes above 75 on average for the last 5 minutes, multiple alerts are triggered. The MAC address of the host and the mount directory of the file system are included in the alert notification.
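The arithmetic difference between on average and as a rate can be sketched as follows (the values are illustrative, not from the product):

```python
# "on average" divides by the number of samples actually recorded;
# "as a rate" divides by the expected number of samples in the window.
recorded = [60, 90, 30]        # 3 recorded values in a 10-minute window
window_minutes = 10

on_average = sum(recorded) / len(recorded)      # 180 / 3  = 60.0
as_a_rate = sum(recorded) / window_minutes      # 180 / 10 = 18.0
```

With sparse data, as a rate yields a lower value than on average, so the same threshold can behave quite differently depending on which aggregation you choose.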

Use Cases

• The number of processes running on a host is abnormal.

• Root volume disk usage in a container is high.

7 - Event Alerts

                    Monitor occurrences of specific events, and alert if the total number of occurrences violates a threshold. Useful for alerting on container, orchestration, and service events like restarts and deployments.

                    Alerts on events support only one segmentation label. An alert is generated for each segment.

Defining an Event Alert

                    Guidelines

                    • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

• Severity: Set a severity level for your alert. The priority levels, High, Medium, Low, and Info, are reflected in the Alert list, where you can sort by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high-severity events, notify.

                    • Source Tag: Supported source tags are Kubernetes, Docker, and Containerd.

                    • Trigger: Specify the trigger condition in terms of the number of events for a given duration.

  Event alerts support only one segmentation label. If you choose Multiple Alerts, Sysdig generates an alert for each segment.

                    Specify Event

                    1. Specify the name, tag, or description of an event.

                    2. Specify a Source Tag.

                    Configure Scope

                    Filter the environment on which this alert will apply. Use advanced operators to include, exclude, or pattern-match groups, tags, and entities. You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

                    In this example, failing a liveness probe in the agent-process-whitelist-cluster cluster triggers an alert.

                    Configure Trigger

Define the threshold and time window for assessing the alert condition. Single Alert fires one alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once.

If the number of events in the monitored entity exceeds 5 over the last 10 minutes, recipients will be notified through the selected channel.

8 - Anomaly Detection Alerts

An anomaly is an outlier in a given data set polled from an environment: a deviation from an established pattern. Anomaly detection identifies these anomalous observations. Anomalies can surface in a set of data points collectively, in a single data instance, or as context-specific abnormalities, for example, unauthorized copying of a directory from a container, or unusually high CPU or memory consumption.

Define an Anomaly Detection Alert

                    Guidelines

• Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

• Severity: Set a severity level for your alert. The priority levels, High, Medium, Low, and Info, are reflected in the Alert list, where you can sort by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high-severity events, notify.

• Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with related information by adding additional related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes Cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

                    Specify Entity

                    Select one or more metrics whose behavior you want to monitor.

                    Configure Scope

                    Filter the environment on which this alert will apply. An alert will fire when the value returned by one of the selected metrics does not follow the pattern in the availability zone, us-east-1b.

                    You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

                    Configure Trigger

Trigger gives you control over how notifications are created and helps prevent flooding your notification channel. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations.

                    Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster deviates from its expected behavior for the last 5 minutes, recipients will be notified.

You can set any value for the percentage and a time window greater than 1 minute. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity deviates for 2.5 minutes within the selected 5-minute window.

9 - Group Outlier Alerts

                    Sysdig Monitor observes a group of hosts and notifies you when one acts differently from the rest.

                    Define a Group Outlier Alert

                    Guidelines

• Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

• Severity: Set a severity level for your alert. The priority levels, High, Medium, Low, and Info, are reflected in the Alert list, where you can sort by severity using the top navigation pane. You can use severity as a criterion when creating events and alerts, for example: if there are more than 10 high-severity events, notify.

                    Specify Entity

                    Select one or more metrics whose behavior you want to monitor.

                    Configure Scope

                    Filter the environment on which this alert will apply. An alert will fire when the value returned by one of the selected metrics does not follow the pattern in the availability zone, us-east-1b.

                    You can also create alerts directly from Explore and Dashboards for automatically populating this scope.

                    Configure Trigger

Trigger gives you control over how notifications are created and helps prevent flooding your notification channel. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations.

                    Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

If a monitored host behaves differently from the rest of the group for the last 5 minutes, recipients will be notified.

You can set any value for the percentage and a time window greater than 1 minute. For example, if you choose 50% instead of 100%, a notification will be triggered when the entity deviates for 2.5 minutes within the selected 5-minute window.

Use Cases

• Load balancer servers have uneven workloads.

• Applications or instances deployed in different availability zones show divergent behavior.

• A host in a cluster hogs the network.