Alerts
Alerts are the responsive component of Sysdig Monitor. Alerts notify you
when an event or issue occurs that requires attention. Events and issues
are identified based on changes in the metric values collected by Sysdig
Monitor. The Alerts module displays out-of-the-box alerts and a wizard
for creating and editing alerts as needed.
About Sysdig Alert
Sysdig Monitor can generate notifications based on certain conditions or
events you configure. Using the alert feature, you can keep tabs on
your infrastructure and find out about problems as they happen, or even
before they happen, with the alert conditions you define. In Sysdig
Monitor, alerts serve as the central configuration artifact for this feature.
An alert ties one or more conditions or events to the measures to take
when a condition is met or an event happens. Alerts work across
Sysdig modules including Explore, Dashboard, Events, and Overview.
Alert Types
The types of alerts available in Sysdig Monitor:
Downtime: Monitor any
type of entity, such as a host, a container, or a process, and alert
when the entity goes down.
Metric: Monitor
time-series metrics, and alert if they violate user-defined
thresholds.
PromQL: Monitor
metrics through a PromQL query.
Event: Monitor
occurrences of specific events, and alert if the total number of
occurrences violates a threshold. Useful for alerting on container,
orchestration, and service events like restarts and unauthorized
access.
Anomaly Detection:
Monitor hosts based on their historical behaviors, and alert when
they deviate from the expected pattern.
Group Outlier: Monitor
a group of hosts and be notified when one acts differently from the
rest. Group Outlier Alert is supported only on hosts.
The following tools help with alert creation:
Alerts Library: Sysdig
Monitor provides a set of alerts by default. Use them as they are or as
templates to create your own.
Sysdig
API:
Use Sysdig’s Python client to create, list, delete, update and
restore alerts. See
examples.
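For instance, the following is a minimal sketch of listing alerts and deleting one by name with the sdcclient Python package. The API token and the alert name are placeholders, and parameters may vary slightly between client versions:

# Minimal sketch: list alerts and delete one by name with the Sysdig Python client.
# "YOUR-API-TOKEN" and the alert name below are placeholders.
from sdcclient import SdMonitorClient

sdclient = SdMonitorClient("YOUR-API-TOKEN")

# Fetch all alerts visible to the current team.
ok, res = sdclient.get_alerts()
if not ok:
    raise SystemExit(res)

for alert in res["alerts"]:
    print(alert["name"], "enabled" if alert.get("enabled") else "disabled")

# Delete an alert by name (hypothetical name).
stale = [a for a in res["alerts"] if a["name"] == "Old CPU Alert"]
if stale:
    ok, res = sdclient.delete_alert(stale[0])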
Guidelines for Creating Alerts
Decide what to monitor | Determine what type of problem you want to be alerted on. See Alert Types to choose a type of problem. |
Define how it will be monitored | Specify exactly what behavior triggers a violation. For example, the Marathon App is down on the Kubernetes Cluster named Production for ten minutes. |
Decide where to monitor | Narrow down your environment to receive fine-tuned results. Use Scope to choose an entity that you want to keep a close watch on. Specify additional segments (entities) to give context to the problem. For example, in addition to specifying a Kubernetes cluster, add a namespace and deployment to refine your scope. |
Define when to notify | Define the threshold and time window for assessing the alert condition. Single Alert fires an alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once. Multiple Alerts include all the segments you specified to uniquely identify the location, and thus provide a full qualification of where the problem occurred. The higher the number of segments, the easier it is to uniquely identify the affected entities. A good analogy for multiple alerts is alerting on cities: creating multiple alerts on San Francisco would trigger an alert that includes information such as the country it is part of (the USA) and the continent (North America). Trigger gives you control over how notifications are created. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations. |
Decide how notifications are sent | Alert supports customizable notification channels, including email, mobile push notifications, OpsGenie, Slack, and more. To see supported services, see Set Up Notification Channels. |
To create alerts, simply:
Choose an alert
type.
Configure alert
parameters.
Configure the notification
channels you want to
use for alert notification.
Sysdig sometimes deprecates outdated metrics. Alerts that use these
metrics will not be modified or disabled, but will no longer be updated.
See Deprecated
Metrics and Labels.
Use the Alert wizard to create or edit alerts.
Open the Alert Wizard
There are multiple ways to access the Alert wizard:
From Explore
Do one of the following:

From Dashboards
Click the More Options (three dots) icon for a panel, and
select Create Alert.

From Alerts
Do one of the following:
From Overview
From the Events panel on the Overview screen, select a custom or an
Infrastructure type event. From the event description screen, click
Create Alert from Event.

Create an Alert
Configure notification
channels before you begin,
so the channels are available to assign to the alert. Optionally, you
can add a custom subject and body information into individual alert
notifications.
Configuration differs slightly for each alert type. See the respective pages
to learn more. This section covers general instructions to help you get
acquainted with and navigate the Alerts user interface.
To configure an alert, open the Alert wizard and set the following
parameters:
Create the alert:
Type: Select the desired Alert
Types.

Each type has different parameters, but they follow the same
pattern:
Name: Specify a meaningful name that uniquely represents
the alert that you are creating. For example, the entity that
an alert targets, such as
Production Cluster Failed Scheduling pods.
Group (optional): Specify a meaningful group name for
the alert you are creating. The group name helps you narrow down
the problem area and focus on the infrastructure view that
needs your attention. For example, you can enter Redis
for alerts related to Redis services. When the alert
triggers, you will know which service in your workload
requires inspection. Alerts that have no group name are
added to the Default Group. The group name is editable;
to change it, edit the alert.
An alert can belong to only one group. An alert created from
an alert template will have the group already configured by
the Monitor
Integrations.
You can see the existing alert groups on the Alerts
details page.

See
Groupings
for more information on how Sysdig handles infrastructure
views.
Description (optional): Briefly expand on the alert name
or alert condition to give additional context for the
recipient.
Priority: Select a priority: High, Medium, Low,
or Info. You can later sort alerts by severity
using the top navigation pane.

Specify the parameters in the Define, Notify, and
Act sections.
To alert on multiple metrics using boolean logic, click Create
multi-condition alerts. See Multi-Condition
Alerts.

(1) Define
Scope: Everywhere, or a more limited scope to filter a specific
component of the monitored infrastructure, such as a Kubernetes
deployment, a Sysdig Agent, or a specific service.
Trigger: Boundaries for assessing the alert condition, and
whether to send a single alert or multiple alerts. Supported time
scales are minute, hour, or day.
Single alert: Single Alert fires an alert for your entire
scope.
Multiple alerts: Multiple Alert fires if any or every
segment breaches the threshold at once.
Multiple alerts are triggered for each segment you specify. The
specified segments will be represented in the alerts. The higher the
number of segments, the easier it is to uniquely identify the affected
entities.
For detailed descriptions, see the respective sections on Alert Types.
(2) Notify
Notification Channel: Select from the configured
notification channels in the list. Supported channels are:
Email
Slack
Amazon SNS Topic
Opsgenie
Pagerduty
VictorOps
Webhook
You can view the list of notification channels configured for
each alert on the Alerts page.

Notification Options: Set the time interval at which
multiple alerts should be sent.
Format Message: If applicable, add message format details.
See Customize
Notifications.
(3) Act
Click Create.
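The same Define, Notify, and Act settings can also be applied programmatically with the Sysdig Python client. The sketch below is illustrative only: the token, name, condition, scope, segments, and channel values are placeholders, and the severity uses the numeric codes shown in the client's own examples (6 corresponds to info):

# Illustrative sketch: create an alert with the Sysdig Python client (sdcclient).
# All values below are placeholders; adjust them to your environment.
from sdcclient import SdMonitorClient

sdclient = SdMonitorClient("YOUR-API-TOKEN")

# Notify: resolve an already configured notification channel.
ok, notify_channels = sdclient.get_notification_ids(
    [{"type": "SLACK", "channel": "#alerts"}])

# Define + Act: create the alert.
ok, res = sdclient.create_alert(
    "Production Cluster High CPU",                              # Name
    "CPU above 80% for 10 minutes",                             # Description
    6,                                                          # Severity (numeric, 6 == info)
    600,                                                        # Trigger window in seconds
    "avg(cpu.used.percent) > 80",                               # Condition
    ["kubernetes.cluster.name", "kubernetes.namespace.name"],   # Segments
    "ANY",                                                      # Fire per segment (multiple alerts)
    "kubernetes.cluster.name = 'production'",                   # Scope
    notify_channels,                                            # Notification channels
    True)                                                       # Enabled
if not ok:
    raise SystemExit(res)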
Optional: Customize Notifications
You can optionally customize individual notifications to provide context
for the errors that triggered the alert. All the notification channels
support this added contextual information and customization flexibility.
Modify the subject, body, or both of the alert notification with the
following:
Plaintext: A custom message stating the problem. For example,
Stalled Deployment.
Hyperlink: For example, URL to a Dashboard.
Dynamic Variable: For example, a hostname. Note the conventions:
All variables that you insert must be enclosed in double curly
braces, such as {{file_mount}}
.
Variables are case sensitive.
The variables should correspond to the segment values you
created the alert for. For example, if an alert is segmented
by host.hostName and container.name, the corresponding
variables will be {{host.hostName}} and {{container.name}}
respectively. In addition to these segment variables,
__alert_name__ and __alert_status__ are supported. No other
segment variables are allowed in the notification subject and
body.
Notification subjects will not show up on the Event feed.
Using a variable that is not a part of the segment will trigger
an error.
The segment variables used in an alert are resolved to the current
system values when the alert notification is sent.
The body of the notification message contains a Default Alert Template.
It is the default alert notification generated by Sysdig Monitor. You
may add free text, variables, or hyperlinks before and after the
template.
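For example, an alert segmented by host.hostName and container.name could use a subject and body like the following (the wording and the dashboard reference are purely illustrative):

Subject: {{__alert_name__}} is {{__alert_status__}} on {{host.hostName}}
Body: Container {{container.name}} on host {{host.hostName}} breached the threshold.
      See the team dashboard for troubleshooting steps.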
You can send a customized alert notification to the following channels:
Email
Slack
Amazon SNS Topic
Opsgenie
Pagerduty
VictorOps
Webhook
Multi-Condition Alerts
Multi-condition alerts are advanced alerts whose thresholds are built from
complex conditions. To create one, you define the alert threshold as a
custom boolean expression that can involve multiple conditions. Click
Create multi-condition alerts to enable adding conditions as boolean
expressions.

These advanced alerts require specific syntax, as described in the
examples below.
Each condition has five parts:
Metric Name: Use
the exact metric names. To avoid typos, click the HELP
link to
access the drop-down list of available metrics. Selecting a metric
from the list will automatically add the name to the threshold
expression being edited.
Group Aggregation
(optional): If no group aggregation type is selected, the
appropriate default for the metric will be applied (either sum or
average). Group aggregation functions must be applied outside of
time aggregation functions.
Time Aggregation: The historical data rolled up over a selected
period of time.
Operator: Both logical and relational operators are supported.
Value: A static numerical value against which a condition is
evaluated.
The table below displays supported time aggregation functions, group
aggregation functions, and relational operators:
Time Aggregation Function | Group Aggregation Function | Relational Operator |
---|---|---|
timeAvg() | avg() | = |
min() | min() | < |
max() | max() | > |
sum() | sum() | <= |
 | | >= |
 | | != |
The format is:
condition1 AND condition2
condition1 OR condition2
NOT condition1
The order of operations can also be altered via parenthesis:
NOT (condition1 AND (condition2 OR condition3))
Conditions take the following form:
groupAggregation(timeAggregation(metric.name)) operator value
Example Expressions
Several examples of advanced alerts are given below:
timeAvg(cpu.used.percent) > 50 AND timeAvg(memory.used.percent) > 75
timeAvg(cpu.used.percent) > 50 OR timeAvg(memory.used.percent) > 75
timeAvg(container.count) != 10
min(min(cpu.used.percent)) <= 30 OR max(max(cpu.used.percent)) >= 60
sum(file.bytes.total) > 0 OR sum(net.bytes.total) > 0
timeAvg(cpu.used.percent) > 50 AND (timeAvg(mysql.net.connections) > 20 OR timeAvg(memory.used.percent) > 75)
1 - Manage Alerts
Alerts can be managed individually or as a group by using the
checkboxes on the left side of the Alert UI and the customization bar.
The columns of the table can also be configured to provide you with the
data needed for your use cases.

Select a group of alerts and perform several batch operations, such as
filtering, deleting, enabling, disabling, or exporting to a JSON object.
Select individual alerts to perform tasks such as creating a copy for a
different team.
View Alert Details
The bell button next to an alert indicates that you have not resolved
the corresponding events. The Activity Over Last Two Weeks column
visually notifies you with an event chart showing the number of events
that were triggered over the last two weeks. The color of the event
chart represents the severity level of those events.
To view alert details, click the corresponding alert row. The slider
with the alert details will appear. Click an individual event to Take
Action. You can do one of the following:
Acknowledge: Mark that the event has been acknowledged by the
intended recipient.
Create Silence from Event: If you no longer want to be notified,
use this option. You can choose the scope for alert
silence. When silenced,
alerts will still be triggered but will not send you any
notifications.
Explore: Use this option to troubleshoot by using the PromQL
Query.
The event feed will be empty and the Activity Over Last Two Weeks
column will have no event chart if no events were reported in the past
two weeks.
Enable/Disable Alerts
Alerts can be enabled or disabled using the slider or the customization
bar. You can perform these operations on a single alert or on multiple
alerts as a batch operation.
From the Alerts module, check the boxes beside the relevant alerts.
Click Enable Selected or Disable Selected as necessary.
Use the slider beside the alert to disable or enable individual alerts.
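If you prefer to script this, the following is a minimal sketch of disabling an alert by name with the Sysdig Python client; the API token and the alert name are placeholders:

# Minimal sketch: disable (or re-enable) an alert with the Sysdig Python client.
from sdcclient import SdMonitorClient

sdclient = SdMonitorClient("YOUR-API-TOKEN")

ok, res = sdclient.get_alerts()
if not ok:
    raise SystemExit(res)

for alert in res["alerts"]:
    if alert["name"] == "Production Cluster High CPU":   # hypothetical alert name
        alert["enabled"] = False                          # set to True to re-enable
        ok, res = sdclient.update_alert(alert)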

Edit an Existing Alert
To edit an existing alert:
Do one of the following:
Click the Edit button beside the alert.

Click an alert to open the detail view, then click Edit in
the top right corner.

Edit the alert, and click Save to confirm the changes.
Copy an Alert
Alerts can be copied within the current team to allow for similar alerts
to be created quickly, or copied to a different team to share alerts.
Copy an Alert to the Current Team
To copy an alert within the current team:
Highlight the alert to be copied.
The detail view is displayed.

Click Copy.
The Copy Alert screen is displayed.
Select Current from the drop-down.
Click Copy and Open.
The alert opens in edit mode.
Make the necessary changes and save the alert.
Copy an Alert to a Different Team
Highlight the alert to be copied.
The detail view is displayed.
Click Copy.
The Copy Alert screen is displayed.
Select the teams that the alert should be copied to.

Click Send Copy.
Search for an Alert
Search Using Strings
The Alerts table can be searched using partial or full strings. For
example, the search below displays only alerts that contain
kubernetes:

Filter Alerts
The alert feed can be filtered in multiple ways to drill down into the
environment's history and refine the alerts displayed. The feed can be
filtered by severity or status. Examples of each are shown below.
The example below shows only high and medium severity:

The example below shows the alerts that are invalid:

Export Alerts as JSON
A JSON file containing JSON snippets for each selected alert can be
exported to a local machine:
Click the checkboxes beside the relevant alerts to be exported.
Click Export JSON.
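The same data can be pulled programmatically. Below is a minimal sketch with the Sysdig Python client, assuming the response carries the alert definitions in an "alerts" list; the token and file name are placeholders:

# Minimal sketch: dump all alert definitions to a local JSON file.
import json

from sdcclient import SdMonitorClient

sdclient = SdMonitorClient("YOUR-API-TOKEN")

ok, res = sdclient.get_alerts()
if ok:
    with open("alerts.json", "w") as f:
        json.dump(res["alerts"], f, indent=2)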

Delete Alerts
Open the Alerts page and use one of the following methods to delete
alerts:
Hover over a specific alert and click Delete.

Select the checkboxes beside one or more alerts, then click
Delete on the bulk-action toolbar.

Click an alert to see the detailed view, then click Delete on
the top right corner.

2 - Silence Alert Notifications
Sysdig Monitor allows you to silence alerts for a given scope for a
predefined amount of time. When silenced, alerts will still be triggered
but will not send any notifications. You can schedule silencing in
advance. This helps administrators to temporarily mute notifications
during planned downtime or maintenance and send downtime notifications
to selected channels.
With an active silence, the only notifications you will receive are
those indicating the start time and the end time of the silence. All
other notifications for events from that scope will be silenced. When a
silence is active, alerts are still triggered but no notifications are
sent. Additionally, the triggered event will state that the alert is
silenced.
See Working with Alert
APIs for programmatically
silencing alert notifications.
When you create a new silence, it is by default enabled and scheduled.
When the start time arrives for a scheduled silence, it becomes active
and the list shows the time remaining. When the end time arrives, the
silence becomes completed and cannot be enabled again.
To configure a silence:
Click Alerts on the left navigation on the Monitor UI.
Click the Silence tab.
The page shows the list of all the existing silences.
Click Set a Silence.
The Silence for Scope window is displayed.

Specify the following:
Scope: Specify the entity to which the silence applies.
For example, a particular workload or namespace, from
environments that may include thousands of entities.
Begins: Specify one of the following: Today,
Tomorrow, Pick Another Day. Select the time from the
drop-down.
Duration: Specify how long notifications should be
suppressed.
Name: Specify a name to identify the silence.
Notify: Select a channel you want to notify about the
silence.
Click Save.
Silence Alert Notifications from Event Feed
You can also create and edit silences and view silenced alert events on
the Events feeds across the Monitor UI. When you create a silence, the
alert will still be triggered and posted on the Events feed and in the
graph overlays but will indicate that the alert has been silenced.
If you have an alert with no notification channel configured, events
generated from that alert won't be marked as silenced. They also won't
be shown in the events feed with the crossed bell icon and the option
to silence events.
To silence alert notifications from the event feed:
On the event feed, select the alert event that you want to silence.
On the event details slider, click Take Action.

Click Create Silence from Event.
The Silence for Scope window is displayed.
Continue configuring the silence as described above.
Manage Silences
Silences can be managed individually, or as a group, by using the
checkboxes on the left side of the Silence UI and the customization
bar. Select a group of silences and perform batch delete operations.
Select individual silences to perform tasks such as enabling, disabling,
duplicating, and editing.
Change States
You can enable or disable a silence by sliding the state bar next to the
silence. Two kinds of silences show as enabled: active silences
(currently running) and scheduled silences (which will start in the
future). An active silence has a start date in the past and an end date
that is yet to come. A clock icon visually represents an active silence.

Completed silences cannot be re-enabled once the silenced period is
finished. However, you can duplicate a completed silence with all its
data and set a new silencing period.
A silence can be disabled only when:
Filter Silences
Use the search bar to filter silences. You can either perform a simple
auto-complete text search or use the categories. The feed can be
filtered by the following categories: Active, Scheduled,
Completed.
For example, the following shows the completed silences that start with
“cl”.

Duplicate a Silence
Do one of the following to duplicate a silence:
Hover over the row and click the Duplicate button.

Click the row to open the Silence for Scope window. Make any
necessary changes and click Duplicate.
Edit Silence
You can edit scheduled silences. For active silences, you can only
extend the duration. You cannot edit completed silences.
To edit a silence, do one of the following:
Click the row to open the Silence for Scope window. Make the
necessary changes and click Update.
Hover over the row and click the Edit button. The Silence
for Scope window will be displayed.

Make necessary changes and click Update.
Extend the Time Duration
For active silences, you can extend the duration to one of the
following:
1 Hour
2 Hours
6 Hours
12 Hours
24 Hours
To do so, click the extend-duration button on the menu and
choose the duration. You can also extend the time of an active silence
from the Silence for Scope window.

Extending the time duration will notify the configured notification
channels that the downtime is extended. You can also extend the time
from a Slack notification of a silence by clicking the given link. It
opens the Silence for Scope window of the running silence where you
can make necessary adjustments.
You cannot extend the duration of completed silences.
3 - Alerts Library
To help you get started quickly, Sysdig provides a set of curated alert
templates called Alerts Library. Powered by Monitor Integrations,
Sysdig automatically detects the applications and services running in
your environment and recommends alerts that you can enable.
Two types of alert templates are included in Alerts Library:

Recommended: Alert suggestions based on the services that are
detected running in your infrastructure.
All templates: You can browse templates for all the services.
For some templates, you might need to configure Monitor
Integrations.
Access Alerts Library
Log in to Sysdig Monitor.
Click Alerts from the left navigation pane.
On the Alerts tab, click Library.

Import an Alert
Locate the service that you want to configure an alert for.
To do so, either use the text search or identify from a list of
services.
For example, click Redis.

Eight template suggestions are displayed for 14 Redis services
running in the environment.
From a list of template suggestions, choose the desired template.
The Redis page shows the alerts that are already in use and that you
can enable.
Enable one or more alert templates. To do so, do one of the
following:
Click Enable Alert.
Bulk enable templates: select the checkboxes corresponding to the
alert templates and click Enable Alert in the top-right
corner.
Click an alert template to display the slider, then click
Enable Alert on the slider.
On the Configure Redis Alert page, specify the Scope and select
the Notification channels.

Click Enable Alert.
You will see a message stating that the Redis Alert has been
successfully created.
Use Alerts Library
In addition to importing an alert, you can also do the following with
the Alerts Library:
Identify Alert templates associated with the services running in
your infrastructure.

Bulk import Alert templates. See Import an
Alert.
View alerts that are already configured.
Filter Alert templates. Enter the search string to display the
matching results.

Discover the workloads where a service is running. To do so, click
on the Alert template to display the slider. On the slider, click
Workloads.
View the alerts in use. To do so, click on an Alert template to
display the slider. On the slider, click Alerts in use.

Configure an alert.
Additional alert configuration, such as changing the alert name,
description, and severity, can be done after the import.
4 - Downtime Alert
Sysdig Monitor continuously monitors any type of entity in your
infrastructure, such as a host, a container, a process, or a service,
and sends notifications when the monitored entity is not available or
not responding. Downtime alerts focus mainly on unscheduled downtime of
your infrastructure.

In this example, a Kubernetes cluster is monitored and the alert is
segmented on both cluster and namespace. When a Kubernetes cluster in
the selected availability zone goes down, notifications will be sent
with necessary information on both cluster and affected namespace.
The lines shown in the preview chart represent the values for the
segments selected to monitor. The popup is a color-coded legend to show
which segment (or combination of segments if there is more than one) the
lines represent. You can also deselect some segment lines to prevent
them from showing in the chart. Note that there is a limit of 10 lines
that Sysdig Monitor ever shows in the preview chart. For downtime
alerts, segments are actually what you select for the “Select entity
to monitor” option.
Define a Downtime Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The priority
levels High, Medium, Low, and Info are reflected
in the Alert list, where you can sort by the severity of the alert.
You can use severity as a criterion when creating alerts, for
example: if there are more than 10 high severity events, notify.
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have a
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Entity
Select an entity whose downtime you want to monitor.
In this example, you are monitoring the unscheduled downtime of a
host.
Specify additional segments:

The specified entities are used as segments and appear in the
default notification template as well as in the Preview. In this
example, data is segmented by Kubernetes cluster name and namespace
name. When a cluster is affected, the notification will not only
include the affected cluster details but also the associated
namespaces.
Filter the environment on which this alert will apply. An alert will
fire when a host goes down in the availability zone, us-east-1b.

Use the in or contain operators to match multiple possible
values when applying a scope.
The contain and not contain operators help you retrieve values
when you know part of the value. For example, us retrieves values
that contain the string "us", such as "us-east-1b",
"us-west-2b", and so on.
The in and not in operators help you filter on multiple values.
You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 10 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, if you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 5 minutes within the
selected time window of 10 minutes.
Use Cases
Your e-commerce website is down during the peak hours of Black
Friday, Christmas, or New Year season.
Production servers in your data center experience a critical outage.
The MySQL database is unreachable.
File upload does not work on your marketing website.
5 - PromQL Alerts
Sysdig Monitor enables you to use PromQL to define metric expressions
that you can alert on. You define the alert conditions using the
PromQL-based metric expression. This way, you can combine different
metrics and warn on cases like service-level agreement breach, running
out of disk space in a day, and so on.
Examples
For PromQL alerts, you can use any metric that is available in PromQL,
including Sysdig native
metrics. For more details
see the various integrations available on
promcat.io.
Low Disk Space Alert
Warn if disk space falls below a specified quantity. For example disk
space is below 10GB in the 24h hour:
predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000
Slow Etcd Requests
Notify if etcd
requests are slow. This example uses the
promcat.io integration.
histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
High Heap Usage
Warn when the heap usage in ElasticSearch is more than 80%. This example
uses the promcat.io
integration.
(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
Guidelines
Sysdig Monitor does not currently support the following:
Interacting with the Prometheus Alertmanager or importing Alertmanager
configuration.
Providing the ability to use, copy, paste, and import predefined alert
rules.
Converting alert rules to map to the Sysdig alert editor.
Create a PromQL Alert
Set a meaningful name and description that help recipients easily
identify the alert.
Set a Priority
Select a priority for the alert that you are creating. The supported
priorities are High, Medium, Low, and Info. You can also
view events in the Dashboard and Explore UIs and sort them by
severity.
Define a PromQL Alert
PromQL: Enter a valid PromQL expression. The query will be executed
every minute. However, the alert will be triggered only if the query
returns data for the specified duration.

In this example, you will be alerted when the rate of HTTP requests has
doubled over the last 5 minutes.
Duration: Specify the time window for evaluating the alert condition
in minutes, hours, or days. The alert will be triggered if the query
returns data for the specified duration.
Define Notification
Notification Channels: Select from the configured notification
channels in the list.
Re-notification Options: Set the time interval at which multiple
alerts should be sent if the problem remains unresolved.
Notification Message & Events: Enter a subject and body. Optionally,
you can choose an existing template for the body. Modify the subject,
body, or both for the alert notification with a hyperlink, plain text,
or dynamic variables.
Import Prometheus Alert Rules
Sysdig Alert allows you to import Prometheus rules or create new rules
on the fly and add them to the existing list of alerts. Click the
Upload Prometheus Rules option and enter the rules as YAML in the
Upload Prometheus Rules YAML editor. Importing your Prometheus alert
rules will convert them to PromQL-based Sysdig alerts. Ensure that the
alert rules are valid YAML.

You can upload one or more alert rules in a single YAML and create
multiple alerts simultaneously.

Once the rules are imported to Sysdig Monitor, the alert list will be
automatically sorted by last modified date.

Besides the pre-populated template, each rule specified in the Upload
Prometheus Rules YAML editor requires the standard Prometheus rule
fields. See the following examples to understand the format of the
Prometheus Rules YAML. Ensure that the alert rules are valid YAML to
pass validation.
Example: Alert on Prometheus Crash Looping
To alert on potential Prometheus crash looping, create a rule that
triggers when Prometheus restarts more than twice in the last 10 minutes.
groups:
- name: crashlooping
rules:
- alert: PrometheusTooManyRestarts
expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
for: 0m
labels:
severity: warning
annotations:
summary: Prometheus too many restarts (instance {{ $labels.instance }})
description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n
Example: Alert on HTTP Error Rate
To alert on HTTP requests with status 5xx (> 5%) or on high latency:
groups:
- name: default
rules:
# Paste your rules here
- alert: NginxHighHttp5xxErrorRate
expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
for: 1m
labels:
severity: critical
annotations:
summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
description: Too many HTTP requests with status 5xx
- alert: NginxLatencyHigh
expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
for: 2m
labels:
severity: warning
annotations:
summary: Nginx latency high (instance {{ $labels.instance }})
description: Nginx p99 latency is higher than 3 seconds
6 - Metric Alerts
Sysdig Monitor keeps a watch on time-series metrics and alerts if they
violate user-defined thresholds.

The lines shown in the preview chart represent the values for the
segments selected to monitor. The popup is a color-coded legend to show
which segment (or combination of segments if there is more than one) the
lines represent. You can also deselect some segment lines to prevent
them from showing in the chart. Note that there is a limit of 10 lines
that Sysdig Monitor ever shows in the preview chart.
Defining a Metric Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have a
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Metrics
Select a metric that this alert will monitor. You can also define how
data is aggregated, such
as avg, max, min or sum. To alert on multiple metrics using boolean
logic, switch to multi-condition
alert.
Filter the environment on which this alert will apply. An alert will
fire when a host goes down in the availability zone, us-east-1b.

Use advanced operators to include, exclude, or pattern-match groups,
tags, and entities. See Multi-Condition
Alerts.
You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Define the threshold and time window for assessing the alert condition.
Single Alert fires an alert for your entire scope, while Multiple Alert
fires if any or every segment breaches the threshold at once.
Metric alerts can be triggered to notify you of different aggregations:
on average | The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if new data is retrieved only in the 7th minute of a 10-minute sample and the alert is defined as on average, the alert is calculated by summing the 3 recorded values and dividing by 3. |
as a rate | The average value of the metric across the time period evaluated. The expected number of values is used to calculate the rate that triggers the alert. For example, if new data is retrieved only in the 7th minute of a 10-minute sample and the alert is defined as as a rate, the alert is calculated by summing the 3 recorded values and dividing by 10 (10 x 1-minute samples). |
in sum | The combined sum of the metric across the time period evaluated. |
at least once | The trigger value is met for at least one sample in the evaluated period. |
for the entire time | The trigger value is met for every sample in the evaluated period. |
as a rate of change | The trigger value is met by the change in value over the evaluated period. |
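The difference between on average and as a rate comes down to the divisor. A small illustration of the arithmetic described in the table above, using made-up sample values:

# Three values retrieved during a 10-minute evaluation window.
samples = [40, 55, 70]

on_average = sum(samples) / len(samples)  # divide by samples actually retrieved -> 55.0
as_a_rate = sum(samples) / 10             # divide by the 10 expected one-minute samples -> 16.5

print(on_average, as_a_rate)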
For example, if the file system used percentage goes above 75 on
average for the last 5 minutes, multiple alerts will be triggered. The
MAC address of the host and the mount directory of the file system will
be included in the alert notification.

Use Cases
7 - Event Alerts
Monitor occurrences of specific events, and alert if the total number of
occurrences violates a threshold. Useful for alerting on container,
orchestration, and service events like restarts and deployments.
Alerts on events support only one segmentation label. An alert is
generated for each segment.

Defining an Event Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The priority levels
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Source Tag: Supported source tags are Kubernetes,
Docker, and Containerd.
Trigger: Specify the trigger condition in terms of the number of
events for a given duration.
Event alerts support only one segmentation label. If you choose
Multiple Alerts, Sysdig generates only one alert for a selected
segment.
Specify Event
Specify the name, tag, or description of an event.

Specify a Source Tag.
Filter the environment on which this alert will apply. Use advanced
operators to include, exclude, or pattern-match groups, tags, and
entities. You can also create alerts directly from Explore and
Dashboards for automatically populating this scope.

In this example, failing a liveness probe in the
agent-process-whitelist-cluster cluster triggers an alert.
Define the threshold and time window for assessing the alert condition.
Single Alert fires an alert for your entire scope, while Multiple Alert
fires if any or every segment breaches the threshold at once.

If the number of events triggered in the monitored entity is greater
than 5 for the last 10 minutes, recipients will be notified through the
selected channel.
8 - Anomaly Detection Alerts
An anomaly is an outlier in a given data set polled from an
environment; it is a deviation from an established pattern. Anomaly
detection is about identifying these anomalous observations. Anomalies
can be detected from a set of data points collectively, from a single
instance of data, or from context-specific abnormalities. Examples
include unauthorized copying of a directory from a container, and high
CPU or memory consumption.

Define an Anomaly Detection Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The priority levels
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have a
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Entity
Select one or more metrics whose behavior you want to monitor.
Filter the environment on which this alert will apply. An alert will
fire when the value returned by one of the selected metrics does not
follow the pattern in the availability zone, us-east-1b.

You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Trigger gives you control over how notifications are created and helps
prevent flooding your notification channel with notifications. For
example, you may want to receive a notification for every violation, or
only a single notification for a series of consecutive violations.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 5 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, If you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 2.5 minutes in the
selected time window of 5 minutes.
9 - Group Outlier Alerts
Sysdig Monitor observes a group of hosts and notifies you when one acts
differently from the rest.

Define a Group Outlier Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The priority levels
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Specify Entity
Select one or more metrics whose behavior you want to monitor.
Filter the environment on which this alert will apply. An alert will
fire when the value returned by one of the selected metrics does not
follow the pattern in the availability zone, us-east-1b.

You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Trigger gives you control over how notifications are created and helps
prevent flooding your notification channel with notifications. For
example, you may want to receive a notification for every violation, or
only a single notification for a series of consecutive violations.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 5 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, If you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 2.5 minutes in the
selected time window of 5 minutes.
Use Cases
Load balancer servers have uneven workloads.
Changes in applications or instances deployed in different
availability zones.
Network-hogging hosts in a cluster.