Downtime Alert

Sysdig Monitor continuously monitors different types of entities in your infrastructure, such as hosts, containers, and processes, and sends notifications when a monitored entity is unavailable or not responding. A downtime alert focuses mainly on unscheduled downtime of programs, containers, and hosts in your infrastructure.


In this example, the downtime of containers is monitored. When one or more containers in the given scope go down within the 1-minute time window, notifications are sent with the necessary information on both the containers and the agents.

The lines shown in the preview chart represent the values for the segments selected to monitor. The popup is a color-coded legend that shows which segment (or combination of segments, if there is more than one) each line represents. You can deselect segment lines to hide them from the chart. Note that Sysdig Monitor shows at most 10 lines in the preview chart. For downtime alerts, the segments are the ones you select for the Alert if any of option.

About Up Metrics

To monitor the downtime of entities, the following up metrics are used: sysdig_host_up, sysdig_container_up, and sysdig_program_up. They indicate whether the agent is able to communicate with the collector. A value of 1 means that the entity is up and the agent is sending this information to the collector. A value of 0 means that the entity is down, which implies that the agent is not communicating with the collector about the entity.

When an alert is configured based on an up metric, two data API queries are performed during each alert check. One query retrieves the current values and the other retrieves the values from the previous alert check interval. For any entity that was present in the previous interval and is not present in the current interval, the metric is marked as 0.
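
The comparison between the two query results can be sketched as follows. This is a minimal illustration in Python, not Sysdig's implementation: the function name mark_up_values and the set-based inputs are hypothetical and stand in for the results of the two data API queries.

```python
def mark_up_values(previous_entities, current_entities):
    """Return a dict mapping each entity to its up value (1 or 0).

    Hypothetical helper: both arguments are sets of entity identifiers,
    standing in for the results of the two data API queries.
    """
    # Every entity reported in the current interval is considered up.
    up = {entity: 1 for entity in current_entities}
    # Entities seen in the previous interval but missing now are marked 0.
    for entity in previous_entities - current_entities:
        up[entity] = 0
    return up


previous = {"web-1", "web-2", "db-1"}
current = {"web-1", "db-1"}
print(mark_up_values(previous, current))
```

In this sketch, web-2 is the only entity marked as 0 because it appears in the previous interval but not in the current one.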

An aggregated value of the up metric is displayed on the dashboard in the Alert Editor, so you might see a value between 0 and 1.
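
To see why a fractional value can appear, consider the following sketch. It assumes that the aggregation is a simple average over the individual 0 and 1 values; the actual aggregation used by the Alert Editor is not specified here, and the function name aggregated_up is hypothetical.

```python
def aggregated_up(up_values):
    """Average a list of 0/1 up values.

    Averaging is an assumption made for illustration only.
    """
    if not up_values:
        raise ValueError("at least one up value is required")
    return sum(up_values) / len(up_values)


# Three entities up and one down average to 0.75.
print(aggregated_up([1, 1, 0, 1]))
```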

Define a Downtime Alert

Guidelines

  • Set a unique name and description: Set a meaningful name and description that help recipients easily identify the alert.

  • Severity: Set a severity level for your alert. The severity levels (High, Medium, Low, and Info) are reflected in the Alert list, where you can sort alerts by severity. You can also use severity as a criterion when creating alerts, for example: if there are more than 10 high-severity events, notify.

  • Specify multiple segments: Selecting a single segment might not always supply enough information to troubleshoot. Enrich the selected entity with context by adding related segments. Enter hierarchical entities so that you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context necessary to troubleshoot. To narrow down the issue, add further contextual information, such as Kubernetes Namespace, Kubernetes Deployment, and so on.

Configure Condition

Scope

Filter the environment to which this alert applies. For example, an alert fires when a container associated with the agent 197288 goes down. The alert is triggered for each container name and agent ID.

Use the in or contain operators to match multiple possible values when applying the scope.

The contain and not contain operators help you retrieve values when you know only part of the value. For example, us retrieves values that contain the string “us”, such as “us-east-1b”, “us-west-2b”, and so on.

The in and not in operators help you filter multiple values.
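
The behavior of the four operators can be illustrated with a short Python sketch. It only models the matching semantics described above and is not the Sysdig scope syntax; the function name matches and the sample zone names other than the two quoted above are hypothetical.

```python
def matches(value, operator, operand):
    """Model the scope operators on a single label value.

    For "in" and "not in", operand is a collection of exact values.
    For "contain" and "not contain", operand is a substring.
    """
    if operator == "in":
        return value in operand
    if operator == "not in":
        return value not in operand
    if operator == "contain":
        return operand in value
    if operator == "not contain":
        return operand not in value
    raise ValueError(f"unsupported operator: {operator}")


zones = ["us-east-1b", "us-west-2b", "eu-central-1a"]

# contain: keep every zone whose name includes the substring "us".
print([z for z in zones if matches(z, "contain", "us")])

# in: keep only zones that exactly equal one of the listed values.
print([z for z in zones if matches(z, "in", {"us-east-1b", "eu-central-1a"})])
```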

You can also create alerts directly from Explore and Dashboards to populate this scope automatically.

Metric

Select the up metric associated with the entity whose downtime you want to monitor. You can select one of the following entities: host, container, or program.

Entity

Specify additional segments by using the Alert if any of option.

The alert is segmented on the specified entities, and they are reported in the default notification template as well as in the Preview. In this example, data is segmented on container name and agent ID. When a container is affected, the notification includes not only the affected container details but also the associated agent IDs.
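
The effect of segmenting on container name and agent ID can be sketched as follows. This is an illustration only: the sample records, their field names (container_name, agent_id, up), and the function down_segments are hypothetical and do not reflect the Sysdig data model.

```python
def down_segments(samples, segments):
    """Return the sorted, unique segment combinations that are down.

    Each sample is a dict holding the segment labels and an "up" value.
    """
    combos = {
        tuple(sample[name] for name in segments)
        for sample in samples
        if sample["up"] == 0
    }
    return sorted(combos)


samples = [
    {"container_name": "nginx", "agent_id": "197288", "up": 0},
    {"container_name": "nginx", "agent_id": "197289", "up": 1},
    {"container_name": "redis", "agent_id": "197288", "up": 0},
]

# One entry per (container name, agent ID) combination that is down.
for combo in down_segments(samples, ["container_name", "agent_id"]):
    print(combo)
```

Each printed combination corresponds to one affected entity, reported together with the agent ID it is associated with.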

Trigger

Define the threshold and time window for assessing the alert condition. Supported time scales are minute, hour, or day.

For example, if the monitored program has been unavailable or unresponsive for the last 1 minute, recipients are notified.

You can set any value for % and a value greater than 1 for the time window. For example, if you choose 50% instead of 100%, a notification is triggered when the entity is down for 5 minutes in the selected time window of 10 minutes.
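
The percentage threshold can be read as the share of the time window during which the entity was down. The following Python sketch illustrates that arithmetic under two assumptions: one up sample per minute, and a "greater than or equal to" comparison. The function name should_trigger is hypothetical.

```python
def should_trigger(up_samples, threshold_percent):
    """Decide whether a downtime alert fires for one time window.

    up_samples holds one 0/1 up value per minute of the window
    (per-minute sampling is an assumption of this sketch).
    """
    if not up_samples:
        return False
    down_minutes = sum(1 for value in up_samples if value == 0)
    down_percent = 100 * down_minutes / len(up_samples)
    # Comparing with >= is an assumption; it matches the 50% example above.
    return down_percent >= threshold_percent


# A 10-minute window in which the entity was down for 5 minutes.
window = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
print(should_trigger(window, 50))   # 5 of 10 minutes down meets 50%
print(should_trigger(window, 100))  # the entity was not down the whole time
```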

Use Cases

  • Your e-commerce website is down during the peak hours of Black Friday, Christmas, or New Year season.

  • Production servers of your data center experience a critical outage.

  • MySQL database is unreachable.

  • File upload does not work on your marketing website.



Last modified August 9, 2022