You can create metric alerts for scenarios such as:
- Number of processes running on a host
- Root volume disk usage in a container
- CPU/memory usage of a host or workload
Defining a Metric Alert
Set a unique name and description: Provide a meaningful name and description that help recipients quickly identify the alert.
Specify multiple segments: A single segment might not supply enough information to troubleshoot. Enrich the selected entity by adding related segments. Enter hierarchical entities so you have a top-down picture of what went wrong and where. For example, specifying a Kubernetes cluster alone does not provide the context needed to troubleshoot. To narrow down the issue, add further contextual information, such as namespace, deployment, and so on.
Select a metric that this alert will monitor. You can also define how data is aggregated, such as average, maximum, minimum, or sum.
Team scope is automatically applied to alerts. You can further filter the environment by overriding the scope.
For example, the alert below fires when any host's CPU usage goes above the defined threshold within the us-east-1a cloud availability zone.
You can also create alerts directly from Explore and Dashboards to populate this scope automatically.
Define the threshold and time window for assessing the alert condition. A Single Alert fires one alert for your entire scope, while a Multiple Alert fires when any (or every) segment breaches the threshold.
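To make the distinction concrete, here is a minimal Python sketch, not product code, of how Single versus Multiple evaluation could work. The segment names and metric values are hypothetical:

```python
# Hypothetical per-segment CPU averages over the evaluated window (percent).
cpu_by_segment = {"host-a": 62.0, "host-b": 91.5, "host-c": 48.3}
THRESHOLD = 80.0

# Single Alert: one alert for the whole scope, based on the aggregated value.
scope_average = sum(cpu_by_segment.values()) / len(cpu_by_segment)
single_alert_fires = scope_average > THRESHOLD

# Multiple Alert ("any"): fires if any one segment breaches the threshold.
any_fires = any(v > THRESHOLD for v in cpu_by_segment.values())

# Multiple Alert ("every"): fires only if every segment breaches it.
every_fires = all(v > THRESHOLD for v in cpu_by_segment.values())

print(single_alert_fires, any_fires, every_fires)  # False True False
```

Here only host-b breaches, so the scope-wide average stays under the threshold while the "any" condition fires.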
Metric alerts can be triggered based on different aggregations of the metric values:
on average
The average of the retrieved metric values across the evaluated period. The actual number of samples retrieved is used to calculate the value.
For example, if new data is retrieved only from the 7th minute of a 10-minute window and the alert is defined as on average, the value is calculated by summing the 3 recorded values and dividing by 3.
as a rate
The average value of the metric across the evaluated period. The expected number of values is used to calculate the rate that triggers the alert.
For example, if new data is retrieved only from the 7th minute of a 10-minute window and the alert is defined as as a rate, the value is calculated by summing the 3 recorded values and dividing by 10 (10 × 1-minute samples).
as a sum
The combined sum of the metric across the evaluated period.
at least once
The trigger value is met for at least one sample in the evaluated period.
for the entire time
The trigger value is met for every sample in the evaluated period.
as a rate of change
The trigger condition is met by the change in value over the evaluated period.
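To illustrate how these aggregations differ, the sketch below (plain Python, not product code) evaluates the scenario from the examples above: a 10-minute window of 1-minute samples where data was only recorded for 3 minutes. The sample values and threshold are made up:

```python
# 10-minute window, 1-minute samples; data only recorded for 3 of the minutes.
samples = [30.0, 60.0, 90.0]   # the 3 recorded values (hypothetical)
expected_samples = 10          # 10 x 1-minute samples expected in the window
threshold = 50.0

on_average = sum(samples) / len(samples)        # divide by actual samples  -> 60.0
as_a_rate = sum(samples) / expected_samples     # divide by expected samples -> 18.0
as_a_sum = sum(samples)                         # combined sum               -> 180.0
at_least_once = any(v > threshold for v in samples)    # True: 60 and 90 breach
for_entire_time = all(v > threshold for v in samples)  # False: 30 does not
rate_of_change = samples[-1] - samples[0]       # change over the window     -> 60.0
```

Note how the same three samples trigger an "on average" alert at a 50 threshold but not an "as a rate" alert, because the rate divides by the expected 10 samples rather than the 3 received.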
For example, the alert below fires for each unique segment, identified by kube_cluster_name, that uses more than 75% of the filesystem on average over the last 5 minutes.
Example: Alert When Data Transfer Exceeds the Threshold
The example below shows an alert that triggers when the average bytes transferred by a container exceed 20 KiB/s over a 1-minute period.
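A minimal Python sketch of this check, assuming hypothetical per-second byte counts (this is an illustration of the condition, not product code):

```python
# 60 one-second byte counts for a container over a 1-minute window (hypothetical).
bytes_per_second = [18_000, 25_000, 30_000] * 20
window_seconds = 60
threshold_bytes = 20 * 1024  # 20 KiB/s

# Average transfer rate over the window, compared against the threshold.
avg_rate = sum(bytes_per_second) / window_seconds
fires = avg_rate > threshold_bytes
print(round(avg_rate), fires)  # 24333 True
```

With these values the average works out to roughly 24,333 bytes/s, above the 20,480-byte (20 KiB) threshold, so the alert fires.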
In the alert Settings, you can configure links to a Runbook and a Dashboard to speed up troubleshooting when the alert fires.
When viewing the triggered alert, you can quickly access the Runbook and Dashboard you defined.