Sysdig Monitor
Sysdig Monitor is part of Sysdig’s container intelligence platform.
Sysdig uses a unified platform to deliver security, monitoring, and
forensics in a container- and microservices-friendly architecture.
Sysdig Monitor is a monitoring, troubleshooting, and alerting suite
offering deep, process-level visibility into dynamic, distributed
production environments. Sysdig Monitor captures, correlates, and
visualizes full-stack data, and provides dashboards for monitoring.
In the background, the Sysdig agent lives on the hosts being monitored
and collects the appropriate metrics and events. Out of the box, the
agent reports on a wide variety of pre-defined metrics. Additional
metrics and custom parameters are available via agent configuration
files. For more information, see the Sysdig Agent
Documentation.
Major Benefits
Explore and monitor application performance at any level of the
infrastructure stack
Correlate metrics and events, and compare with past performance
Observe platform state and health
Auto-detect anomalies
Visualize and share performance metrics with out-of-the-box and
custom dashboards
Powerful, tuned, and flexible alerts
Proactively alert on incidents across services, hosts, containers
and so on
Trigger system captures for offline troubleshooting and forensics
Analyze system call activity to accelerate problem resolution
Key Components
Log into the Sysdig Monitor interface, and get started with the basics.
Operate and troubleshoot Kubernetes infrastructure easily with a curated and unified view of metrics, alerts, and events.
Dive into Sysdig Monitor with a deeper understanding of the Explore
module, data aggregation, and how to break down data.
The backbone of monitoring: learn more about metrics, integrate external
platforms, and explore the complete metrics dictionary.
Learn how to build alerts to notify users of infrastructure events,
changes in behavior, and unauthorized access.
Learn how to build a custom dashboard, configure the default ones, or
reconfigure panels to best suit your infrastructure.
Integrate with various inbound and outbound data sources ranging from a number of platforms and orchestrators to a wide range of applications.
Integrate Docker and Kubernetes events, customize event notifications,
and review infrastructure history.
Create capture files containing system calls and other OS events to
assist monitoring and troubleshooting the infrastructure.
1 -
Getting Started with Sysdig Monitor
Sysdig Monitor allows you to maximize the visibility of your Kubernetes
environments with native Prometheus support. You can troubleshoot issues
faster with Sysdig’s eBPF derived metrics, out-of-the-box dashboards,
and alerts.
You can choose the Sysdig Monitor Free Trial option to quickly connect
a single cloud account to Sysdig and start with
Prometheus-compatible Kubernetes and cloud monitoring.
Once connected, the Get Started page shows a subset of the options
available in the 30-day trial or Enterprise.
Get Started Page
The Get Started page targets the key steps to ensure users are
getting the most value out of Sysdig Monitor. The page is updated with
new steps as users complete tasks and Sysdig adds new features to the
product.
The Get Started page also serves as a linking page for
Documentation
Release Notes
The Sysdig Blog
Self-Paced Training
Support
Users can access the Get Started page at any time by clicking the
rocketship in the side menu.

Install the Agent
Installing the agent on your infrastructure allows Sysdig to collect
data for monitoring and security purposes. For more information, see
Quick Install Sysdig Agent on
Kubernetes.
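As a minimal sketch, the agent can be installed on a Kubernetes cluster with Helm. The chart, release, and namespace names below are illustrative, and your access key comes from your Sysdig account; see the Quick Install guide above for the exact command and values for your region:
# Illustrative only; consult the Quick Install Sysdig Agent on Kubernetes guide for your exact values
helm repo add sysdig https://charts.sysdig.com
helm repo update
helm install sysdig-agent sysdig/sysdig \
  --namespace sysdig-agent --create-namespace \
  --set sysdig.accessKey=<YOUR_ACCESS_KEY>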
(Optional) Connect Your Prometheus Servers
Connecting your Prometheus servers to Sysdig-managed Prometheus Service
helps leverage Sysdig for scalable long-term storage of your Prometheus
metrics, PromQL dashboards, centralized querying, and PromQL-based
alerting. For more information, see Collect Prometheus
Metrics.
Invite Your Team
Invite someone on your team to use this Sysdig Monitor account. They
will be notified by email. A user will be created for them and added
to the default team. They are automatically assigned the Advanced User
role.
Monitor Your Kubernetes Clusters
Get a unified view of the health, risk, and capacity of your Kubernetes
infrastructure in a multi- and hybrid-cloud environment. For more
information, see Dashboard
Templates.
Get deep insight into your Kubernetes workloads faster with the
Workload Status & Performance Dashboard.
Drill down to workload pods and monitor pod-level resource usage and
troubleshoot performance issues with the Pod Status & Performance
Dashboard.
Cluster Capacity Planning
Verify that your cluster is sized properly for the applications already
deployed, identify resource over-commitment that can lead to pod
evictions, and discover unused requested resources or containers without
limits defined, with the Cluster Capacity Planning Dashboard.
Cluster/Namespace Available Resources
Determine if your cluster has the capacity to deploy a new workload and
ascertain if increasing CPU or memory requests or placing limits on an
existing application is necessary with the Cluster/Namespace Available
Resources Dashboard.
Pod Rightsizing & Workload Capacity Optimization
Identify resource-hogging workloads while optimizing your capacity with
the Pod Rightsizing & Workload Capacity Optimization Dashboard.
Set Up Alert
Sysdig Monitor emits alerts to provide proactive notification of events,
anomalies, or any incident that requires attention. The alerting system
provides out-of-the-box push gateways for email, Slack,
cloud-provider notification queues, and custom webhooks, among others.
Alerts are used in Sysdig Monitor when event thresholds are
crossed and can be sent over a variety of supported notification
channels. Integrate Sysdig with your notification dispatchers and
incident management workflows. See Set Up Notification
Channels.
Turn on Alerts
Turn on recommended alerts from our Alerts Library. Customize our
recommendations or create your own alerts from scratch. See Alerts
Library.
Monitor Your Services
Create a Dashboard
Create customized dashboards to display the most relevant views and
metrics for the infrastructure in a single location. Each dashboard is
comprised of a series of panels configured to display specific data in a
number of different formats. See
Dashboards.
Get Started with PromQL
Write PromQL queries more easily with the form-based querying available
in Sysdig Monitor. All metrics are enriched with cloud and Kubernetes
metadata, avoiding complicated PromQL joins. See Using
PromQL.
Monitoring Integrations
Sysdig discovers services running in infrastructure and recommends
appropriate Monitoring Integrations that allow you to collect
service-specific metrics. The integration bundle includes out-of-the-box
dashboards and default alerts. See Configure
Monitoring Integrations.
Advanced Actions
Integrate development tools:
2 -
Advisor
Advisor brings your metrics, alerts, and events into a focused and curated view to help you operate and troubleshoot Kubernetes infrastructure. Over time, Advisor will surface the infrastructure issues that need your attention, helping you solve problems faster.
Advisor is available only to SaaS users. The feature is not currently available for on-prem environments.
Advisor presents your infrastructure grouped by cluster, namespace, workload, and pod. You cannot currently configure a custom grouping. Depending on the selection, you will see different curated views and you can switch between the following:
- Triggered alerts
- Events from Kubernetes, container engines, and custom events sent via the Monitor APIs
- Cluster usage and capacity
- Key golden signals (requests, latency, errors) derived from system calls
- Kubernetes metrics about the health and status of Kubernetes objects
- Container live logs
- Process and network telemetry (CPU, memory, network connections, etc.)
- Monitoring Integrations
The time window of metrics displayed on Advisor is the last 1 hour of collected data. To see historical values for a metric, drill down to a related dashboard or explore a metric using the Explore UI.
Live logs
Advisor can display live logs for a container, which is the equivalent of running kubectl logs. This is useful for troubleshooting application errors or problems such as pods in a CrashLoopBackOff state.
When selecting a Pod, a Logs tab will appear. If there are multiple containers within a pod, you can select the container you wish to view logs for. Once requested, logs are streamed for 3 minutes before the session is automatically closed (you can simply re-start streaming if necessary).
Live logs are tailed on-demand and thus not persisted. After a session is closed they are no longer accessible.
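For comparison, the same information can be retrieved manually with kubectl; the pod, container, and namespace names below are placeholders:
# Stream logs from a specific container in a pod (names are illustrative)
kubectl logs -f <pod-name> -c <container-name> -n <namespace>
# Show logs from the previous (crashed) container instance, useful for CrashLoopBackOff
kubectl logs <pod-name> -c <container-name> -n <namespace> --previous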
Manage Access to Live Logs
By default, live logs are available to users within the scope of their Sysdig Team. Use Custom Roles to manage live logs permissions.
Live logs are enabled by default in agent 12.7.0 or newer. Agent 12.6.0 supports live logs, but the feature must be manually enabled by setting enabled: true. Older versions of the Sysdig Agent do not support live logs.
Live logs can be enabled or disabled within the agent configuration.
To turn live logs off globally for a cluster, add the following in the dragent.yaml file:
live_logs:
  enabled: false
If using Helm, this is configured via sysdig.settings. For example:
sysdig:
  # Advanced settings. Any option in here will be directly translated into dragent.yaml in the Configmap
  settings:
    live_logs:
      enabled: false
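As a sketch (the release and chart names below are assumptions; adjust them to match how the agent was installed), the same setting could also be applied with a Helm upgrade:
# Illustrative release and chart names; the values key mirrors the sysdig.settings block above
helm upgrade sysdig-agent sysdig/sysdig \
  --reuse-values \
  --set sysdig.settings.live_logs.enabled=false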
2.1 -
Overview
Overview leverages Sysdig’s unified data platform to monitor, secure,
and troubleshoot your hosts and Kubernetes clusters and workloads.
The module provides a unified view of the health, risk, and capacity of
your Kubernetes infrastructure— a single pane of glass for host machines
as well as Kubernetes Clusters, Nodes, Namespaces, and Workloads across
a multi- and hybrid-cloud environment. You can easily filter by any of
these entities and view associated events and health data.
Overview shows metrics prioritized by event count and severity, allowing
you to get to the root cause of the problem faster. Sysdig Monitor polls
the infrastructure data every 10 minutes and refreshes the metrics and
events on the Overview page with the system health.
Key Benefits
Overview provides the following benefits:
Show a unified view of the health, risk, resource use, and capacity
of your infrastructure environment at scale
Render metrics, security events, compliance CIS benchmark
results, and contextual events in a single location
Eliminate the need for stand-alone security, monitoring, and
forensics tools
View data on-the-fly by workload or by infrastructure
Display contextual live event stream from alerts, Kubernetes,
containers, policies, and image scanning results
Surface entities intelligently based on event count and severity
Drill down from Clusters to Nodes and Namespaces
Support infrastructure monitoring of multi- and hybrid-cloud
environments
Expose relevant information based on core operational users:
DevOps / Platform Ops
Security Analyst
Service Owner
Accessing the Overview User Interface
You can access and set the scope of Overview in the Sysdig Monitor UI or
with the URL. Click Overview in the left navigation, then select one of
the Kubernetes entities.
About the Overview User Interface
The Overview interface opens to the Clusters Overview page. This section describes the major components of the interface and the navigation options.

Though the default landing page is Clusters Overview, when you have no Kubernetes clusters configured, the Overview tab opens to the Hosts view. In addition, when you reopen the Overview menu, the default view will be your last visited Overview page as it retains the visit history.
Overview Rows
Each row represents a Kubernetes entity: a cluster, node, namespace, or
workload. In the screenshot above, each row shows a Kubernetes cluster.
Navigating rows is easy
Click on the Overview icon in the left navigation and choose an
Overview page, or drill down into the next Overview page to explore
the next granular level of data. Each Overview page shows 10 rows by
default and a maximum of 100 rows. Click Load More
to display
additional rows if there are more than 10 rows per page.
Ability to select a specific row in an Overview page
Each row contains the scope of the relevant entity that it is
showing data for. Clicking a specific row deselects the rest of the rows
(for instance, selecting staging deselects all other rows in the
screenshot above), letting you focus on the scope of the selected
entity, including the events scoped to that row. Pausing on a single row
provides a snapshot of what is currently going on with the entity under
purview.
Entities are ranked according to the severity and the number of events detected in them
Rows are sorted by the count and severity level of the events
associated with the entity and are displayed in descending order.
The items with the highest number of high severity events are shown
first, followed by medium, low, and info. This organization helps to
highlight events demanding immediate attention and to streamline
troubleshooting efforts, in environments that may include thousands
of entities.
Scope Editor
Scope Editor allows targeting down to a specific entity, such as a
particular workload or namespace, from environments that may include
thousands of entities. The levels of scope, determined by the Kubernetes
hierarchy, progress from Workload up to Cluster, with Cluster at
the top level. In smaller environments, using the Scope Editor is
equivalent to clicking a single row in an Overview page where no scope
has been applied.
Cluster: The highest level
in the hierarchy. The only scope applied to the page is Cluster. It
allows you to select a specific cluster from a list of available ones.
Node: The second level in
the hierarchy. The scope is determined by Cluster and Node. Selection is
narrowed down to a specific node in a selected cluster.
Namespace: The third level
in the hierarchy. The scope is determined by Cluster and Namespace.
Selection is narrowed down to a specific namespace in a selected
cluster.
Workloads: The last entity
in the hierarchy. The scope is initially determined by Cluster and
Namespace, then the selection is narrowed to a specific Deployment,
Service, or StatefulSet. Choosing all three options is not allowed.
Time Navigation
The Overview feature is based around time. Sysdig Monitor polls the infrastructure data every 1 minute and refreshes the metrics and events on the Overview page with the system health. The time range is fixed at 12 hours. However, the gauge and compliance score widgets display the latest data sample, not an aggregation over the entire 12-hour time range.
The Overview feed is always live and cannot be paused.
Unified Stream of Events
The right panel of Overview provides a context-sensitive events
feed.
Click an overview row to see relevant Events on the right. Each event is
intelligently populated with end-to-end metadata to give context and
enable troubleshooting.
Event Types
Overview renders the following event types:
Alert: See Alerts.
Custom: Ensure that Custom labels are enabled to view this type of
event.
Containers: Events associated with containers.
Kubernetes: Events associated with Kubernetes infrastructure.
Scanning: See Image
Scanning.
Policy: See Policies.

Event Statuses
Overview renders the following alert-generated event statuses:
Triggered: The alert condition has been met and still persists.
Resolved: A previously existing alert condition no longer
persists.
Acknowledged: The event has been acknowledged by the intended
recipient.
Un-acknowledged: The event has not been acknowledged by an
intended recipient. All events are by default marked as
Un-acknowledged.
Silenced: The alert event has been silenced for a specified
scope. No alert notification will be sent out to the channels during
the silenced window.
General Guidelines
First-Time Usage
If the environment is created for the first time, Sysdig Monitor
fetches data and generates the associated pages. The Overview feature
is enabled immediately; however, it can take up to one hour for the
Overview pages to show the necessary data.
Overview uses time windows of 1H, 6H, and 1D, so you must wait 1H,
6H, and 1D respectively before data appears on the corresponding
Overview pages.
If enough data is not available during the first hour, the “No Data
Available” page is shown until that hour has passed.
Tuning Overview Data
Sysdig Monitor leverages a caching mechanism to fetch pre-computed data
for the Overview screens.
If pre-computed data is unavailable, non-computed data is fetched and
must be calculated before it can be displayed, which adds delay.
Caching is enabled for Overview, but for optimum performance you must
wait for the 1H, 6H, and 1D windows the first time you use Overview.
After that time has passed, the data is automatically cached with every
passing minute.
Enabling Overview for On-Prem Deployments
The Overview feature is not available by default on On-Prem deployments.
Use the following API to enable it:
Get the Beta settings as follows:
curl -X GET 'https://<Sysdig URL>/api/on-prem/settings/overviews' \
-H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
-H 'X-Sysdig-Product: SDC' -k
Replace <Sysdig URL> with the Sysdig URL associated with
your deployment and <GLOBAL_SUPER_ADMIN_SDC_TOKEN> with
the SDC token associated with your deployment.
Copy the payload and change the desired values in the settings.
Update the settings as follows:
curl -X PUT 'https://<Sysdig URL>/api/on-prem/settings/overview' \
-H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
-H 'X-Sysdig-Product: SDC' \
-d '{ "overviews": true, "eventScopeExpansion": true}'
Feature Flags
2.1.1 -
Clusters Data
This topic discusses the Clusters Overview page and helps you understand
its gauge charts and the data displayed on them.
About Clusters Overview
In Kubernetes, a pool of nodes combine their resources to form a more
powerful machine: a cluster. The Clusters Overview page provides key
metrics indicating the health, risk, capacity, and compliance of each
cluster. Your clusters can reside in any cloud or multi-cloud
environment of your choice.

Each row in the Clusters page represents a cluster. Clusters are sorted
by the severity of corresponding events in order to highlight the area
that needs attention. For example, a cluster with high severity events
is bubbled up to the top of the page to highlight the issue. You can
further drill down to the Nodes or Namespaces Overview page for
investigating at each level.
In environments where Sysdig Secure is not enabled, Network I/O is shown
instead of the Compliance score.
Interpret the Cluster Data
This topic gives insight into the metrics displayed on the Clusters
Overview screen.
Node Ready Status
The chart shows the latest value returned by
avg(min(kubernetes.node.ready))
.
What Is It?
The number shows the readiness for nodes to accept pods across the
entire cluster. The numeric availability indicates the percentage of
time the nodes are reported as ready by
Kubernetes.
For example:
100% is displayed when 10 out of 10 nodes are ready for the entire
time window, say, for the last one hour.
95% is displayed when 9 out of 10 nodes are ready for the entire
time window and one node is ready only for 50% of the time.
The bar chart displays the trend across the selected time window, and
each bar represents a time slice. For example, selecting the last 1-hour
window displays 6 bars, each indicating a 10-minute time slice. Each bar
represents the availability across the time slice (green) or the
unavailability (red).
For instance, the following image shows an average availability of 80%
across the last 1-hour, and each 10-minute time slice shows a constant
availability for the same time window:

What to Expect?
Expect a constant 100% at all times.
What to Do Otherwise?
If the value is less than 100%, determine whether a node is not
available at all, or one or more nodes are partially available.
Drill down either to the Nodes screen in Overview or to the
“Kubernetes Cluster Overview” in Explore to see the list of
nodes and their availability.
Check the Kubernetes Node Overview dashboard in Explore to
identify the problem that Kubernetes reports.
Pods Available vs Desired
The chart shows the latest value returned by
sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count))
.
What Is It?
The chart displays the ratio between available and desired pods,
averaged across the selected time window, for all the pods in a given
Cluster. The upper bound shows the number of desired pods in the
Cluster.
For instance, the following image shows 42 desired pods are available to
use:

What to Expect?
You should typically expect 100%.
If certain pods take a long time to be available you might temporarily
see a value that is less than 100%. Pulling images, pod initialization,
readiness probe, and so on causes such delays.
What to Do Otherwise?
Identify one or more Namespaces that have lower availability. To do so,
drill down to the Namespaces screen, then drill down to the
Workloads screen to identify the unavailable pods.
If the number of unavailable pods is considerably higher (the ratio is
significantly low), check the status of the Nodes. A Node failure will
cause several pods to become unavailable across most of the Namespaces.
Several factors could cause pods to get stuck in the Pending state (the diagnostic commands after this list can help confirm which applies):
Pods make requests for resources that exceed what’s available across
the nodes (the remaining allocatable pods).
Pods make requests higher than the availability of every single
node. For example, you have 8-core Nodes and you create a pod with a
16-core request. These pods might require reconfiguration and
specific setup related to Node affinity and anti-affinity
constraints.
Namespace quota is reached before making a high resource request.
If a quota is enforced at the Namespace level, you may hit the limit
independent of the resource availability across the Nodes.
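As a quick way to confirm which of these factors applies, you can, for example, list Pending pods and inspect the scheduler events for one of them (pod and namespace names are placeholders):
# List pods stuck in Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# The Events section explains why the scheduler cannot place the pod
# (insufficient CPU or memory, affinity constraints, exceeded quota, and so on)
kubectl describe pod <pod-name> -n <namespace>
# Check whether a namespace ResourceQuota is the limiting factor
kubectl describe quota -n <namespace>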
CPU Requests vs Allocatable
The chart shows the latest value returned by
sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores))
.
What Is It?
The chart displays the ratio between CPU requests configured for all the
pods in a selected Cluster and allocatable CPUs across all the nodes.
The upper bound shows the number of allocatable CPU cores across all the
nodes in the Cluster.
For instance, the image below shows that out of 620 available CPU cores
across all the nodes (allocatable CPUs), 71% is requested by the pods:

What to Expect?
Your resource utilization strategy determines what ratio you can expect.
A healthy ratio falls between 50% and 80%.
Assuming all the nodes have the same amount of allocatable resources, a
reasonable upper bound is the value of
(node_count - 1) / node_count x 100
. For example, the ratio will be
90% if you have 9 nodes. Having this percentage protects you against a
node becoming unavailable.
What to Do Otherwise?
A lower ratio indicates under-utilized resources (and corresponding
cost) in your infrastructure. A higher ratio indicates insufficient
resources. As a result
To triage, do the following:
Drill down to the Nodes screen to get insights into how
resources are utilized across all nodes.
Drill down to the Namespaces screen to understand how resources
are requested across Namespaces.
Drill down to Explore and refer to the following dashboards:
Kubernetes CPU Allocation Optimization: Evaluate whether a
significant amount of resources are under-utilized in the
infrastructure.
Kubernetes Workloads CPU Usage and Allocation: Determine
whether pods are properly configured and are using resources as
expected.
Can the Value Be Higher than 100%?
Currently, the ratio accounts only for scheduled pods, while pending
pods are excluded from the calculation. This means pods have been
scheduled to run on Nodes out of the allocatable pods. Consequently, the
ratio cannot be higher than 100%.
In the case of over-commitment (pods requesting for more resources than
what’s available), you can expect a higher Requests vs Allocatable
ratio and a lower Pods Available vs Desired ratio. What it indicates
is that most of the available resources are being used, and what’s left
is not enough to schedule additional pods. Therefore, the Available vs
Desired ratio for pods will decrease.
When your environment has pods that are updated often or that are
deleted and created often (for example, testing Clusters), the total
requests might appear higher than they actually are at any given time.
Consequently, the ratio becomes higher across the selected time window,
and you might see a value that is higher than 100%. This artifact is
due to how the data engine calculates the aggregated ratio.
Drill down to Kubernetes Cluster Overview to see the CPU Cores
Usage vs Requests vs Allocatable time series to correctly evaluate the
trend of the request commitments.
Listed below are some of the factors that could cause the pods to get
stuck in a Pending state:
Pods make requests that exceed what’s available across the nodes
(the remaining allocatable pods). The Requests vs Allocatable
ratio is an indicator of this issue.
Pods make requests that are higher than the availability of every
single Node. For example, you have 8-core Nodes and you create a pod
with a 16-core request. These pods might require reconfiguration and
specific setup related to Node affinity and anti-affinity
constraints.
The Quota set at the Namespace level is reached before a request is
configured. The Requests vs Allocatable ratio may not suggest
the problem, but the Pods Available vs Desired ratio would
decrease, especially for the specific Namespaces. See the
Namespaces screen in Overview.
Memory Requests vs Allocatable
The chart shows the latest value returned by
sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes))
.
What Is It?
The chart displays the ratio between memory requests configured for all
the pods in the Cluster and allocatable memory available across all the
Nodes.
The upper bound shows the allocatable memory available across all Nodes.
The value is expressed in bytes, displayed in a specified unit.
For instance, the image below shows that out of 29.7 GiB available
across all Nodes (allocatable memory), 35% is requested by the pods:

What to Expect?
Your resource utilization strategy determines what ratio you can expect.
A healthy ratio falls between 50% and 80%.
Assuming all the nodes have the same amount of allocatable resources, a
reasonable upper bound is the value of
(node_count - 1) / node_count x 100
. For example, 90% if you have 9
nodes. This ratio protects your system against a node becoming
unavailable.
What to Do Otherwise?
A lower ratio indicates under-utilized resources (and corresponding
cost) in your infrastructure. A higher ratio indicates insufficient
resources.
To troubleshoot, do the following:
Drill down to the Nodes screen to get insights into how
resources are utilized across all the Nodes.
Drill down to the Namespaces screen to understand how resources
are requested across Namespaces.
Drill down to Explore and refer to the following dashboards:
Kubernetes Memory Allocation Optimization: Evaluate whether
a significant amount of resources are under-utilized in the
infrastructure.
Kubernetes Workloads Memory Usage and Allocation: Determine
whether pods are properly configured and are using resources as
expected.
Can the Value be Higher than 100%?
The ratio currently accounts only for scheduled pods, while pending pods
are excluded from the calculation. What this implies is that pods have
been scheduled to run on Nodes out of the allocatable resources
available. Consequently, the ratio cannot be higher than 100%.
In the case of over-commitment (pods requesting for more resources than
what’s available), expect a higher Requests vs Allocatable ratio and
a lower Pods Available vs Desired ratio. What it indicates is that
most of the available resources have been used and what’s left is not
enough to schedule additional pods. Therefore, the Pods Available vs
Desired ratio will decrease.
When your environment has pods that are updated often or that are
deleted and created often (for example, testing Clusters), the total
requests might appear higher than they actually are at any given time.
Consequently, the ratio becomes higher across the selected time window,
and you might see a value that is higher than 100%. This artifact is
due to how the data engine calculates the aggregated ratio.
Drill down to Kubernetes Cluster Overview to see the Memory
Requests vs Allocatable time series to correctly evaluate the trend
for the request commitments.
Listed below are some of the factors that could cause your pods to get
stuck in a Pending state:
Pods make requests that exceed what’s available across the nodes
(the remaining allocatable pods). The Requests vs Allocatable
ratio is an indicator of this issue.
Pods make requests that are higher than the availability of every
single Node. For example, you have 8-core nodes and you create a pod
with a 16-core request. These pods might require configuration
changes and specific setup related to node affinity and
anti-affinity factors.
The Quota set at the Namespace-level is reached before a high
request is configured. The Requests vs Allocatable ratio might
not suggest the problem, but the Pods Available vs Desired ratio
would decrease, especially for the specific Namespaces. See the
Namespaces screen in Overview.
Compliance Score
Docker: The latest value returned by
avg(avg(compliance.docker-bench.pass_pct))
.
Kubernetes: The latest value returned by
avg(avg(compliance.k8s-bench.pass_pct))
.
What Is It?
The numbers show the percentage of benchmarks that succeeded in the
selected time window, respectively for Docker and Kubernetes entities.
What to Expect
If you do not have Sysdig Secure enabled, or you do not have benchmarks
scheduled, then you should expect no data available.
Otherwise, the higher the score, the more compliant your infrastructure
is.
What to Do Otherwise?
If the score is lower than expected, drill down to Docker Compliance
Report or Kubernetes Compliance Report to see further details
about benchmark checks and their results.
You may also want to use the Benchmarks / Results page in Sysdig
Secure to see the history of
checks.
2.1.2 -
Nodes Data
This topic discusses the Nodes Overview page and helps you understand
its gauge charts and the data displayed on them.
About Nodes Overview
A node refers to a worker machine in Kubernetes. A physical machine or
VM can represent a node. The Nodes Overview page provides key metrics
indicating the health, capacity, and compliance of each node in your
cluster.

In environments where Sysdig Secure is not enabled, Network I/O is shown
instead of the Compliance score.
Interpret the Nodes Data
This topic gives insight into the metrics displayed on the Nodes
Overview page.
Node Ready Status
The chart shows the latest value returned by
avg(min(kubernetes.node.ready))
.
What Is It?
The number expresses the Node readiness to accept pods across the
Cluster. The numeric availability indicates the percentage of time the
Node is reported ready by
Kubernetes.
For example:
100% is displayed when a Node is ready for the entire time window,
say, for the last one hour.
95% when the Node is ready for 95% of the time window, say, 57 out
of 60 minutes.
The bar chart displays the trend across the selected time window, and
each bar represents a time slice. For example, selecting “last 1 hour”
displays 6 bars, each indicating a 10-minute time slice. Each bar shows
the availability across the time slice (green) and the unavailability
(red).
For instance, the image below indicates the Node has not been ready for
the entire last 1-hour time window:

What to Expect?
The chart should show a constant 100% at all times.
What to Do Otherwise?
If the number is less than 100%, review the status reported by
Kubernetes. Drill-down to the Kubernetes Node Overview Dashboard in
Explore to see details about the Node readiness:

If the Node Ready Status has an alternating behavior, as shown in
the image, the node is flapping. Flapping indicates that the kubelet is
not healthy. See specific conditions reported by Kubernetes that would
help determine the causes for the Node not being ready. Such conditions
include network issues and memory pressure.
Pods Ready vs Allocatable
The chart reports the latest value of
sum(avg(kubernetes.pod.status.ready)) / avg(avg(kubernetes.node.allocatable.pods))
.
What Is It?
It is the ratio between available and allocatable pods configured on the
node, averaged across the selected time window.
The Clusters page includes a similar chart named Pods Available vs
Desired. However, the meaning is different:
The Pods Available vs Desired chart for Clusters highlights how
many pods you expect and how many are actually available. See
IsPodAvailable
for a detailed definition.
The Pods Ready vs Allocatable chart for Nodes indicates how many
pods can be scheduled on each Node and how many are actually ready.
The upper bound shows the number of pods you can allocate in the node.
See node
configuration.
For instance, the image below indicates that you can allocate 110 pods
in the Node (default configuration), but only 11 pods are ready:

What to Expect?
The ratio does not relate to resource utilization; rather, it measures
the pod density on each node. The more pods you have on a single node,
the more effort the kubelet has to put in to manage the pods, the
routing mechanism, and Kubernetes overall.
Provided the allocatable value is properly set, values lower than 80%
indicate a healthy status.
What to Do Otherwise?
Review the default maximum pods configuration of the kubelet to
allow more pods, especially if CPU and memory utilization is
healthy (see the kubectl sketch after this list).
Add more nodes to allow more pods to be scheduled.
Review kubelet process performance and Node resource utilization
in general. A higher ratio indicates high pressure on the operating
system and on Kubernetes itself.
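A minimal sketch (the node name is a placeholder) for comparing a Node's allocatable pods with the pods currently scheduled on it:
# How many pods the Node can host (the kubelet's allocatable pods value)
kubectl get node <node-name> -o jsonpath='{.status.allocatable.pods}'
# How many pods are currently scheduled on the Node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> --no-headers | wc -l
# Allocated CPU and memory requests versus allocatable are also summarized by:
kubectl describe node <node-name>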
CPU Requests vs Allocatable
The chart shows the latest value returned by
sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores))
.
What Is It?
The chart shows the ratio between the number of CPU cores requested by
the pods scheduled on the Node and the number of cores available to
pods. The upper bound shows the CPU cores available to pods, which
corresponds to the user-defined configuration for allocatable
CPU.
For instance, the image below shows that the Node has 16 CPU cores
available, out of which, 84% are requested by the pods scheduled on the
Node:

What to Expect?
Expect a value up to 80%.
Assuming all the nodes have the same amount of allocatable resources, a
reasonable upper bound is the value of
(node_count - 1) / node_count x 100
. For example, 90% if you have 9
nodes. Having a high ratio protects your system against a Node becoming
unavailable.
What to Do Otherwise?
A low ratio indicates the Node is underutilized. Drill up to the
corresponding cluster in the Clusters page to determine whether
the number of pods currently running is lower, or if the pods cannot
run for other reasons.
A high ratio indicates a potential risk of being unable to schedule
additional pods on the Node.
Drill down to the Kubernetes Node Overview Dashboard to
evaluate what Namespaces, Workloads, and pods are running.
Additionally, drill up in the Clusters page to evaluate whether
you are over-committing the CPU resource. You might not have enough
resources to fulfill requests, and consequently, pods might not be
able to run on the Node. Consider adding Nodes or replacing Nodes
with additional CPU cores.
Can the Value Be Higher than 100%?
Kubernetes schedules pods on Nodes where sufficient allocatable
resources are available to fulfill the pod request. This means
Kubernetes does not allow having a total request higher than the
allocatable. Consequently, the ratio cannot be higher than 100%.
Over-committing (pods requesting resources higher than the capacity)
results in a high Requests vs Allocatable ratio and a low Pods
Available vs Desired ratio at the Cluster level. What it indicates is
that most of the available resources are being used, consequently,
what’s available is not sufficient to schedule additional pods.
Therefore, Pods Available vs Desired ratio will also decrease.
Memory Requests vs Allocatable
The chart highlights the latest value returned by
sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes))
.
What Is It?
The ratio between the number of bytes of memory requested by the pods
scheduled on the Node and the number of bytes of memory available. The
upper bound shows the memory available to pods, which corresponds to the
user-defined allocatable memory
configuration.
For instance, the image below indicates the node has 62.8 GiB of memory
available, out of which, 37% is requested by the pods scheduled on the
Node:

What to Expect?
A healthy ratio falls under 80%.
Assuming all the nodes have the same amount of allocatable resources, a
reasonable upper bound is the value of
(node_count - 1) / node_count x 100
. For example, the ratio is 90% if
you have 9 nodes. Having a high ratio protects your system against a
node becoming unavailable.
What to Do Otherwise?
A low ratio indicates that the Node is underutilized. Drill up to
the corresponding cluster in the Clusters page to determine
whether the number of pods running is low, or if pods cannot run for
other reasons.
A high ratio indicates a potential risk of being unable to schedule
additional pods on the node.
Drill down to the Kubernetes Node Overview dashboard to
evaluate what Namespaces, Workloads, and pods are running.
Additionally, drill up in the Clusters page to evaluate
whether you are over-committing the memory resource.
Consequently, you don’t have enough resources to fulfill
requests, and pods might not be able to run. Consider adding
nodes or replacing nodes with more memory.
Can the Value be Higher than 100%?
Kubernetes schedules pods on nodes where sufficient allocatable
resources are available to fulfill the pod request. This means
Kubernetes does not allow having a total request higher than the
allocatable. Consequently, the ratio cannot be higher than 100%.
Over-committing (pods requesting more resources than are
available) results in a high Requests vs Allocatable ratio at the
Nodes level and a low Pods Available vs Desired ratio at the Cluster
level. What it indicates is that most of the resources are being used,
consequently, what’s available is not sufficient to schedule additional
pods. Therefore, Pods Available vs Desired ratio will also decrease.
Network I/O
The chart shows the latest value returned by
avg(avg(net.bytes.total))
.
What Is It?
The sparkline shows the trend of network traffic (inbound and outbound)
for a Node. The number indicates the most recent rate, expressed in
bytes per second.

For reference, the sparklines show the following number of steps
(sampling):
Last hour: 6 steps, each for a 10-minute time slice
Last 6 hours: 12 steps, each for a 30-minute time slice
Last day: 12 steps, each for a 2-hour time slice
What to Expect?
The metric highly depends on what type of applications run on the Node.
You should expect some network activity for Kubernetes related
operations.
Drilling down to the Kubernetes Node Overview Dashboard in
Explore will provide additional details, such as network activity
across pods.
2.1.3 -
Namespaces Data
This topic discusses the Namespaces Overview page and helps you
understand its gauge charts and the data displayed on them.
About Namespaces Overview
Namespaces
are virtual clusters on a physical cluster. They provide logical
separation between the teams and their environments. The Namespaces
Overview page provides key metrics indicating the health, capacity, and
performance of each Namespace in your cluster.

Interpret the Namespaces Data
This topic gives insight into the metrics displayed on the Namespaces
Overview screen.
Pod Restarts
The chart highlights the latest value returned by
avg(timeAvg(kubernetes.pod.restart.rate))
.
What Is It?
The sparkline shows the trend of pod restarts rate across all the pods
in a selected Namespace. The number shows the most recent rate of
restarts per second.

For instance, the image shows a rate of 0.04 restarts per second for the
last 2-hours, given the selected time window is one day. The trend also
suggests a non-flat pattern (periodic crashes).
For reference, the sparklines show the following number of steps
(sampling):
Last hour: 6 steps, each for a 10-minute time slice
Last 6 hours: 12 steps, each for a 30-minute time slice
Last day: 12 steps, each for a 2-hour time slice
What to Expect?
Expect 0 restarts for any pod.
What to Do Otherwise?
A few restarts across the last hour or larger time windows might not
indicate a serious problem. In the event of a restart loop, identify the
root cause as follows:
Drill down to the Workloads page in Overview to identify the
Workloads that have been stuck at a restart loop.
Drill down to the Kubernetes Namespace Overview to see a
detailed trend broken down by pods:

Pods Available vs Desired
The chart shows the latest value returned by
sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count))
.
What Is It?
The chart displays the ratio between available and desired pods,
averaged across the selected time window, in a given Namespace.
The upper bound shows the number of desired pods in the namespace.
For instance, the image below shows 42 desired pods that are available:

What to Expect?
Expect 100% on the chart.
If certain pods take a significant amount of time to become available
due to delays (image pull time, pod initialization, readiness probe) you
might temporarily see a ratio lower than 100%.
What to Do Otherwise?
Identify one or more Workloads that have low availability by
drilling down to the Workloads page.
Once you identify the Workload, drill down to the related dashboard
in Explore. For example, Kubernetes Deployment Overview to
determine the trend and the state of the pods.
For instance, in the following image, the ratio is 98% (3.93 / 4 x
100). The decline is due to an update that caused pods to be
terminated and consequently to be started with a newer version.

CPU Used vs Requests
The chart shows the latest value returned by
sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores))
.
What Is It?
The chart shows the ratio between the total CPU usage across all the
pods in the Namespace and the total CPU requested by all the pods.
The upper bound shows the total CPU requested by all the pods. The value
is expressed as the number of CPU cores.
For instance, the image below shows that the pods in a Namespace request
40 CPU cores, of which only 43% is being used (about 17 cores):

What to Expect?
The value you see depends on the type of Workloads running in the
Namespace.
Typically, values that fall between 80% and 120% are considered healthy.
Values higher than 100% are considered healthy for a relatively short
amount of time.
For applications whose resource usage is constant (such as background
processes), expect the ratio to be close to 100%.
For “bursty” applications, such as an API server, expect the ratio to be
less than 100%. Note that this value is averaged for the selected time
window, therefore, a usage spike would be compensated by an idle period.
What to Do Otherwise?
A low usage indicates that the application is not properly running (not
executing the expected functions) or the Workload configuration is not
accurate (requests are too high compared to what the pods actually
need).
A high usage indicates that the application is operating with a heavy
load or the workload configuration is not accurate (requests are too low
compared to what pods actually need).
In either case, drill down to the Workloads page to determine the
workload that requires a deeper analysis.
Can the Value Be Higher than 100%?
Yes, it can.
You can configure requests without limits, or requests lower than
the limits. In either case, you are allowing the containers to use
more resources than requested, typically to handle temporary
overloads.
Consider a Namespace with two Workloads with one pod each. Say, one
Workload is configured to request for 1 CPU core and uses 1 CPU core
(ratio of Used vs Request is 100%). The other Workload is
configured without any request and uses 1 CPU core. In this example,
2 CPU cores used to 1 CPU core requested ratio at the Namespace
level is 200%.
Memory Used vs Requests
The chart shows the latest value returned by
sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes))
.
What Is It?
The chart shows the ratio between the total memory usage across all pods
of the Namespace and the total memory requested by all pods.
The upper bound shows the total memory requested by all the pods,
expressed in a specified unit for bytes.
For instance, the image below shows that all the pods in the Namespace
request 120 GiB, of which only 24% is being used (about 29 GiB):

What to Expect?
It depends on the type of Workloads you run in the Namespace. Typically,
values that fall between 80% and 120% are considered healthy.
Values that are higher than 100% are considered normal for a relatively
short amount of time.
What to Do Otherwise?
A low usage indicates the application is not properly running (not
executing the expected functions) or the workload configuration is not
accurate (high requests compared to what the pods actually need).
A high usage indicates the application is operating with a high load or
the Workload configuration is not accurate (Fewer requests compared to
what the pods actually need).
Given the configured limits for the Workloads and the memory pressure on
the nodes, if the Workloads use more memory than what’s requested they
are at risk of eviction. See Exceed a Container’s
Limit
for more information.
In both cases, you may want to drill down to the Workloads page to
determine which Workload requires a deeper analysis.
Can the Value Be Higher than 100%?
Yes, it can.
You can configure requests without limits, or requests lower than
the limits. In either case, you are allowing the containers to use
more resources than requested, typically to handle temporary
overloads.
Consider a Namespace with two Workloads with one pod each. Say, one
Workload is configured to request for 1 GiB of memory and uses 1 GiB
(Used vs Request ratio is 100%). The other Workload is configured
without any request and uses 1 GiB. In this example, 2 GiB of Memory
Used to 1 GiB Requested ratio at the Namespace level is 200%.
Network I/O
The chart shows the latest value returned by
avg(avg(net.bytes.total))
.
What Is It?
The sparkline shows the trend of network traffic (inbound and outbound)
for all the pods in the Namespace. The number shows the most recent
rate, expressed in bytes per second.
For reference, the sparklines show the following number of steps
(sampling):

Last hour: 6 steps, each for a 10-minute time slice
Last 6 hours: 12 steps, each for a 30-minute time slice
Last day: 12 steps, each for a 2-hour time slice
What to Expect?
The type of applications running in the Namespace determines the metrics.
Drilling down to the Kubernetes Namespace Overview Dashboard in
Explore provides additional details, such as network activity across
pods.
2.1.4 -
Workloads Data
This topic discusses the Workloads Overview page and helps you
understand its gauge charts and the data displayed on them.
About Workloads Overview
Workloads, in Kubernetes terminology, refer to your containerized
applications. Workloads comprise Deployments, StatefulSets, and
DaemonSets within a Namespace.
In a Cluster, worker nodes run your application workloads, whereas the
master node provides the core Kubernetes services and orchestration for
application workloads. The Workloads Overview page provides the key
metrics indicating health, capacity, and compliance.

Interpret the Workloads Data
This topic gives insight into the metrics displayed on the Workloads
Overview page.
Pod Restarts
The chart displays the latest value returned by
sum(timeAvg(kubernetes.pod.restart.rate))
.
What Is It?
The sparkline shows the trend of Pod Restarts rate across all the pods
in a selected Workload. The number shows the most recent rate, expressed
in Restarts per Second.
For instance, the image below shows the trend for the last hour. The
number indicates that the rate of pod restarts is less than 0.01 for the
last 10 minutes.

For reference, the sparklines show the following number of steps
(sampling):
Last hour: 6 steps, each for a 10-minute time slice.
Last 6 hours: 12 steps, each for a 30-minute time slice.
Last day: 12 steps, each for a 2-hour time slice.
What to Expect?
A healthy pod will have 0 restarts at any given time.
What to Do Otherwise?
In most cases, fewer restarts in the last hour (or larger time windows)
do not indicate a serious problem. Drill down to the Kubernetes
Overview Dashboard related to the Workload in Explore. For
example, Kubernetes StatefulSet Overview provides a detailed trend
broken down by pods.

In this example, the number of restarts is constant (roughly every 5
minutes) and no pods are ready. This might indicate a crash loop
back-off; the kubectl sketch below shows one way to confirm it.
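A minimal sketch (names are placeholders) for confirming a restart loop from the command line:
# Sort pods by restart count to find the ones stuck in a loop
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'
# Inspect the logs of the previously crashed container instance
kubectl logs <pod-name> -n <namespace> --previous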
Pods Available vs Desired
The chart shows the latest value returned by
sum(avg(kubernetes.deployment.replicas.available)) / sum(avg(kubernetes.deployment.replicas.desired))
.
What Is It?
The chart displays the ratio between available and desired pods,
averaged across the selected time window, for all the pods in a given
Workload.
The upper bound shows the number of desired pods in the Workload.
For instance, the image below shows all the 42 desired pods are
available.

What to Expect?
You should typically expect 100%.
If certain pods take a significant amount of time to become available
(image pull time, pod initialization, readiness probe), then you may
temporarily see a ratio lower than 100%.
What to Do Otherwise?
Determine the Workloads that have low availability by drilling down to
the related Dashboard in Explore. For example, the Kubernetes
Deployment Overview helps understand the trend and the state of the
pods.

For instance, the image above shows that the ratio is 98% (3.93 / 4 x
100). The slight decline is due to an update that caused pods to be
terminated and consequently to be started with a newer version.
CPU Used vs Requests
The chart shows the latest value returned by
sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores))
.
What Is It?
The chart shows the ratio between the total CPU usage across all pods of
a selected Workload and the total CPU requested by all the pods.
The upper bound shows the total CPU requested by all the pods. The value
denotes the number of CPU cores.

In this image, the pods in the Workload request 40 CPU cores, of
which 43% is actually used (about 17 cores).
What to Expect?
It depends on the type of workload.
For applications (background processes) whose resource usage is
constant, expect the ratio to be around 100%.
For “bursty” applications, such as an API server, expect the ratio to be
lower than 100%. Note that the value is averaged for the selected time
window, therefore, a usage spike would be compensated by an idle period.
Generally, values between 80% and 120% are considered normal. Values
that are higher than 100% are deemed normal if observed only for a
relatively short time.
What to Do Otherwise?
A low usage indicates that the application is not properly running
(not executing the expected functions) or the Workload configuration
is not accurate (requests are too high compared to what the pods
actually need).
A high usage indicates that the load is high for applications or the
Workload configuration is not accurate (low requests compared to
what the pods actually need).
In either case, drill down to the Kubernetes Overview Dashboard
corresponding to the Workload in Explore. For example, the
Kubernetes Deployment Overview Dashboard provides insight into
resource usage and configuration.
Can the Value Be Higher than 100%?
Yes, it can.
Configuring CPU requests without limits or requests lower than
limits is permissible. In these cases, you are allowing the
containers to use more resources than requested, typically to handle
temporary overloads.
Consider a Workload with two containers. Say, one container is
configured to request for 1 CPU core and uses 1 CPU core (Used vs
Request ratio is 100%). The other is configured without any request
and uses 1 CPU core. In this example, the 2 CPU core Used to 1 CPU
core Requested ratio is 200% at the Workload level.
What Does “No Data” Mean?
If the Workload is configured with no requests and limits, then the
Usage vs Requests ratio cannot be computed. In this case, the chart will
show “no data”. Drill down to the Dashboard in Explore to evaluate
the actual usage.
You must always configure requests. Setting requests helps to detect
Workloads that require reconfiguration.
Kubernetes itself might expose Workloads with no requests or limits
configured. For example, the kube-system
Namespace can have Workloads
without requests configured.
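If one of your own Workloads is missing requests, one way to add them, shown here as a sketch with placeholder names and values, is kubectl set resources:
# Add CPU and memory requests (and limits) to an existing Deployment; values are illustrative
kubectl set resources deployment <deployment-name> -n <namespace> \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi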
Memory Used vs Requests
The chart shows the latest value returned by
sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes))
.
What Is It?
The chart shows the ratio between the total memory usage across all the
pods in a Workload and the total memory requested by the Workload.
The upper bound shows the total memory requested by all the pods,
expressed in the specified unit of bytes.

For instance, the image shows that the pods in the selected Workload
requested 120 GiB, of which 24% is actually used (about 29 GiB).
What to Expect?
The type of Workload determines the ratio. Values between 80% and 120%
are considered normal. Values that are higher than 100% are deemed
normal if observed only for a relatively short time.
What to Do Otherwise?
A low memory usage indicates that the application is not properly
running (not executing the expected functions) or the Workload
configuration is not accurate (requests are too high compared to what
the pods actually need).
A high memory usage indicates that the application load is high
or the Workload configuration is not accurate (requests are too low
compared to what the pods actually need).
Depending on the configured limits for the Workloads and the memory
pressure on the nodes, Workloads that use more memory than requested
are at risk of eviction. For more information, see Container’s Memory
Limit.
In either case, drill down to the Workloads page to determine the
Workload that requires a deeper analysis.
Can the Value Be Higher than 100%?
Yes, it can.
Configuring memory requests without limits or requests lower than
limits is permissible. In these cases, you are allowing the
containers to use more resources than requested, typically to handle
temporary overloads.
Consider a Workload with two containers. Say one container is
configured to request 1 GiB of memory and uses 1 GiB (the Used vs
Requested ratio is 100%), while the other is configured without any
request and uses 1 GiB of memory. In this example, the ratio of 2 GiB
used to 1 GiB requested is 200% at the Workload level.
What Does “No Data” Mean?
If the Workload is configured with no memory requests and limits, then
the Usage vs Requests ratio cannot be computed. In this case, the chart
will show “no data”. Drill down to the Dashboard in Explore to
evaluate the actual usage.
You must configure requests. It helps to detect Workloads that require
reconfiguration.
Kubernetes itself might expose Workloads with no requests or limits
configured. For example, the kube-system
Namespace can have Workloads
without requests configured.
Network I/O
The chart shows the latest value returned by
avg(avg(net.bytes.total))
.
What Is It?
The sparkline shows the trend of network traffic (inbound and outbound)
for the Workload. The number shows the most recent rate, expressed in
bytes per second and scaled to an appropriate unit.

For reference, the sparklines show the following number of steps
(sampling):
Last hour: 6 steps, each for a 10-minute time slice
Last 6 hours: 12 steps, each for a 30-minute time slice
Last day: 12 steps, each for a 2-hour time slice
What to Expect?
The type of application running in the Workload determines the metrics.
Drill down to the Kubernetes Overview Dashboard corresponding to the
Workload in Explore. For example, the Kubernetes Deployment
Overview Dashboard provides additional details, such as network
activity across pods.
3 -
Explore
About Explore
The Sysdig Monitor user interface centers around the Explore module,
where you perform the majority of infrastructure monitoring operations.
Sysdig Monitor automatically discovers your stack and presents pre-built views in the Metrics Explorer. Explore provides the ability to view and troubleshoot key metrics and entities of your infrastructure stack. You can drill down to any layer of your infrastructure hierarchy and view
granular-level data. Metrics Explorer allows you to run form queries and build infrastructure views by using interactive metric and label filtering.
Grouping controls how entities are organized in
Explore. Grouping is fully customizable by logical layers, such as
containers, Kubernetes clusters, or services.
In addition to the Explore interface, Sysdig provides a PromQL Query
Explorer and PromQL Library. They help you understand metrics and
corresponding labels and values clearly, to create queries faster, and
to build Dashboard and Alerts easily.
Benefits of Using Explore
Explore Interface
This section outlines the key areas of the interface and details
basic navigation steps.

There are several key areas highlighted in the image above:
Switch Products: This allows you to switch between Sysdig products.
Grouping: Groupings are hierarchical organizations of tags, allowing
users to organize their infrastructure views using the Grouping
Wizard in a logical hierarchy. For more information on groupings,
refer to Grouping, Scoping, and Segmenting
Metrics.
Modules: Quick links for each of the main Sysdig Monitor modules:
Explore, Dashboards, Alerts, Events, and Captures.
PromQL Query Explorer: Run PromQL queries to build your
infrastructure views and get an in-depth insight into what’s going
on. See PromQL Query
Explorer.
PromQL Query Library: Provides a set of out-of-the-box PromQL queries. See PromQL Library.
Management: Quick links for Sysdig Spotlight, help material, and the
user profile configuration settings.
Scope Filtering: Allows you to drill down the infrastructure
stack and retrieve all the components in a certain category in a
single organized element.
Search Metrics: Helps you select desired metrics and build a query with one click.
Time Navigation: Helps you customize the time window used for
displaying data.
Key Page Actions: Quick links to create alerts and dashboards.
Learn More
Learn more about using Explore in the following sections:
3.1 -
Metrics Explorer
Use the Metrics Explorer for advanced metric exploration and querying. In addition to the core functionalities (grouping, scope tree, metrics, and graphing) of Explore, Metrics Explorer provides you the ability to:
- Graph multiple metrics simultaneously for correlation. For example, CPU usage vs CPU limits.
- View ungrouped queries by default, showing the individual time series for a metric.
- View context-specific metrics for a selected scope. You no longer see “no data” for a selected metric.
- View metrics that are logically categorized with a metric namespace prefix.
- Display metrics at high resolution. For example, a 1-hour view now shows data at 10-second resolution instead of 1-minute.
About the Metrics Explorer UI
The main components of the Metrics Explorer UI are widgets, time navigation, dashboard, and time series panel.
You’ll find Metrics Explorer on the Explore slider menu on the Sysdig Monitor UI. Click Explore to display the slider.

Use Metrics Explorer
This section helps you drill down into your infrastructure stack for troubleshooting views and create alerts and dashboards by using Metrics Explorer.
Switch Groupings
Sysdig Monitor detects and collects the metrics associated with your
infrastructure once the agent is deployed in your environment. Use the
Explore UI to search, group, and troubleshoot your infrastructure
components.
To switch between available data sources:
On the Metrics Explorer tab, click the My Groupings drop-down menu:
Select the desired grouping from the drop-down list.
Groupings Editor
The Groupings Editor helps you create and manage your infrastructure groupings.

Filter Infrastructure (Scope Filtering)
You can drill down the infrastructure stack and get insight into the numerous metrics available to you at each level of your stack. These displays can be found by selecting a top-level infrastructure object, then using the scope filtering for relevant infrastructure objects and metrics filtering for desired metrics.
Sysdig Monitor displays only the metrics and dashboards that are relevant to the selected infrastructure object.
Metrics
You can view specific metrics for an infrastructure object by navigating the scope filtering and metrics filtering menus:
On the Metrics Explorer tab, open the scope filtering menu.
Select the infrastructure object you want to explore.
Navigate to Filter metrics.
Click the desired metrics.
The metric will instantly be presented on the form query and on the dashboard.
The scope of the metric, when viewed via the scope filtering menu, is set to the infrastructure object that you have selected.
Optionally, click Add Query, then click a metric to add additional queries.
You can perform all the operations, such as setting Time Aggregation, Show Top 50 and Bottom 50 time series, Group Rollup, Segmentation, and Unit of Value Returned by Query, as you would with a form query. See Building a Form-Based Query for more information.
Create an Alert
Build a form query as described in Metrics.
Click Create Alert.
If you have built multiple queries, you will be prompted to choose a single metric to be alerted on.
Select the metric you want to create an alert for.
Click Create Alert.
The New Metric Alert page will be displayed.
If the query’s group aggregation is set to none, the alert is created with the default group aggregation.
Complete creating the alert as described in Metric Alerts.
Create a Dashboard Panel
Build a form query as described in Metrics.
Click Create dashboard panel.
Select an existing dashboard or create a new dashboard by typing in a name.
Click Copy and Open.
The newly created dashboard will be displayed.
If the query’s group aggregation is set to none, the dashboard is created with the default group aggregation.
Optionally, continue with other operations as described in Managing Panels.
3.1.1 -
Groupings Editor
Groupings are hierarchical organizations of labels, allowing you to
organize your infrastructure views on the Explore UI in a logical
hierarchy.
An example grouping is shown below:

The example above groups the infrastructure into four levels. This
results in a tree view in the Groupings Editor with four levels, with
rows for each infrastructure object applicable to each level.
As each label is selected, Sysdig Monitor automatically filters out
labels for the next selection that no longer fit the hierarchy, to
ensure that only logical groupings are created.
Sysdig Monitor automatically organizes all the configured groupings that
are inapplicable to the current infrastructure under Inapplicable
Groupings.
Manage Groupings
You can perform the following operations using the Groupings Editor:
Search existing groupings
Create a new grouping
Edit an existing grouping
Rename a grouping
Share a grouping with the active team
Search for a Grouping
Do one of the following:
From Explore, click the Groupings drop-down. Either select the
desired grouping, or find it by scrolling down the list or by
using the search bar, and then select it.
Click Manage Groupings to open the Groupings Editor. Either
select the desired grouping, or find it by scrolling down the
list or by using the search bar, and then select it.
Create a New Grouping
In the Explore tab, click the Groupings drop-down, then
click Manage Groupings.
Open the Groupings Editor.
Click Add.
The New Groupings page is displayed.
Enter the following information:
Groupings Name: Set an appropriate name to identify the
grouping that you are creating.
Shared with Team: Select if you want to share the grouping
with the active team that you are part of.
Hierarchy: Determine the hierarchical representation of the
grouping by choosing a top-level label and subsequent ones.
Repeat adding the labels until there are no further layers
available in the infrastructure label hierarchy.
You can search for the label by entering the first few
characters in the Select label drop-down or scrolling down.
As you add labels, the preview displays associated components in
your infrastructure.
Check the preview to ensure that the label selection is correct.
Click Save & Apply.
Rename a Grouping
Renaming is allowed only for groupings that are owned by you. To rename
a shared grouping, create a copy of it and edit the name.
On Explore, click the Groupings drop-down. Search for the desired
grouping.
Click the Edit button next to the grouping.
Open the Groupings Editor.
Select the desired grouping.
You can either scroll down the list or use the search bar.
Click Edit.

The edit window is displayed on the screen.
Specify the new grouping name, then click Save & Apply to save
the changes.
Share a Grouping with Your Active Team
Custom groupings are owned by you, and therefore you can share them with
all the members of your active team. To share a default grouping, create
a custom grouping and use the Shared with Team option in the
Grouping Editor.
Click the Groupings drop-down and click Manage Groupings.
The Grouping Editor screen appears.
Highlight the relevant grouping and click Edit.
Click Shared with Team.
Click Save & Apply to save the changes.
3.1.2 -
Time Windows
By default, Sysdig Monitor displays information in Live mode. This means
that dashboards, panels, and the Explore views will be automatically
updated with new data as time passes, and will display the most recent
data available for the configured time window.
By default, time navigation will enter Live mode with an hour time
window.
The time window navigation bar provides users with quick links to common
time windows, as well as the ability to configure a custom time period
in order to review historical data.

As shown in the image above, the navigation bar provides:
Quick links for common time windows
- Metrics Explorer: five minutes, ten minutes, one hour, six hours, twelve hours, one day, and two weeks.
- Explore: ten seconds, five minutes, ten minutes, one hour, six hours, one day, and two weeks.
A custom time window configuration option.
A pause/play button to exit Live mode and freeze the data to a time
window, and to return to Live mode.
Step back/forward buttons to jump through a frozen time window to
review historical data.
Zoom in/out buttons to increase/decrease the time window (not applicable to Metrics Explorer).
The Time Navigation drop-down panel can be used to configure a specific time range. To configure a manual range:
Metrics Explorer
On the Metrics Explorer tab, click the custom panel on the time navigation bar.

Configure the start and end points, and click Save to save the changes.

Some limitations apply to custom time windows. Refer to Time Window Limitations for more information.
Explore
On the Explore tab, click CUSTOM on the time navigation bar.
Configure the start and end points, and click Adjust time to save the changes.

Some limitations apply to custom time windows. Refer to Time Window Limitations for more information.
Time Window Limitations
Some time window configurations may not be available in certain
situations. In these instances, a modification to the time window is
automatically applied, and a warning notification will be displayed:

There are two main reasons for a time window being unavailable. Both
relate to data granularity and specificity:
The time window specifies the granularity of data that has expired
and is no longer available. For example, a time window specifying a
one-hour time range from six months ago would not be available,
resulting in the time window being modified to a time range of at
least one day.
The time window specifies a granularity of data that is too high
given the size of the window, as a graph can only handle a certain
number of data points. For example, a multi-hour time range would
contain too many datapoints at one-minute granularity, and would
automatically be modified to 10-minute granularity.
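To put concrete numbers on this (an illustrative calculation, not a product limit): a 24-hour range at one-minute granularity would require 1,440 data points per time series, while the same range at 10-minute granularity needs only 144.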
3.1.3 -
Explore Workflows
While every user has unique needs from Sysdig Monitor, there are three
main workflows that you can follow when building out the interface and
monitoring your infrastructure.
Workflow One
This workflow assumes that an alert has not been triggered yet.
Start with Explore
, identify a problem area, then drill-down into
the data. This workflow is the most basic approach, as it begins with a
user monitoring the overall infrastructure, rather than with a specific
alert notification. The workflow tends to follow the following steps:
Organize the infrastructure with groupings.
Define key signals with alerts and dashboards to detect a problem.
Identify a problem area, and drill down into the data using
dashboards, metrics, and by adjusting groupings and scope as
necessary.
Workflow Two
Start with an event notification, and begin troubleshooting. This
workflow begins with an already configured alert and event being
triggered. Unlike workflow one, this workflow assumes that
pre-determined data boundaries have already been set:
Explore the event by adjusting time windows, scope, and
segmentation.
Identify the exact area of concern within the infrastructure.
Drill down into the data to troubleshoot the issue.
Workflow Three
Customize default dashboard panels to troubleshoot a potential issue.
This workflow assumes that an issue has been identified within one of
the default dashboards, but alerts have not been set up for the problem
area.
Copy the displayed panel to a new dashboard.
Create an alert based on the dashboard panel.
Configure a Sysdig Capture on demand.
3.2 -
PromQL Query Explorer
Use the PromQL Query Explorer to run PromQL queries and build
infrastructure views. It allows you to:
Write PromQL queries faster by automatically identifying the common
labels among different metrics.
See Run PromQL Queries Faster with Extended Label
Set.
Query metrics by leveraging advanced functions, operators, and
boolean logic.
Interactively modify the PromQL results by using visual label
filtering.
Use label filtering to visualize the common labels between metrics,
which is key when combining multiple metrics.
About the PromQL Explorer UI
The main components of the PromQL Query Explorer UI include widgets,
time navigation, and the dashboard and time series panels.
You’ll find the PromQL Query Explorer under the Explore tab on the
Sysdig Monitor UI.

PromQL Query
The PromQL field supports manually building PromQL queries. You can
manually enter simple or complex PromQL queries and build dashboards and
create alerts. The PromQL Query Explorer allows running up to 5 queries
simultaneously. With the query field, you can do the following:
Explore metrics and labels available in your infrastructure.
For example, calculate the number of bytes received on a selected
host:
sysdig_host_net_total_bytes{host_mac="0a:e2:e8:b4:6c:1a"}
Calculate the number of bytes received on all the hosts except one:
sysdig_host_net_total_bytes{host_mac!="0a:a3:4b:3e:db:a2"}
Compare current data with historical data:
sysdig_host_net_total_bytes offset 7d
Use arithmetic operators to perform calculations on one or more
metrics or labels.
For example, calculate the rate of incoming bytes and convert it to
bits:
rate(sysdig_host_net_total_bytes[5m]) * 8
Build complex PromQL queries.
For example, return the total network traffic summed across all the
network interfaces, grouped by container:
sum(rate(sysdig_host_net_total_bytes[5m])) by (container_id)
Label Filtering
Use label filtering to automatically identify common labels between queries
for vector matching. In the given example, you can see that A and B
metrics have only the host_mac
label in common.

You can also filter by using the relational operators available in the
time series table. Simply click the operator for it to be automatically
applied to the queries. Run the queries again to visualize the metrics.

Filtering simultaneously applies to all the queries in the PromQL Query
Explorer.
PromQL Query Explorer supports only time series (Timechart). You can run
advanced (PromQL) queries and build dashboard panels. PromQL Explorer
does not support building form-based queries.
Time Navigation
PromQL Query Explorer is designed around time. After a query has been
executed, Sysdig Monitor polls the infrastructure data every 10 seconds
and refreshes the metrics on the Dashboard panel. You select how to view
this gathered data by choosing a Preset interval and a time Range. For
more information, see Time
Navigation.
Legend
The legend is positioned on the upper right corner of the panel. Each
query will have associated legends listed in the same execution order.
Build a Query
On the Explore tab, click PromQL Query.

Enter a PromQL query manually.
sysdig_host_cpu_used_percent
Click Add Query to run multiple queries. You can run up to 5
queries at once.
sysdig_container_cpu_used_percent
Click Run Query or press command+Enter.
A dashboard will appear on the screen. You can either Copy to a
Dashboard
or Create an
Alert.

Copy to a Dashboard
Run a PromQL query.
Click Create > Create a Dashboard Panel.
Either select an existing Dashboard or enter the Dashboard name to
copy to a new Dashboard.
Click Copy and Open.

The new Dashboard panel with the given title will open to the
Dashboard tab.

You might want to continue with the Dashboard operations as given in
Dashboards.
Create an Alert
Run a PromQL query.
Click Create > Create Alert.
If you have multiple queries, select the query you want to create
the alert for.
A new PromQL Alert page for the selected query appears on the
screen.
Continue with PromQL
Alerts.
Remove a Query
Click the three dots next to the query field to remove the query.

Toggle Query Results
Click the respective query buttons, for example, A or B, to show or hide
query results.

3.3 -
PromQL Library
PromQL is a powerful language to query metrics, but it can be
challenging for beginners. To ease the learning curve, Sysdig
provides a set of curated examples, called the PromQL Library. It helps
you perform complex queries against your metrics with one click and get
insight into infrastructure problems in ways that were not previously
possible with Sysdig querying. For example, you can identify containers
above 90% of their limit or count pods per namespace.

The following categories are currently available to experiment with PromQL:
Kubernetes
Infrastructure
Troubleshooting
PromQL 101
Access PromQL Library
Log in to Sysdig Monitor.
Click Explore from the left navigation pane.
On the Explore tab, click PromQL Library.

The tab opens to a list of PromQL examples.
Use PromQL Library
Click Try me to open the PromQL Query Explorer. A visualization
corresponding to the query will be displayed. You can do the following
with the query:
Create a dashboard panel
Create an alert
See PromQL Query Explorer
for more information.
To copy a query, click the copy icon next to the query.
Filter PromQL Queries
Automatic tag filtering identifies common tags in the given examples.
You can use the following to filter queries:
Visual label filtering: Simply click the desired color-coded label
to filter queries based on tags.
Text search: Use the Text Search bar on the top-left navigation
pane.
Label search: Use the Label drop-down list on the top-left
navigation pane.
Filter using categories: Use the All Categories checkboxes.
3.4 -
(Deprecated) Using the Explore Interface
This section helps you navigate the Explore menu in the Sysdig Monitor
UI.
Switch Groupings
Sysdig Monitor detects and collects the metrics associated with your
infrastructure once the agent is deployed in your environment. Use the
Explore UI to search, group, and troubleshoot your infrastructure
components.
To switch between available data sources:
On the Explore tab, click the My Groupings drop-down menu:
Select the desired grouping from the drop-down list.
Groupings Editor
The Groupings Editor helps you create and manage your infrastructure
groupings.

Sysdig Monitor users can drill down into the infrastructure by using the
numerous dashboards and metrics available for display in the Explore UI.
These displays can be found by selecting an infrastructure object, and
opening the drill-down menu.
Sysdig Monitor only displays the metrics and dashboards that are
relevant to the selected infrastructure object.
Metrics
Sysdig Monitor users can view specific metrics for an infrastructure
object by navigating the drill-down menu:
On the Explore tab, open the drill-down menu.
Navigate to Search Metrics and Dashboard.
Select the desired metrics.
The metric will now be presented on the Explore UI, until the user
navigates away from it.
The scope of the metric, when viewed via the drill-down menu, is set
to the infrastructure object that you have selected.
Troubleshooting Views
The drill-down menu displays all the default dashboard
templates relevant to the
selected infrastructure object. These Troubleshooting Views are
broken into the following sections:
The scope of the Troubleshooting View, when viewed via the drill-down
menu, is set to the infrastructure object that you have selected from
the drill-down.
To navigate to the Troubleshooting Views:
On the Explore tab, select an infrastructure object.
Open the drill-down menu and select the desired infrastructure
element.
Navigate to Search Metrics and Dashboard.
Select the desired troubleshooting view.
The selected dashboard will now be presented on the screen, until
you navigate away from it.
Pin and Unpin the Drill-Down Menu
On the Explore tab, select an infrastructure object.
Open the drill-down menu.
Click Pin Menu to pin the menu to the Explore tab.
To unpin the menu, click Unpin Menu at the bottom of the menu.
4 -
Metrics
Metrics are quantitative values or measures that can be grouped/divided
by labels. Sysdig Monitor metrics are divided into two groups: default
metrics (out-of-the-box metrics associated with the system, orchestrator, and
network infrastructure), and custom metrics (JMX, StatsD, and multiple
other integrated application metrics).

Sysdig automatically collects all types of metrics, and auto-labels
them. Custom metrics can also have custom (user-defined) labels.
Out-of-the box, when an agent is deployed on a host, Sysdig
Monitor automatically begins collecting and reporting on a wide array of
metrics. The sections below describe how those metrics are conceptualized within the system.
In the following sections, you can also learn more about the metric types and the data aggregation techniques
supported by Sysdig Monitor:
4.1 -
Grouping, Scoping, and Segmenting Metrics
Data aggregation and filtering in Sysdig Monitor are done through the
use of assigned labels. The sections below explain how labels work, the
ways they can be used, and how to work with groupings, scopes, and
segments.
Labels
Labels are used to identify and differentiate characteristics of a
metric, allowing them to be aggregated or filtered for Explore module
views, dashboards, alerts, and captures. Labels can be used in different
ways:
To group infrastructure objects into logical hierarchies displayed
on the Explore tab (called groupings). For more information, refer
to
Groupings
.
To split aggregated data into segments. For more information, refer
to
Segments.

There are two types of labels:
Infrastructure labels
Metric descriptor labels
Infrastructure Labels
Infrastructure labels are used to identify objects or entities within
the infrastructure that a metric is associated with, including hosts,
containers, and processes. An example label is shown below:
Sysdig Notation
kubernetes.pod.name
Prometheus Notation
kubernetes_pod_name
The table below outlines what each part of the label represents:
Example Label Component | Description |
---|
kubernetes | The infrastructure type. |
pod | The object. |
name | The label key. |
Infrastructure labels are obtained from the infrastructure (including
from orchestrators, platforms, and the runtime processes), and Sysdig
automatically builds a relationship model using the labels. This allows
users to create logical hierarchical groupings to better aggregate the
infrastructure objects in the Explore module.
For more information on groupings, refer to the
Groupings.
Metric Descriptor Labels
Metric descriptor labels are custom descriptors or key-value pairs
applied directly to metrics, obtained from integrations like StatsD,
Prometheus, and JMX. Sysdig automatically collects custom metrics from
these integrations, and parses the labels from them. Unlike
infrastructure labels, these labels can be arbitrary, and do not
necessarily map to any entity or object.
Metric descriptor labels can only be used for segmenting, not grouping
or scoping.
An example metric descriptor label is shown below:
website_failedRequests:20|region='Asia', customer_ID='abc'
The table below outlines what each part of the label represents:
Example Label Component | Description |
---|
website_failedRequests | The metric name. |
20 | The metric value. |
region=‘Asia’, customer_ID=‘abc’ | The metric descriptor labels. Multiple key-value pairs can be assigned using a comma separated list. |
Sysdig recommends not using labels to store dimensions with high
cardinalities (numerous different label values), such as user IDs, email
addresses, URLs, or other unbounded sets of values. Each unique
key-value label pair represents a new time series, which can
dramatically increase the amount of data stored.
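As a hypothetical illustration using the format above, website_failedRequests:20|region='Asia' keeps cardinality bounded, whereas a label such as user_ID='a81b-42f7' would create a new time series for every user and should be avoided.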
Groupings
Groupings are hierarchical organizations of labels, allowing users to
organize their infrastructure views on the Explore tab in a logical
hierarchy. An example grouping is shown below:

The example above groups the infrastructure into four levels. This
results in a tree view in the Explore module with four levels, with rows
for each infrastructure object applicable to each level.
As each label is selected, Sysdig Monitor automatically filters out
labels for the next selection that no longer fit the hierarchy, to
ensure that only logical groupings are created.
The example below shows the logical hierarchy structure for Kubernetes:
Clusters: Cluster > Namespace > Replicaset > Pod
Namespace: Cluster > Namespace > HorizontalPodAutoscaler >
Deployment > Pod
Daemonsets: Cluster > Namespace > Daemonsets > Pod
Services: Cluster > Namespace > Service > StatefulSet >
Pod
Job: Cluster > Namespace > Job > Pod
ReplicationController: Cluster > Namespace >
ReplicationController > Pod

The default groupings are immutable: They cannot be modified or deleted.
However, you can make a copy of them that you can modify.
Unified Workload Labels
Sysdig provides the following labels to help improve your infrastructure
organization and make troubleshooting easier.
kubernetes_workload_name: Displays all the Kubernetes workloads
and indicates what type and name of workload resource (deployment,
daemonSet, replicaSet, and so on) it is.
kubernetes_workload_type: Indicates what type of workload
resource (deployment, daemonSet, replicaSet, and so on) it is.

The availability of these labels also simplifies Groupings. You do
not need different groupings for each type of workload; instead, you
can have a single grouping for workloads.
The labels allow you to segment metrics, such as sysdig_host_cpu_cores_used_percent, by kubernetes_workload_name to see CPU core usage for all the workloads, instead of having a separate query for segmenting by kubernetes_deployment_name, kubernetes_replicaset_name, and so on.
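For example, a minimal PromQL sketch of this kind of segmentation, assuming the metric carries the kubernetes_workload_name label in your environment:
avg by (kubernetes_workload_name) (sysdig_host_cpu_cores_used_percent)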
Learn More
Scopes
A scope is a collection of labels that are used to filter out or define
the boundaries of a group of data points when creating dashboards,
dashboard panels, alerts, and teams. An example scope is shown below:

In the example above, the scope is defined by two labels with operators
and values defined. The table below defines each of the available
operators.
Operator | Description |
---|
is | The value matches the defined label value exactly. |
is not | The value does not match the defined label value exactly. |
in | The value is among the comma separated values entered. |
not in | The value is not among the comma separated values entered. |
contains | The label value contains the defined value. |
does not contain | The label value does not contain the defined value. |
starts with | The label value starts with the defined value. |
The scope editor provides dynamic filtering capabilities. It restricts
the scope of the selection for subsequent filters by rendering valid
values that are specific to the previously selected label. Expand the
list to view unfiltered suggestions. At run time, users can also supply
custom values to achieve more granular filtering. The custom values are
preserved. Note that changing a label higher up in the hierarchy might
render the subsequent labels incompatible. For example, changing the
kubernetes_namespace_name > kubernetes_deployment_name hierarchy to
swarm_service_name > kubernetes_deployment_name is invalid, as these
entities belong to different orchestrators and cannot be logically
grouped.
Dashboards and Panels
Dashboard scopes define the criteria for what metric data will be
included in the dashboard’s panels. The current dashboard’s scope can be
seen at the top of the dashboard:

By default, all dashboard panels abide by the scope of the overall
dashboard. However, an individual panel scope can be configured for a
different scope than the rest of the dashboard.
For more information on Dashboards and Panels, refer to the
Dashboards documentation.
Alerts
Alert scopes are defined during the creation process, and specify what
areas within the infrastructure the alert is applicable for. In the
example alerts below, the first alert has a scope defined, whereas the
second alert does not have a custom scope defined. If no scope is
defined, the alert is applicable to the entire infrastructure.

For more information on Alerts, refer to the
Alerts documentation.
Teams
A team’s scope determines the highest level of data that team members
have visibility for:
If a team’s scope is set to Host
, team members can see all
host-level and container-level information.
If a team’s scope is set to Container, team members can only see
container-level information.
A team’s scope only applies to that team. Users that are members of
multiple teams may have different visibility depending on which team is
active.
For more information on teams and configuring team scope, refer to the
Manage Teams and Roles
documentation.
Segments
Aggregated data can be split into smaller sections by segmenting the
data with labels. This allows for the creation of multi-series
comparisons and multiple alerts. In the first image, the metric is not
segmented:

In the second image, the same metric has been segmented by container_id:

Line and Area panels can display any number of segments for any
given metric. The example image below displays the sysdig_connection_net_in_bytes
metric segmented by both container_id and host_hostname:

For more information regarding segmentation in dashboard panels, refer
to the Configure Panels
documentation. For more information regarding configuring alerts, refer
to the Alerts
documentation.
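As an illustrative PromQL sketch of the segmentation shown above (hedged; the aggregation function to use depends on the metric and your environment):
sum by (container_id, host_hostname) (sysdig_connection_net_in_bytes)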
The Meaning of n/a
Sysdig Monitor imports data related to entities such as hosts,
containers, and processes, and reports them in tables or panels
on the Explore and Dashboards UI, as well as in events, so you see a
variety of data across the UI. The term n/a can appear anywhere on the
UI where some form of data is displayed.
n/a is a term that indicates data that is not available or does not
apply to a particular instance. In Sysdig parlance, the term
signifies one or more entities defined by a particular label, such as
hostname or Kubernetes service, for which the label is invalid. In other
words, n/a collectively represents entities whose metadata is not
relevant to the aggregation and filtering techniques: Grouping, Scoping,
and Segmenting. For instance, a list of Kubernetes services might display
all the services as well as an n/a entry that includes all the
containers without the metadata describing a Kubernetes service.
You might encounter n/a sporadically in the Explore UI as well as in
drill-down panels, dashboards, events, and likely elsewhere on the
Sysdig Monitor UI when no relevant metadata is available for that
particular display. How n/a should be treated depends on the nature of
your deployment. The deployment itself is not affected by the entities
marked n/a.
The following are some of the cases that yield n/a on the UI:
Labels are partially available or not available. For example, a host
has entities that are not associated with a monitored Kubernetes
deployment, or a monitored host has an unmonitored Kubernetes
deployment running.
Labels that do not apply to the grouping criteria or at the
hierarchy level. For example:
Containers that are not managed by Kubernetes. The containers
managed by Kubernetes are identified with their container_name labels.
In certain groupings by DaemonSet, Deployments render n/a and
vice versa. Not all containers belong to both DaemonSet and
Deployment objects concurrently. Likewise, a Kubernetes
ReplicaSet grouping with the kubernetes_replicaset_name label
will not show StatefulSets.
In a kubernetes_cluster_name > kubernetes_namespace_name > kubernetes_deployment_name
grouping, the entities without the kubernetes_cluster_name
label yield n/a.
Entities are incorrectly labeled in the infrastructure.
Kubernetes features that are yet to be in sync with Sysdig
Monitor.
The format is not applicable to a particular record in the database.
4.2 -
Understanding Default, Custom, and Missing Metrics
Default Metrics
Default metrics come with various kinds of metadata that Sysdig Monitor
automatically knows how to label, segment, and display.
For example:
System metrics for hosts, containers, and processes (CPU used, etc.)
Orchestrator metrics (collected from Kubernetes, Mesos, etc.)
Network metrics (e.g. network traffic)
HTTP
Platform metrics (in some cases)
Default metrics are collected mainly from two sources: syscalls and
Kubernetes.
Custom Metrics
About Custom Metrics
Custom metrics generally refer to any metrics that the Sysdig Agent
collects from a third-party integration. The type of infrastructure
and applications integrated determine the custom metrics that the Agent
collects and reports to Sysdig Monitor. The supported custom metrics
include StatsD, JMX, Prometheus, and application check (app check) metrics.
Each metric comes with a set of custom labels, and additional labels can
be user-created. Sysdig Monitor simply collects and reports them with
minimal or no internal processing. The limit currently enforced is 3000
metrics per host. Use the metrics_filter
option in the dragent.yaml
file to remove unwanted metrics or to choose the metrics to report when
hosts exceed this limit. For more information on editing the
dragent.yaml
file, see Understanding the Agent Config
Files.
Unit for Custom Metrics
Sysdig Monitor detects the default unit of custom metrics automatically
from the delimiter suffix in the metric name. For example,
custom_expvar_time_seconds results in a base unit set to seconds. The
supported base units are byte, percent, and time. Custom metric names
should carry an appropriate delimiter suffix for Sysdig Monitor to
identify and configure the correct unit type.
Custom metrics will not be auto-detected and the unit will be incorrect
unless this naming convention is followed. For instance,
custom_byte_expvar will not yield the correct unit (that is, MiB)
because the unit does not appear as a suffix.
Editing the Unit Scale
You have the flexibility to change the unit scale by editing the
panel either on the Dashboard or in Explore.
Explore
From the Search Metrics and Dashboard drop-down, select the custom
metrics you want to edit the unit selection for, then click More
Options. Select the desired unit scale from the Metric Format
drop-down and click Save.

Dashboard
Select the Dashboard Panel
associated with the custom metrics you want to modify. Select the
desired unit scale from the Metrics drop-down and click Save.

Display Missing Data
Data can be missing for a few different reasons.
Sysdig Monitor allows you to configure the behavior of missing data in
Dashboards. Though metric type determines the default behavior, you can
configure how to visualize missing data and define it at the per-query
level. Use the No Data Display drop-down in the Options menu in
the panel configuration, and the No Data Message text box under the Panel tab. See Create a New
Panel for more information.
Consider the following guidelines:
Use the No Data Message text box under the Panel tab to enter a custom message when no data is available
to render on the panels. This custom message, which could include links in markdown format and line breaks,
is shown when queries return no data and reports no errors.
The No Data Display drop-down has only two options for the
Stacked Area timechart: gap and show as zero.
For form-based timechart panels, the default option for a metric
selection that does not contain a StatsD metric is gap.
Adding a StatsD metric to a query in a form-based timechart panel
will default the selected No Data Display type to show as zero,
which is the default option for form-based StatsD metrics.
You can change this selection to any other type.
The default display option is gap for PromQL Timechart panels.
The options for No Data Display are:
gap: The default option for form-based timechart panels where the
query’s metric selection does not contain a StatsD metric. gap is
the best visualization type for most use cases because a gap is easy
to spot and indicates a problem.

show as zero: The best option for StatsD metrics which are only
submitted sporadically. For example, batch jobs and count of errors.
This is the default display option for StatsD metrics in form-based
panels.

We do not recommend this option as setting zero could be misleading.
For example, this setting will report the value for free disk space
as 0% when the disk or host disappears, but in reality, the value is
unknown.
connect - solid: Use for measuring the value of a metric,
typically a gauge, where you want to visualize the missing samples
flattened.

The leftmost and rightmost visible data points can be connected as
Sysdig does not perform the interpolation.
connect - dotted: Use it for measuring the value of a metric,
typically a gauge, where you want to visualize the missing samples
flattened.

The leftmost and rightmost visible data points can be connected as
Sysdig does not perform the interpolation.
4.3 -
Metric Limits
Sysdig ensures that you see the metric information most relevant to
your monitored environment. To achieve this, limits are enforced on
the number of metrics that the datastore can store. Different limits
apply to different metric types and agent versions.
The default metric limits per agent are different from the subscription limit imposed on custom time series entitlement. Your entitlement limits per agent could be lower than the metric limits. For more information, see Time Series Billing.
View Metric Limits
The metric limits are automatically set by the Sysdig backend components
based on your plan, agent version, and backend configuration.
Use the Sysdig Agent Health & Status dashboard under Host
Infrastructure templates to view the metric limits for your account and the current usage per host for each
metric type.

The metric limits are exposed to the UI through the following agent
metrics.
Metrics | Description |
---|
statsd_dragent_metricCount_limit_appCheck | The maximum number of unique appCheck timeseries that are allowed in an individual sample from the agent per node. |
statsd_dragent_metricCount_limit_statsd | The maximum number of unique statsd timeseries that are allowed in an individual sample from the agent per node. |
statsd_dragent_metricCount_limit_jmx | The maximum number of unique JMX timeseries that are allowed in an individual sample from the agent per node. |
statsd_dragent_metricCount_limit_prometheus | The maximum number of unique Prometheus timeseries that are allowed in an individual sample from the agent per node. |
Learn More
4.4 -
Sysdig Info Metrics
Sysdig provides Prometheus-compatible info metrics to show infrastructure (sysdig_*_info
) and Kubernetes (kube_*_info
) labels. The info metrics are gauges with a value of 1 and have the _info suffix.
For example, querying sysdig_host_info
in PromQL Query will provide all labels associated with the host, such as:
agent_id
agent_tag_cluster
host_hostname
domain
host
host_domain
host_mac
instance_id
Although info metrics are available, all the metrics that are ingested by Sysdig agents are automatically enriched with the metadata, and you don’t need to do PromQL joins. For more information, see Run PromQL Queries Faster with Extended Label Set.
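That said, if you do want to attach an info label to another metric with a standard PromQL join, a minimal sketch (assuming both series expose the host_mac label in your environment) looks like:
sysdig_host_cpu_used_percent * on (host_mac) group_left (host_hostname) sysdig_host_info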
4.5 -
Manage Metric Scale
Sysdig provides several knobs for managing metric scale.
There are three primary ways in which you can include or exclude metrics,
should you encounter unwanted metric limits.
Include/exclude custom metrics by name filters.
See Include/Exclude Custom
Metrics.
Include/exclude metrics emitted by certain containers, Kubernetes
annotations, or any other container label at collection time.
See Prioritize/Include/Exclude Designated
Containers.
Exclude metrics from unwanted ports.
See Blacklist Ports.
4.6 -
Data Aggregation
Sysdig Monitor allows users to adjust the aggregation settings when
graphing or creating alerts for a metric, informing how Sysdig rolls up
the available data samples in order to create the chart or evaluate the
alert. There are two forms of aggregation used for metrics in Sysdig:
time aggregation and group aggregation.
Time aggregation is always performed before group aggregation.
Time Aggregation
Time aggregation comes into effect in two overlapping situations:
Charts can only render a limited number of data points. To look at a
wide range of data, Sysdig Monitor may need to aggregate granular
data into larger samples for visualization.
Sysdig Monitor rolls up historical data over time.
Sysdig retains rollups based on each aggregation type, to allow
users to choose which data points to utilize when evaluating older
data.
Sysdig agents collect 1-second samples and report data at 10-second
resolution. The data is stored and reported every 10 seconds with the
available aggregations (average, rate, min, max, sum) to make them
available via the Sysdig Monitor UI and the API. For time series charts
covering five minutes or less, data points are drawn at this 10-second
resolution, and any time aggregation selections will have no effect.
When an amount of time greater than five minutes is displayed, data
points are drawn as an aggregate for an appropriate time interval. For
example, for a chart covering one hour, each data point would reflect a
one minute interval.
At time intervals of one minute and above, charts can be configured to
display different aggregates for the 10-second metrics used to calculate
each datapoint.
Aggregation Type | Description |
---|
average | The average of the retrieved metric values across the time period. |
rate | The average value of the metric across the time period evaluated. |
maximum | The highest value during the time period evaluated. |
minimum | The lowest value during the time period evaluated. |
sum | The combined sum of the metric across the time period evaluated. |
In the example images below, the kubernetes_deployment_replicas_available
metric first uses the average for time aggregation:

Then uses the sum for time aggregation:

Rate and average are very similar and often provide the same result.
However, the calculation of each is different.
If time aggregation is set to one minute, the agent is expected
to retrieve six samples (one every 10 seconds).
In some cases, samples may be missing due to disconnections
or other circumstances. Suppose only four samples are
available: the average is calculated by dividing the sum of the
samples by four, while the rate is calculated by dividing it by six.
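As an illustrative calculation with made-up sample values: if the four available samples are 2, 4, 6, and 8 (sum 20), the average is 20 / 4 = 5, while the rate is 20 / 6 ≈ 3.3.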
Most metrics are sampled once for each time interval, resulting in
average and rate returning the same value. However, there will be a
distinction for any metrics not reported at every time interval. For
example, some custom statsd metrics.
Rate is currently referred to as timeAvg in the Sysdig Monitor API
and advanced alerting language.
By default, average is used when displaying data points for a time
interval.
Group Aggregation
Metrics applied to a group of items (for example, several containers,
hosts, or nodes) are averaged between the members of the group by
default. For example, three hosts report different CPU usage for one
sample interval. The three values will be averaged, and reported on the
chart as a single datapoint for that metric.
There are several different types of group aggregation:
Aggregation Type | Description |
---|
average | The average value of the interval’s samples. |
maximum | The maximum value of the interval’s samples. |
minimum | The minimum value of the interval’s samples. |
sum | The combined value of all of the interval’s samples. |
If a chart or alert is segmented, the group aggregation settings will be
utilized for both aggregations across the whole group, and aggregation
within each individual segmentation.
For example, the image below shows a chart for CPU% across the
infrastructure:

When segmented by proc_name, the chart shows one CPU% line for each
process:

Each line provides the average value for every process with the same
name. To see the difference, change the group aggregation type to sum:

The metric aggregation value shown beside the metric name is the
time aggregation. While the screenshot shows AVG, the group
aggregation is set to SUM.
Aggregation Examples
The tables below provide an example of how each type of aggregation
works. The first table provides the metric data, while the second
displays the resulting value for each type of aggregation.

In the example below, the CPU% metric is applied to a group of servers
called webserver. The first chart shows metrics using average
aggregation for both time and group. The second chart shows the metrics
using maximum aggregation for both time and group.

For each one-minute interval, the second chart renders the highest CPU
usage value found from the servers in the webserver group and from all
of the samples reported during the one-minute interval. This view can be
useful when searching for transient spikes in metrics over long periods
of time that would otherwise be missed with average aggregation.
The group aggregation type is dependent on the segmentation. For a view
showing metrics for a group of items, the current group aggregation
setting will revert to the default setting, if the Segment By
selection is changed.
4.7 -
Deprecated Metrics and Labels
Below is the list of metrics and labels that are discontinued with the introduction of the new metric store. We made an effort not to deprecate any metrics or labels that are used in existing alerts, but if you encounter any issues, contact Sysdig Support.
We have applied automatic mapping of all net.*.request.time.worst metrics to net.*.request.time, because the maximum aggregation gives equivalent results and it was almost exclusively used in combination with these metrics.
Deprecated Metrics
The following metrics are no longer supported.
net.request.time.file
net.request.time.file.percent
net.request.time.local
net.request.time.local.percent
net.request.time.net
net.request.time.net.percent
net.request.time.nextTiers
net.request.time.nextTiers.percent
net.request.time.processing
net.request.time.processing.percent
net.request.time.worst.in
net.request.time.worst.out
net.incomplete.connection.count.total
net.http.request.time.worst
net.mongodb.request.time.worst
net.sql.request.time.worst
net.link.clientServer.bytes
net.link.delay.perRequest
net.link.serverClient.bytes
Deprecated Labels
The following labels are no longer supported:
net.connection.client
net.connection.client.pid
net.connection.direction
net.connection.endpoint.tcp
net.connection.udp.inverted
net.connection.errorCode
net.connection.l4proto
net.connection.server
net.connection.server.pid
net.connection.state
net.role
cloudProvider.resource.endPoint
host.container.mappings
host.ip.all
host.ip.private
host.ip.public
host.server.port
host.isClientServer
host.isInstrumented
host.isInternal
host.procList.main
proc.id
proc.name.client
proc.name.server
program.environment
program.usernames
mesos_cluster
mesos_node
mesos_pid
In addition to this list, composite labels ending with the ‘.label’ string are no longer supported. For example, kubernetes.service.label is deprecated, but kubernetes.service.label.* labels are still supported.
4.8 -
Troubleshooting Metrics
Troubleshooting metrics include program metrics, connection-level network metrics, Kubernetes troubleshooting metrics, HTTP URL metrics, and some SQL metrics. They are reported at a granular 10-second resolution and are stored for 4 days. Below is the list of troubleshooting metrics and the labels that you can use to segment them.
Program Level Metrics
sysdig_program_cpu_cores_used
sysdig_program_cpu_cores_used_percent
sysdig_program_cpu_used_percent
sysdig_program_memory_used_bytes
sysdig_program_net_in_bytes
sysdig_program_net_out_bytes
sysdig_program_net_connection_in_count
sysdig_program_net_connection_out_count
sysdig_program_net_connection_total_count
sysdig_program_net_error_count
sysdig_program_net_request_count
sysdig_program_net_request_in_count
sysdig_program_net_request_out_count
sysdig_program_net_request_time
sysdig_program_net_request_in_time
sysdig_program_net_tcp_queue_len
sysdig_program_proc_count
sysdig_program_thread_count
sysdig_program_up
In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the following labels to segment program metrics: program_cmd_line and program_name.
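For example, a minimal PromQL sketch that segments program CPU usage by program name (assuming the label is populated in your environment):
sum by (program_name) (sysdig_program_cpu_cores_used)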
Connection-Level Network Metrics
sysdig_connection_net_in_bytes
sysdig_connection_net_out_bytes
sysdig_connection_net_total_bytes
sysdig_connection_net_connection_in_count
sysdig_connection_net_connection_out_count
sysdig_connection_net_connection_total_count
sysdig_connection_net_request_in_count
sysdig_connection_net_request_out_count
sysdig_connection_net_request_count
sysdig_connection_net_request_in_time
sysdig_connection_net_request_out_time
sysdig_connection_net_request_time
In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the following labels to segment connection-level metrics: net_local_service, net_remote_service, net_local_endpoint, net_remote_endpoint, net_client_ip, net_server_ip, and net_protocol.
Kubernetes Troubleshooting Metrics
kube_workload_status_replicas_misscheduled
kube_workload_status_replicas_scheduled
kube_workload_status_replicas_updated
kube_pod_container_status_last_terminated_reason
kube_pod_container_status_ready
kube_pod_container_status_restarts_total
kube_pod_container_status_running
kube_pod_container_status_terminated
kube_pod_container_status_terminated_reason
kube_pod_container_status_waiting
kube_pod_container_status_waiting_reason
kube_pod_init_container_status_last_terminated_reason
kube_pod_init_container_status_ready
kube_pod_init_container_status_restarts_total
kube_pod_init_container_status_running
kube_pod_init_container_status_terminated
kube_pod_init_container_status_terminated_reason
kube_pod_init_container_status_waiting
kube_pod_init_container_status_waiting_reason
HTTP URL Metrics
sysdig_host_net_http_url_error_count
sysdig_host_net_http_url_request_count
sysdig_host_net_http_url_request_time
sysdig_container_net_http_url_error_count
sysdig_container_net_http_url_request_count
sysdig_container_net_http_url_request_time
In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the net_http_url label to segment HTTP URL-level metrics.
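For example, a hedged PromQL sketch that ranks URLs by request rate, assuming the metric behaves as a counter in your environment so that rate() applies:
topk(10, sum by (net_http_url) (rate(sysdig_host_net_http_url_request_count[5m])))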
SQL Query Metrics
sysdig_host_net_sql_query_error_count
sysdig_host_net_sql_query_request_count
sysdig_host_net_sql_query_request_time
sysdig_host_net_sql_querytype_error_count
sysdig_host_net_sql_querytype_request_count
sysdig_host_net_sql_querytype_request_time
sysdig_container_net_sql_query_error_count
sysdig_container_net_sql_query_request_count
sysdig_container_net_sql_query_request_time
sysdig_container_net_sql_querytype_error_count
sysdig_container_net_sql_querytype_request_count
sysdig_container_net_sql_querytype_request_time
In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the net_sql_querytype label to segment SQL metrics by query type.
4.9 -
Prometheus Metrics Types
Sysdig Monitor transforms Prometheus metrics into usable, actionable
entries in two ways:
Calculated Metrics
The Prometheus metrics that are scraped by the Sysdig agent and
transformed into the traditional StatsD model are called calculated
metrics. In calculated metrics, the delta from the previous value is
stored. This delta is what Sysdig uses on the classic backend for metric
analysis and visualization. While generating the calculated metrics,
the gauge metrics are kept as they are, but the counter metrics are
transformed.
Prometheus calculated metrics cannot be used in PromQL.
The Histogram and Summary metrics are transformed into a different
format, called Prometheus histogram and summary metrics respectively. The
transformations include:
All of the quantiles are transformed into a different metric, with
the quantile added as a suffix.
The count and sum of these summary metrics are exposed as different
metrics with slightly changed names: the underscore (_) in the name is
replaced with a period (.). For more information, see Mapping
Classic Metrics and PromQL Metrics.
Prometheus calculated metrics (legacy metrics) are scheduled to be
deprecated in the coming months.
Raw Metrics
In Sysdig parlance, the Prometheus metrics that are scraped (by the
Sysdig agent), collected, sent, stored, visualized, and presented
exactly as Prometheus exposes them are called raw metrics. Raw metrics
are used with PromQL.
A Sysdig counter is a StatsD-type counter, where the difference in value is kept rather than the raw value of the counter, whereas Prometheus raw metrics are counters that are always monotonically increasing. A rate function needs to be applied to Prometheus raw metrics to make sense of them.
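For example, a minimal sketch (the 5-minute window is illustrative) that turns the sysdig_container_net_in_bytes raw counter, used elsewhere in this document, into a per-second rate:
rate(sysdig_container_net_in_bytes[5m])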
Time Aggregations Over Prometheus Metrics
The following time aggregations are supported for both the metric types:
Average: Returns an average of a set of data points, keeping all the
labels.
Maximum and Minimum: Returns a maximal or minimal value, keeping all
the labels.
Sum: Returns a sum of the values of data points, keeping all the
labels.
Rate (timeAvg): Returns the sum of changes to the counter across data points in a given time period, divided by time, keeping all the labels as they are. For Prometheus raw metrics, timeAvg is calculated by taking the difference and dividing it by time.
Prometheus Calculated Metrics
Prometheus calculated metrics are treated as gauges by Sysdig, so the time aggregations listed above are available, with one exception: Rate (timeAvg) is not available because it does not apply to gauge metrics.
Prometheus Raw Metrics
For raw metrics, the available time aggregations depend on whether the metric is a gauge or a counter.
5 -
Dashboards
Sysdig users can create customized dashboards to display the most useful
or relevant views and metrics for the infrastructure in a single
location. These feature-rich dashboards support both form-based and
PromQL-based queries and offer several user experience enhancements:
Multiple data queries per panel
Basic (form-based) and advanced (PromQL) data queries
Compare basic query result against historical data
Improved granularity of data shown in dashboards. For example, a 1-hour selection shows metrics at 10-second intervals.
Display up-to-date metrics without time re-alignment.
Query support:
Query multiple metrics
Render the results of a query (time series) as lines, bars, stacked areas, stairs, text, and so on
Scope and segment each query separately
Inherit, augment, or override the dashboard scope
Metric descriptor-based units, with the ability to override
Automatic Y-axis assignment based on the query unit type, with the ability to override
Each dashboard is composed of a series of panels configured to display
specific data in a number of different formats. Learn more about how
dashboards and panels are created, organized, and managed in the
following sections:
5.1 -
About the Dashboard UI
The main components of the Dashboard UI include widgets, time
navigation, and panels.

Dashboards support time series (Timechart), Histogram, Number graphs,
Table, Text, and Toplist.
Timechart, Number and Toplist graph support both form-based and advanced (PromQL)
queries, whereas Histogram and Table panels support building
only form-based queries. Form-based Number, Table, Histogram,
and Toplist panels can show either the latest value for an entity or the entire range of values.
Time Navigation
Dashboards are designed around time. After a query has been executed,
Sysdig Monitor polls the infrastructure data every 10 seconds and
refreshes the metrics on the Dashboard panel. You select how to view
this gathered data by choosing a Preset interval and a time Range.

Presets
Presets are a way of visualizing data that Sysdig Monitor gathers every 10 minutes. Select a preset to determine the data sample to be displayed. The following presets are supported:
10 Minutes
1 Hour
6 Hour
12 Hour
1 Day
4 Day
1 Week
2 Weeks
A preset that is 10 minutes or less is refreshed every 30 seconds. A preset that is greater than 10 minutes is refreshed every 10 seconds.
Presets work in conjunction with Range selections. Selecting a
particular preset interval refreshes Range selection and reloads the
data subsequently. For example:
10 Minutes: Resets the Range to December 9, 2.20 pm - December 9,
2.30 pm.
6 Hour: Resets the Range to December 9, 8.30 am - December 9, 2.30
pm.
1 Day: Resets the Range to December 8, 2.30 pm - December 9, 2.30
pm.
Range
Range shows both the date and time interval as well as the selected Preset in parentheses. The Range indicated on the UI is determined by the Preset. The time given is the closest time interval; by default, it is the current date and time offset by 1 hour.
Click on the Range tab to open a calendar to select a range.

See Presets to understand how Range works with Presets.
Live
The Live badge shows whether the data shown is Live or Paused.
Live: The data is continuously updated based on the 10-minute polling of the Sysdig backend. The Overview feed is normally always Live.
Paused: When a specific row is selected, data refresh pauses and the rows are not updated with new incoming data.
Dashboards support UTC and PDT time formats. Use the toggle button next
to Range to change the time format for the slot shown in Range. The
default is PDT.
Panel Properties
Query
With the Dashboard, you can construct queries in two ways: Form-Based and Advanced (PromQL). As you construct your query and type a keyword in the Metrics field, auto-complete offers suggestions for the metrics in the query.
Form-Based Query
Use the UI fields to construct queries. Form-based data queries consist of one metric with time and group aggregation, Segmentation, Display, Unit for both incoming data and data displayed on the Y-Axis, and Scope. You can choose to inherit the Dashboard scope.

Form-based queries support both Sysdig dot notation and
Prometheus-compatible underscore notation.
PromQL Query
The PromQL field supports only PromQL queries. Manually enter a PromQL
query as follows:

Each query starts with a group aggregator, followed by a time
aggregator, then the metrics and segmentation. For example:
topk(10,avg(avg_over_time(sysdig_program_cpu_cores_used{$__scope}[$__interval])) by (program_name, container_name))
Alternatively, you can build a form-based query and translate it to PromQl by using the Translate to PromQL option.
For more information, see Build PromQL Panels from Form Query.
$__interval
You can use $__interval within a PromQL query to use the most appropriate sampling depending on the time range you have selected. This configuration ensures that the most granular data is accessible while downsampling when you select a long time range, so that panels load as fast as possible.
Scope variables
You can configure scope variables at the dashboard level to quickly
filter metrics based on Cluster, Namespace, Workload, and more.

When using PromQL queries, you can select the scope by using dynamic
variables. This configuration is significant when troubleshooting as it
allows you to switch context quickly without reconfiguring queries.

$__scope
You can use $__scope within a PromQL query to apply a selected scope. It allows you to apply the whole scope instead of applying each scope variable individually to the query. See [Using $__scope](en/docs/sysdig-monitor/dashboards/dashboard-scope/#using-__Scope).
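A minimal sketch of a query that applies the dashboard scope through the variable (the metric name is taken from examples elsewhere in this document):
avg(sysdig_host_cpu_used_percent{$__scope})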
Smart Autocompletion and Syntax Highlighting
Autocomplete suggests metrics, operators, and functions, while syntax
highlighting helps highlight problems within a PromQL query. This is
invaluable in dynamic environments and allows you to craft the right
queries faster.
Define Axes
Sysdig Monitor provides the flexibility to add two Y-axes to the graph. You can also determine whether you want to use them at all. The option to add an extra Y-axis helps when you decide to add an extra query.
Specify the following for both Y-Axis and Y-Axis Right:
Show: Select to show the Y-Axis on the graph.
Scale: Specify the scale in which you want the data to be shown
on the graph.
Unit: Specify the unit of scale for the incoming data.
Display Format: Specify the unit of scale for the data to be
displayed on the Y-Axis.
Y-Max: Specify the highest value to be displayed on the Y-Axis.
Consider this as the highest point on the range. You can specify the
limits as numeric values. However, the type of values that you
specify must match the type of values along the axis. Y-Max should always be greater than Y-Min.
Y-Min: Specify the lowest value to be displayed on the Y-Axis.
Consider this as the lowest point on the range. You can specify both
limits or you can specify one limit and let the axes automatically
calculate the other.
Define Legend
Determine whether you want a legend with a descriptive label for each
plotted time series. Specify the location and layout. Determine the
value to be displayed should be the most recently calculated data.
For the labels, the legend uses the text you have specified in the
Query Display Name and Timeseries Name fields.

Enable Show to show the legend or create a legend if one does not
exist.
Right positions the legend in the upper right corner of the panel.
Bottom positions the legend in the lower-left corner of the panel.
Define Panel
Specify the Panel heading and description by using the Panel tab.
The description you enter appears as the panel information as follows:

5.2 -
Using PromQL
PromQL is available only in Sysdig SaaS editions. The feature is not yet
supported by Sysdig on-premises installations.
The Prometheus Query Language (PromQL) is the de facto standard for querying Prometheus metric data. PromQL is designed to allow the user to select and aggregate time-series data.
Sysdig Monitor’s PromQL support includes all of the features, functions,
and aggregations in standard open-source PromQL. The PromQL language is
documented at Prometheus Query
Basics.
For new functionalities released as part of agent v10.0.0, see Collect Prometheus Metrics.
Construct a PromQL Query
In the Dashboard Panel, select the PromQL type to query data using
PromQL.

Display: Specify the following:
Type: Select the type of chart. The supported types are Stacked Area and Line. This option is currently not supported for other visualization types.
Query Display Name: A meaningful display name for the legend. The text you enter replaces the default legend title, which is the query itself.
Timeseries Name: A display name of the time series for the query using text and any label values returned with the metric.
Query: Enter one or more PromQL queries directly. For example:
sum(rate(sysdig_container_net_in_bytes{$__scope}[$__interval])) by (container_id,agent_id)
Specify the following:
Metrics: Search for the desired metric. The field supports auto-complete; as you type, matching metrics are suggested so you can filter the metric easily. In this example: sysdig_container_net_in_bytes.
Segmentation: The process of categorizing aggregated data with labels to provide precise control over the data. Choose an appropriate value for segmenting the aggregated PromQL data. In this example: container_id and agent_id.
The PromQL query field supports the following reserved variables. The variables are replaced in the UI in real time. The expressions are translated into PromQL format and applied to the query while fetching the data.
$__range: Represents the time range currently selected in the time navigation. In Live mode, the value is constantly updated to reflect the new time range.
$__interval: Represents a time interval and is automatically configured based on the time range.
$__scope: Represents the selected scope that is applied to a PromQL query. The defined scope is applied by using the filter functionality of PromQL, similar to how scope variables are applied. It allows you to apply the whole scope to the queries, instead of applying each scope variable individually.
Options: Specify the following:
Unit and Y-Axes: Specify the unit of scale and display format.
No Data Display: Determine how to display null data on the dashboard.
Axes: Determine the scale, unit, display format, and gauge for the Y-axes.
Legend: Determine the position of the legend in the Dashboard.
Panel: Specify a name and add details about the panel.
See Create a New Panel for details.
You can use the Translate to PromQL option to quickly build a PromQL-based panel from form queries. To do so:
Build a form query, as described in Building a Form-Based Query. For example, build a Toplist for the sysdig_program_cpu_cores_used metric, segmented by program_name and container_name.
For Sorting, choose Top.
Click Translate to PromQL.
If a PromQL query is already defined, you will see a message warning that you are overriding manually created or manually modified queries in the PromQL tab.
Click Continue to proceed. The PromQL Toplist panel is displayed on screen.
Apply a Dashboard Scope to a PromQL Query
The dashboard scope is automatically applied only to form-based panels.
To scope a panel built from a PromQL query, you must use a scope
variable within the query. The variable will take the value of the
referenced scope parameter, and the PromQL panel will change
accordingly.
There are two predefined variables available:
$__interval: Represents the time interval defined based on the time range. This helps adapt the time range for different operations, such as rate and avg_over_time, and prevents displaying empty graphs due to changes in the granularity of the data.
$__range: Represents the time interval defined for the dashboard. This is used to adapt operations such as calculating an average over the selected time frame.
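For instance, a minimal sketch that uses $__range to average a gauge over the entire selected time frame:
avg_over_time(sysdig_host_cpu_used_percent[$__range])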
The following examples show how to use scope variables within PromQL
queries.
Example: CPU Used Percent
The following query returns the CPU used percent for all the hosts, regardless of the scope configured at the dashboard level, with a moving average depending on the time span defined.
avg_over_time(sysdig_host_cpu_used_percent[$__interval])
To scope this query, you must set up an appropriate scope variable. A
key step is to provide a variable name that is referenced as part of the
query.

In this example, hostname is used as the variable name. The host can then be referenced using $hostname as follows:
avg_over_time(sysdig_host_cpu_used_percent{host_name=$hostname}[$__interval])
Depending on the operator specified while configuring scope values, you might need to use a different operator within the query. If you do not use the correct operator for the scope type, the system still performs the query but shows a warning, as the results may not be the ones you expect. For example, a scope that matches a single exact value uses the equality matcher:
sysdig_host_cpu_used_percent{host_name=$hostname}
whereas a scope that can match multiple values uses the regular-expression matcher:
sysdig_host_cpu_used_percent{host_name=~$hostname}
Enrich Metrics with Labels
Running PromQL queries in Sysdig Monitor by default returns only a
minimum set of labels. To enrich the return results of PromQL queries
with additional labels, such as Kubernetes cluster name, you need to use
a vector matching operation. The vector matching operation in Prometheus
is similar to the SQL-like join operation.
Info Metrics
Prometheus returns different information metrics that have a value of 1
with several labels. The information that the info metrics return might
not be useful as it is. However, joining the labels of an info metric
with a non-info metric can provide useful information, such as the value
of metric X across an infrastructure/application/deployment.
Vector Matching Operation
The vector matching operation is similar to an SQL join. You use a
vector matching operation to build a PromQL query that can return
metrics with information from your infrastructure. Vector matching helps
filter and enrich labels, usually adding information labels to the
metrics you are trying to visualize.
See Mapping Between Classic Metrics and PromQL
Metrics for a list of info
metrics.
Example 1: Return a Metric Filtered by Cluster
This example shows a metric returned by an application, say myapp_gauge, running on Kubernetes. The query attempts to get an aggregated value for a cluster, with one cluster selected in the scope. We assume that you have previously set a $cluster variable in your scope.
To do so, run the following query to return the myapp_gauge metric:
sum (myapp_gauge * on (container_id) kube_pod_container_info{cluster=$cluster})
The query performs the following operations, not necessarily in this
order:
The kube_pod_container_info info metric is filtered, selecting only those timeseries and the associated cluster values you want to see. The selection is based on the cluster label.
The myapp_gauge
metric is matched with the
kube_pod_container_info
metric where the container_id
label has
the same value, multiplying both the values. Because the info metric
has the value 1, multiplying the values doesn’t change the result.
As the info metric has already been filtered by a cluster, only
those values associated with the cluster will be kept.
The resultant timeseries with the value of myapp_gauge
are then
aggregated with the sum function and the result is returned.
Example 2: Calculate the GC Latency
This example shows calculating the GC latency in a go application
deployed on a specific Kubernetes namespace.
To calculate the GC latency, run the following query:
go_gc_duration_seconds * on (container_id,host_mac) group_left(pod,namespace) kube_pod_container_info{namespace=~$namespace}
The query is performing the following operations:
The kube_pod_container_info
info metrics are filtered based on the
namespace variable.
The metrics associated with go_gc_duration_seconds are matched in a many-to-one way with the filtered kube_pod_container_info.
The pod and namespace labels are added from the
kube_pod_container_info
metric to the result. The query keeps only
those metrics that have the matching container_id
and host_mac
labels on both sides.
The values are multiplied and the resulting metrics are returned.
The new metrics will only have the values associated with
go_gc_duration_seconds
because the info metric value is always 1.
You can use any Prometheus metric in the query. For example, the query
above can be rewritten for a sample Apache metric as follows:
appinfo_apache_net_bytes * on (container_id) group_left(pod, namespace) kube_pod_container_info{namespace=~$namespace}
Example 3: Calculate Average CPU Used Percent in AWS Hosts
This example shows calculating the average CPU used percent per AWS
account and region, having the hosts filtered by account and region.
avg by(region,account_id) (sysdig_host_cpu_used_percent * on (host_mac) group_left(region,account_id) sysdig_cloud_provider_info{account_id=~$AWS_account, region=~$AWS_region})
The query performs the following operations:
Filters the sysdig_cloud_provider_info
metric based on the
account_id
and region
labels that come from the dashboard scope
as variables.
Matches the sysdig_host_cpu_used_percent
metrics with
sysdig_cloud_provider_info
. Only those metrics with the same
host_mac
label on both sides are extracted, adding region
and
account_id
labels to the resulting metrics.
Calculates the average of the new metrics by account_id
and
region
.
Example 4: Calculate Total CPU Usage in Deployments
This example shows calculating the total CPU usage per deployment. The
value can also be filtered by cluster, namespace, and deployment by
using the dashboard scope.
sum by(cluster,namespace,owner_name) ((sysdig_container_cpu_cores_used * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info) * on(pod,namespace,cluster) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~$deployment,cluster=~$cluster,namespace=~$namespace})
sysdig_container_cpu_cores_used
can be replaced by any metric that
has the container_id
label.
To connect the sysdig_container_cpu_cores_used
metric with the
pod, use kube_pod_container_info
and then, use
kube_pod_owner
to connect the pod to other kubernetes objects.
The query performs the following:
sysdig_container_cpu_cores_used * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info
:
The sysdig_container_cpu_cores_used
metric value is multiplied
with kube_pod_container_info
(which has the value of 1), by
matching container_id
and by keeping the pod, namespace and
cluster labels as it is.
The original timeseries has labels similar to:
__name__='sysdig_container_cpu_cores_used', container='<label>', container_id='<label>', container_type='DOCKER', host_mac='<label>'
The new metrics will be:
cluster='<label>', container='<label>', container_id='<label>', container_type='DOCKER', host_mac='<label>', namespace='<label>', pod='<label>'
The value extracted from the previous result is multiplied with
kube_pod_owner
(which has the value of 1) by matching on the
pod, namespace, and cluster labels and keeping the owner name from
the value of kube_pod_owner
. The owner can be deployment,
replicaset, service, daemonset, or statefulset object.
The name of the deployment to filter upon is extracted from the
kube_pod_owner
metrics.
The pod, namespace, and cluster names are extracted from the
kube_pod_container_info
metrics.
The new metrics will be:
cluster='<matched_label>', container='<matched_container_label>', container_id='<label>', container_type='DOCKER', host_mac='<label>', namespace='<label>', owner_name='<label>', pod='<label>'
The kube_pod_owner
will have a label owner_name
that is the name
of the object that owns the pod. This value is extracted by
filtering:
kube_pod_owner{owner_kind="Deployment",owner_name=~$deployment,cluster=~$cluster,namespace=~$namespace}
The owner_kind
provides the deployment name and the origin of
owner_name
, that is the dashboard scope.
The sum aggregation is applied and the time series are aggregated by
cluster, namespace, and deployment.
The following table helps understand the labels applied in each step of
the query:
Query | __name__ | container | container_id | container_type | host_mac | pod | cluster | namespace | owner_name |
---|
sysdig_container_cpu_cores_used * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |
(sysdig_container_cpu_cores_used * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info) * on(pod,namespace,cluster) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~$deployment,cluster=~$cluster,namespace=~$namespace} | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
sum by(cluster,namespace,owner_name) ((sysdig_container_cpu_cores_used * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info) * on(pod,namespace,cluster) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~$deployment,cluster=~$cluster,namespace=~$namespace}) | No | No | No | No | No | No | Yes | Yes | Yes |
Sysdig Monitor supports percentages only as 0-100 values. For calculated ratios, you can skip multiplying the whole query by 100 by selecting percentage as a 0-1 value.
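For example, a minimal sketch of an error-rate ratio built from the HTTP URL metrics listed earlier, multiplied by 100 to express it as a 0-100 percentage, assuming the count metrics are raw counters (the 5-minute window is illustrative):
100 * sum(rate(sysdig_host_net_http_url_error_count[5m])) / sum(rate(sysdig_host_net_http_url_request_count[5m]))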
Learn More
5.3 -
Dashboard Scope
Dashboard and panel scope defines what data is valid for aggregation and
display within the dashboard. The scope can be set at a dashboard-wide
level, or overridden for individual panels, by any user type except for
View Only users.
The current scope is displayed in the top left-hand corner of the module
screen:

For more information on how scopes work, refer to the Grouping,
Scoping, and Segmenting
Metrics documentation.
To configure the scope of an existing dashboard:
From the Dashboard
module, select the relevant dashboard from the
dashboard list.
Click the Edit Scope
link in the top right of the module screen:

Open the first level drop-down menu.
Select the labels either by clicking the desired label,
or searching for the label, then clicking it.
Select one or more labels values from the drop-down.
The scope editor restricts the scope of the selection for subsequent filters by rendering values that are specific to the selected labels. For example, if the value of the kube_namespace_name label is kube-system, the values of the subsequent label, container_name, will be filtered by kube-system. This means the containers rendered for filtering are only those that are part of the kube-system namespace.
Optional: Dashboard Templating.
Dashboard scope values can be defined as variables, allowing users
to create a template, and use one dashboard for multiple outputs.
Optional: Add additional label/value combinations to further
refine the scope.
Click Save to save the new scope, or click the Cancel button to revert the changes.
To reset the dashboard scope to the entire infrastructure, or to
update an existing dashboard’s scope to the entire infrastructure,
click Clear All.
To configure the scope of an existing dashboard panel:
From the Dashboard module, select the relevant dashboard from the dashboard list.
Click Edit (pencil) icon:
From the query field associated with the metric, click Scope.
Select the labels either by clicking the desired label,
or searching for the label, then clicking it.
Select one or more label values.
Optionally, apply panel scope to all the queries.
Click Save to confirm the changes.
Using $__scope
The Scope variable is indicated by $__scope and can be used in PromQL queries. The variable represents a scope that you have already defined. When you insert the $__scope variable into a PromQL expression, the selected scope is applied to the query you have built. The scope variable allows you to apply the whole scope to the query, instead of applying each scope variable individually.

If you select Entire Infrastructure as the scope, no scope will be applied.
5.4 -
There are two parts to creating a dashboard - creating the dashboard
itself, and creating the panels that display the information.
5.4.1 -
Create a New Dashboard
You can create a dashboard in any of the following ways:
Using the Get Started wizard.
Using a dashboard template. Dashboard templates are essentially immutable dashboards that can't be edited, and their scope is fixed. You can copy them and customize them as desired. See Dashboard Templates.
Using the Dashboard tab directly. This section helps you navigate to the default panel editor screen.
Get Started Wizard
Clicking Create Dashboard takes you to the default panel editor screen.
Dashboard Tab
On the Dashboards tab, click Add Dashboard.
Select one of the following:
From Dashboard Template: Copy from a dashboard template.
Blank Dashboard: When you create a new dashboard, you are dropped into the panel editor, where the default panel shows the avg(avg(sysdig_container_cpu_used_percent)) metric.
Specify a name for the dashboard, build a query, and save.
For information on running queries, see the following:
The new dashboard will now be added to the side panel under My
Dashboards and is ready for configuration.
5.4.2 -
Dashboard Templates
Sysdig provides a number of pre-built dashboards, designed around
various supported applications, network topologies, infrastructure
layouts, and services. These can be used to jump-start the dashboard
building process, as templates for further configuration.
Templates come with a series of panels already configured, based on the information most relevant to users. The example below uses the Container
dashboard template:

The default dashboard includes number panels for CPU and memory usage and total network bytes in and out, line graphs comparing network bytes in and out, and byte usage by application/port, by process, and by host.
To learn more, see Dashboard
Templates.
5.4.3 -
To view the current dashboard in full-screen mode:
Click the Settings (three dots) icon for the dashboard, and select
the Full Screen option:
Dashboards cannot be configured in full-screen mode. They are read-only
until the full-screen mode is exited.
To exit full-screen mode, either press the ESC keyboard key or click the
Exit (cross) icon.
To resize an individual panel, move the mouse cursor over the bottom right corner of the panel until the diagonal resize cursor appears, then press and hold the left mouse button and move the cursor to increase or decrease the size of the panel. Save the changes by clicking the Save Layout link, or revert them by clicking the Revert Changes link.
To configure the size of every panel in the dashboard:
On the Dashboards tab, select the relevant dashboard from the
left-hand panel.
Click the Settings (three dots) icon for the dashboard.
Select Layout to open the drop-down menu.
Select the desired panel size.
If the new size is correct, click the Save Layout link.
Otherwise, select Revert Changes
.
Configuring this setting overrides all custom panel sizes.
Move Panels
To move a panel to a new position in the dashboard, move the mouse
cursor over the top of the panel, until the hand cursor appears. Press
and hold the left mouse button, and move the panel by moving the cursor
while pressing the button. The changes can be saved by clicking the
Save Layout
link, or reverted by clicking the Revert Changes
link.
5.4.4 -
Delete a Dashboard
The owner or the administrator of a shared dashboard can delete it. If
users duplicate that dashboard, they become the owner of the new one and
are allowed to freely delete it.
For information on access rights, see Access Levels in
Dashboard.
To delete an existing dashboard:
On the Dashboard tab, select the relevant dashboard from the
left-hand panel.
Click the Settings (three dots) icon for the dashboard.
Select Delete Dashboard.
Click the Yes, Delete the Dashboard button to confirm the
change.
5.5 -
Learn more about types, creating, and managing panels in the following
sections:
5.5.1 -
Create a New Panel
Sysdig Monitor supports both form-based and PromQL-based queries. You
simply run a query and Sysdig Monitor builds a Dashboard that you can
customize according to your preferences.
To create a new panel, you can do one of the following:
Create a new dashboard.
When you create a new dashboard, it opens to a pre-built panel. You
can run a new query and build the dashboard.
Use a dashboard template.
Dashboard templates are essentially immutable dashboards that can’t
be edited, and the scope is fixed. You can copy them and customize
as desired. See Dashboard
Templates.
Add a new panel to an existing dashboard.
For a PromQL panel, use the Translate to PromQL option.
To create a new panel:
On the Dashboard tab, select the relevant dashboard from the
drop-down.
Click the Add Panel icon.
The default panel editor opens up.
Set up the panel:
Build either a form-based query or a PromQL-based query.
Define right and left Y-axes.
Define the legend.
Specify a unique title and a brief description for the panel, and enter a custom message to report no data.
Click Save to save the changes.
Each type of visualization has different settings and the query fields
are determined by the type. For demonstration purposes, this topic
explains the steps to create a Line chart.
On the Dashboards tab, click Add dashboard.
Clicking the (+) icon opens a default panel editor.
Select a visualization type. To do so, click the Timechart tab.
For more information on types of visualization, see Types of
Panels.
Select the appropriate time
presets from the time
navigation.
Select a metric from the drop-down as follows:
You can either scroll down or type the first few letters of the
metrics. As you enter the first few letters the drop-down lists the
matching entries.
Specify Time Aggregation and Group Rollup.
Specify an appropriate segmentation:
You can enter the number of entities and the order in which they are
displayed in the legend.
Not applicable to Number panels.
Specify the display text in the Display field.
The text appears as a title for the legend:
(optional) Specify the scope for the panel you are creating.
You can either choose to inherit the dashboard scope as it is or
apply the scope to one or all the queries.
Specify the unit of scale and the display format for Y-Axis.
This option is currently available only for Timeseries panels.
Determine how to display null data on the dashboard.
You can display no data as a gap, a zero value, a dotted line, or a solid line in the graph. See Display Missing Data.
Optionally, compare the data against historical data.
When segmentation is applied, comparing metrics against historical
data is not supported.
Building a PromQL Query
To run a PromQL query:
Do one of the following:
Click the PromQL button.
The PromQL panel appears. Enter the query in the PromQL field:
In this example, the rate of memory heaps released in bytes in an
interval of 5 minutes is calculated and then the total rate is
calculated in each Kubernetes cluster.
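A minimal sketch of such a query, assuming an illustrative Go runtime counter name (the 5-minute window matches the description above):
sum by (kubernetes_cluster_name) (rate(go_memstats_heap_released_bytes_total[5m]))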
Select the desired time
window.
Specify a descriptive title for the legend and a name for the time
series.
You can specify a variable as shown in the image. The variable name
is replaced with the Kubernetes cluster names in the legend.
Specify the unit for incoming data and how it should be displayed.
For example, you can specify the incoming data to be gathered in
kilobytes and displayed as megabytes.
Also, determine the location of the Y-Axis on the graph. When you
have additional queries, the flexibility to place an additional
Y-axis on the graph comes in handy.
Determine how to display null data on the dashboard.
You can display no data as a gap, a zero value, a dotted line, or a
solid line in the graph. See Display Missing
Data.
Click Save to save the changes.
5.5.2 -
Types of Panels
This topic introduces you to the types of panels in the New Dashboard.
5.5.2.1 -
Timechart Panel
A Timechart is a graph produced by applying statistical aggregation to a
label over an interval. The X-axis of a timechart will always be time.
Timecharts allow you to see the change in metric value over time. The
amount of data visualized on a graph depends on the time range selected within the Dashboard. You can aggregate metrics from multiple
sources into a single line, or graph a line per combination of segment
labels.
Time aggregation: For example, the average value
of cpu.used.percent
metric is computed for each entity over 1 hour at
1-minute intervals.
Group Rollup: For each host.hostName
the values from time
aggregation are averaged over the scope and the top 10 segments are
shown on the chart.
The only supported panel type now in time series is the Line chart.
Line Chart
The Line panel shows change over time in a selected window. Time is
plotted on the horizontal axis and the change that is measured is
plotted on the vertical axis.
The image below shows the trend of resource consumption of top
resource-hogging hosts in the last one hour.

For information on configuring a chart, see Create a New
Panel.
Stacked Area
An area chart is distinguished from a line chart by the addition of
shading between lines.

For information on configuring a chart, see Create a New
Panel.
5.5.2.2 -
Number Panel
Number panels allow you to view a single value for a given entity, along
with optionally comparing the current value to historical values. Use
the Number panel when the number is the most important aspect of the
metric you’re trying to display, such as unique visitors to a website.
Do not use this panel to see a trend; rather, use it when you need to see the average of a value over the given time range. This is also useful for counting entities, such as the number of nodes in a cluster.
For information on configuring a panel, see Create a New
Panel.
Major Features
The default preset for the Number visualization is 1 hour.
The global default values for the threshold are overridable. The new
value can be reset back to the global default.
A comparison between two threshold values determines color-coding
directions.
The Compare To functionality can be toggled between enabled and
disabled.

When the Compare To value is set, the preview is updated
accordingly showing the comparison value and an arrow denoting the
metric has increased or decreased.
The unit displayed for Thresholds is determined by the query.

5.5.2.3 -
Table Panel
The Table panel displays metric data in tabular form. In this view, you
can review metric values and their associated labels in a single view.
Use Table panels for quantitative analysis where you can see actual values instead of visual representations. Similar to a spreadsheet, you
can look at a combination of metric values and their segments. This is
useful when you don’t necessarily care about the change in metric over
time, or want to run reports to download as CSV/JSON for offline
analysis.
The panel displays the value returned by the metric query specified in
the Query tab. The value is determined by the data source and the
query. Each datapoint will have an associated raw value and an option to add columns for additional metric values.

Major features include, but are not limited to:
Queries
The first query you build cannot be removed.
When subsequent queries are built, you can remove any of them except the first one.
Changing the unit of the query changes the unit in the table as well.
Changing the display format on the query is reflected in the row values.
Segmentation
Scope
- The selected scope determines the values displayed on the table.
Metric / Labels Columns
Sorting
Column sorting is based on the selected column header and the
type of sorting (ascending and descending).
When another column is sorted, the table is resorted by that
column, resetting the previous sorting.
Resizing
Grab a column header by its border to resize the column.
Resizing the browser window does not reset columns that you have already resized.
When the browser window is resized, table columns are resized to cover the full width. An exception is when you have already resized columns; in that case, only the columns you have not resized are adjusted.
The last column in the table is not resizable.
Export
The table by default shows a maximum of 50 rows.
Clicking on Export all results… below the table opens the
Export Data window.
Export data in either JSON or CSV format to a file. The default name of the file is the panel name. You can rename the default filename.
For information on configuring a chart, see Create a New
Panel.
5.5.2.4 -
Text
The example below uses a text panel as a reminder list of the testing
steps for a procedure.

Text Panel Markdown
# H1
## H2
### H3
#### H4
##### H5
###### H6
H1
======
H2
------
Emphasis
*italics* or _italics_
**bold** or __bold__
**combined _emphasis_**
~~strikethrough~~
Lists
1. First ordered list item
2. Second item
* Unordered sub-list.
Sub-paragraph within the list item.
1. Third item
8. First ordered sub-list item.
103. Fourth item
General guidelines:
The list item number does not matter. As shown in the example above, the formatting defines the lists.
List items can contain properly indented paragraphs, using white space.
Unordered lists can use *, -, or +.
Linebreaks
This is the first sentence.
This line is separated from the one above by two newlines, so it will be a *separate paragraph*.
This line is also a separate paragraph.
This line is only separated by a single newline, so it's a separate line in the *same paragraph*.
Trailing spaces can be used for line-breaks without creating a new paragraph. This behavior is contrary to the typical GFM line break behavior, where trailing spaces are not required.
5.5.2.5 -
Toplist
A Toplist chart displays the specified number of entities, such as
containers, with the most or least of any metric value. This is useful
for “ranking” metric values in order, for example, considering hosts that have the highest number of pods running or the highest consumers of CPU or memory in your infrastructure.
Major Features
Toplist supports executing multiple queries.
Segmentation is supported for all queries.
Text displayed on the bars in the chart is based on queries and
segmentation.
If there is a single query without segmentation, the query name is displayed.
If there is a single query and multiple segmentations are selected, segmentation texts
separated by > sign are displayed.
If there are multiple queries, the query name is displayed on the bar.
Segmentation
You can use multiple objects to simultaneously segment a single metric.
For example, cpu.used.percent
segmented by kubernetes.cluster.name
,
kubernetes.namespace.name
, and kubernetes.deployment.name
.

In this example, deployments are sequentially listed in the order of
resource consumption. Use Display to toggle between descending
(Top) and ascending order (Bottom).
5.5.2.6 -
Histogram
Sysdig Monitor handles three types of Histograms:
Histogram panel type on the Dashboard: Histogram panels allow you to visualize the distribution of metric values for large data collections. You should select a segmentation, and optionally, the number of buckets.
Use Histogram for any metric, Sysdig native or custom, counter or gauge, segmented by a dimension/label. The histogram panel helps you understand values across different segments. For example, CPU usage percent by pods across your cluster gives you the aggregated value across the selected time.

Legacy Prometheus histogram collection: This implementation of legacy Prometheus Histograms is deprecated as of the SaaS 3.2.6 release.
To create a Histogram, use the Prometheus
integration to collect
histogram metrics and use the
PromQL panel with the
histogram_quantile
function.
Prometheus histograms (collected as raw metrics): The legacy
Prometheus histogram collection is replaced by the new Prometheus
histogram. You can natively collect histogram metrics, and for
visualization, use timechart:
For example, run the following query to build a timechart:
sum(histogram_metrics_bucket{kubernetes_cluster_name="prod"}) by (le)
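To chart a percentile instead of raw bucket counts, the same illustrative bucket metric can be combined with the histogram_quantile function mentioned above; a minimal sketch:
histogram_quantile(0.95, sum by (le) (rate(histogram_metrics_bucket{kubernetes_cluster_name="prod"}[5m])))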
5.5.3.1 -
Create Panel Alerts
Alerts can be created directly from a form-based panel in a New
Dashboard. If the panel has more than one query, you must select the
query to use as the base for the alert.
To create an alert:
Click the More Options (three dots) icon.
Select Create Alert.

Configure the
alert,
and click the Create button.
5.5.3.2 -
Export Panel Data
Table and Timechart panels in New Dashboard allow exporting data to a
CSV or JSON file. This file could serve as a backup of your data or for programmatic use.
You can export data using the following:
To export while creating or editing a Table panel:
Select Table from the Visualization type.

The panel opens to the Columns tab.
Below the table, click Export all results….
The Export Data window is displayed.

Select the format.
Specify a filename.
The default name of the file is the panel name. You can rename the
file that you are about to download.
Click Export to save the data into the file.
Exporting might take several minutes to complete.
5.5.3.3 -
Copy Panels to a Different Dashboard
Copy a Single Panel
To copy a single panel to a different dashboard:
From the Explore
tab, select the desired drill-down view.
Hover over the desired panel, select the Settings
(ellipsis) icon,
and select Copy Panel
.

Open the drop-down menu and select the desired dashboard, or use the
text-field to search through existing dashboards.

To copy the panel to a new dashboard, enter a name for the new
dashboard in the text-field instead.
Click the Copy and Open
button to save the changes and navigate to
the configured dashboard.
Copy All Panels
To copy all panels in a drill-down view to a dashboard:
From the Explore
tab, select the desired drill-down view.
Select the More Options
(three dots) icon.
Select Copy to Dashboard
:

Open the drop-down menu and select the desired dashboard, or use the
text-field to search through existing dashboards.
To copy the panel to a new dashboard, enter a name for the new
dashboard in the text-field instead.
Click the Copy and Open
button to save the changes and navigate to
the configured dashboard.
Create a Panel Alert
Alerts can be created directly from a dashboard panel:
Click the More Options
(three dots) icon.
Select Create Alert.
Configure the alert, and click the Create
button.
5.5.3.4 -
Duplicate a Panel
Hover over the desired panel, click the Settings
(ellipsis) icon, and
select Duplicate Panel
.
5.5.3.5 -
Delete an Existing Panel
To delete a panel from a dashboard:
Hover over the desired panel, click the Settings
(ellipsis) icon,
and select Delete Panel
.
Click the Yes, delete
panel button to confirm, or the Cancel
button to keep the panel.
5.6 -
Managing Dashboards
This section helps you effectively use dashboards and share them with
your team.
5.6.1 -
Dashboards Types
Dashboards are organized into the following main categories:
My Favorites: The dashboards marked as favorites by the current
user.
Shared By My Team: The dashboards created by other users in the team and
shared with the current user.
Featured: A curated list of Kubernetes dashboards.
My Dashboards: The dashboards created by the current user.
Dashboard Templates: Out-of-the-box templates that you can copy and
use. A dashboard created from a template inherits the template name.
5.6.2 -
Set a Default Dashboard
A default dashboard can be configured by setting the default entry point
for a team, unifying a team’s Sysdig Monitor experience, and allowing
users to focus their immediate attention on the most relevant
information for them. For more information on configuring a default
entry point, refer to the Configure an Entry Page or Dashboard for a
Team section of the Sysdig
Platform documentation.
5.6.3 -
Display Dashboard Specific Events
Sysdig Monitor allows users to configure dashboards to display
infrastructure events relevant to a dashboard’s panels within the panels
themselves. This allows users an even more in-depth view of the status
of their environment. To configure how events are displayed:
On the Dashboard tab, select the relevant dashboard from the
dashboard list.
Click the Dashboard Settings (three dots) icon and select
Events Display:
Enable the Show Events slider to show events in the dashboard
panels.
Configure the available parameters, and click the Close
button.
Option | Description |
---|
Filter | Defines specific events, or a scope of events, to display. |
Scope | Determines whether the range of events displayed includes those for dashboard scope or team scope. |
Severity | Determines whether only high severity events or all events are displayed. |
Event Type | Determines what types of events are displayed. The supported event types are alert, custom events, containers, and Kubernetes. |
Status | Determines the state of events displayed. The supported statuses are Triggered, Resolved, Acknowledged, and Un-acknowledged. |
5.6.4 -
Sharing New Dashboards
Dashboards can be shared internally among team members, with other
teams, within the wider organization, or publicly, by configuring a
public URL for the dashboard.
As an owner of a dashboard, you can share the dashboard with any team
and provide the Viewer or Collaborator access permission.
Access Levels in Dashboard
The RBAC-based permissions determine how users can interact with
Dashboards. They establish what capabilities are allowed or denied for a
user or a team. For more information on RBAC rules, see RBAC Rules for
Dashboards.
The table below summarizes the various ways a dashboard can be shared
and effective permissions for users.
Sharing Method | Who can share/copy | Dashboard Instance | Team/User who has access | Can Read | Can Edit |
---|
Share with current Team | Dashboard Creator | Same dashboard instance | Current team members only | All members of the team | Edit users of the team |
Share publicly as URL | Any Edit user of the team | Same dashboard instance | Anyone with the URL (does not have to be a Sysdig user) | Anyone | Anyone with the URL (does not have to be a Sysdig user), with scope variables |
Copy to My Teams | Any Edit user of the team | Duplicate copy of the dashboard | Current team members only | All members of the team | Edit users of the team |
Share a Dashboard with Teams
Dashboards can be shared across a user’s current team or a selected set
of teams, allowing other team members to view the dashboard, as well as
edit the panels if they have edit permissions within the team.
If a dashboard has been shared with another team, a user within that
team can then copy it to make it their own if they wish.
To share a dashboard:
Select the dashboard you want to share.
Click the Dashboard Settings (three dots) icon and select
Dashboard Settings.
In the Dashboard Settings page, use the Shared With
drop-down.

Select one of the three options:
Not Shared: If selected, the specified Dashboard is not shared with any team the owner is a member of.
All Teams: If selected, the owner of the Dashboard can share it with all the teams that they are part of.
Selected Teams: If selected, the owner of the Dashboard can share it with a selected list of teams. You can select one of the available teams from the drop-down, and select the member permission.
Enable Public Sharing
Dashboards can be shared outside of the internal team by using public
URLs. This allows external users to review the dashboard metrics while
restricting access to changing panels and configurations.
The scope parameters, including scope variables, are included in the
Dashboard URL. External users with a valid link can change the scope
parameters without having to sign in. They can edit either on the UI or
in the URL. The scope parameters are passed as standard request query parameters, consisting of a question mark, followed by the parameter name, an equal sign, and the parameter value. To edit a parameter in the URL, simply replace it with the desired one.
Select the dashboard you want to share.
Click the Dashboard Settings (three dots) icon and select
Dashboard Settings.

In the Dashboard Settings page, enable the Public
Sharing slider.

When enabled, the dashboard is visible with scope parameters to
anyone with the link. If this setting is disabled, the link will no
longer work, and the setting will need to be re-enabled and shared
again in order for the dashboard to be accessed.
Copy the public sharing URL for sharing.
5.6.4.1 -
RBAC Rules for Dashboards
The table below summarizes the role-based permissions.
Owner Permissions
User Roles | Administrator | A user owning a dashboard has three different team sharing options. For the last two options, the owner can pick the type of access: Collaborator (with edit rights) or View only. |
Regular User (non-administrator user) | | |
Team Roles | Advanced user | |
Standard user | | |
Team manager | | |
View-only user | Not applicable. | |
When a user decides to share a dashboard with a set of teams, they’ll only be able to pick teams that they are members of.
The table below summarizes what you can do with a shared dashboard.
User Permissions
User Role | Administrator | Edit. An admin can still edit a shared dashboard even if it's shared in view-only mode. | Edit |
Regular User (non-administrator user) | View Only | | |
Team Role | Advanced user | | |
Advanced user | | | |
Team manager | | | |
View-only user | View Only | | |
5.6.4.2 -
Transfer Dashboard Ownership
Dashboards have a single owner. Sysdig Monitor allows administrators and
dashboard owners with administrator permissions to transfer the
ownership of a dashboard within the UI.
There are several reasons for assigning a new owner to dashboards.
General Guidelines
When a user is deleted, any shared dashboards they own or have
created will be preserved by default.
The administrator can transfer only the dashboards that are shared
by other users. Private dashboards cannot be seen and therefore
cannot be transferred.
Transferring ownership can only happen one dashboard at a time.
When editing a user, the administrator can specify to transfer
dashboards to a new owner.
Before changing the dashboard ownership:
It is a good practice to ensure that the new owner is part of the teams the previous owner is part of. The administrator can preview the teams that the dashboard will no longer be shared with before confirming the transfer.
The new owner need not be part of any teams the previous owner was part of. In this case, the dashboard will be transferred to the new owner but will no longer be shared with any team. The dashboard will become a private dashboard.
A shared dashboard will be visible only to the teams that the new owner is part of.
Transfer Ownership as an Admin
Log in to the Monitor UI.
Select Settings > Users.
Select the user whose dashboard ownership you want to change.

Select one or more Dashboards that you want to assign to a new owner.
Click Transfer Ownership.
The Transfer Dashboard Ownership page is displayed.
Select a new user from the drop-down.
If the user that you selected is not part of the teams that the
Dashboard is shared with, you will see a prompt stating the
Dashboard will be unshared with the teams that the new owner is not
part of.
If you are satisfied with the changes, click Transfer.
Transfer Ownership as a User
On the Dashboards tab, select the relevant dashboard from the
left-hand panel.
Click the Settings (three dots) icon for the dashboard.
Select Transfer Ownership.

The Transfer Dashboard Ownership page is displayed.
Select a new user from the drop-down.
If everything looks ok, click Transfer.
The teams indicated with cross-out text are the ones that had access
to the dashboard earlier and will lose access to it after the
transfer.
The dashboard will also be visible to all the teams that the new
owner is part of. If you are not part of the teams that the new
owner is a member of, you will no longer have the visibility to the
dashboard.
5.7 -
Dashboard Templates
Sysdig provides a number of pre-defined dashboards to assist users in
monitoring their environments and applications. Dashboard templates are essentially immutable dashboards that can't be edited, and their scope is fixed. They are useful as-is for getting a quick overview of your infrastructure, but you can also copy and customize them as templates.
This section outlines the main dashboards that are available
out-of-the-box.
AWS CloudWatch
Dashboards | PromQL | Notes |
---|
ALB Overview | No | |
AWS ALB | Yes | |
AWS EBS | Yes | |
AWS ECS Fargate Overview | Yes | |
AWS ECS Fargate Service Detail | Yes | |
AWS ELB | Yes | |
AWS Lambda Function Detail | Yes | |
AWS Lambda Overview | Yes | |
AWS RDS | Yes | |
AWS S3 | Yes | |
AWS SQS | Yes | |
DynamoDB Overview | No | |
DynamoDB Overview By Operation | No | |
EC2 Overview | No | |
ECS Overview | No | |
ECS Projects | No | |
ECS Services | No | |
ECS Task Families | No | |
ECS Tasks | No | |
ELB Overview | No | |
ElastiCache Overview | No | |
RDS Overview | No | |
SQS Overview | No | |
AWS MetricsStream
Dashboards | PromQL | Notes |
---|
AWS ALB | Yes | |
AWS EBS | Yes | |
AWS ELB | Yes | |
AWS Fargate | Yes | |
AWS Lambda | Yes | |
AWS RDS | Yes | |
AWS S3 | Yes | |
AWS SQS | Yes | |
Applications
Dashboards | PromQL | Notes |
---|
ActiveMQ | No | |
Apache (legacy) | No | |
Apache App Overview | Yes | |
Apache CouchDB | No | |
Apache HBase | No | |
Apache Kafka | No | |
Apache ZooKeeper | No | |
Cassandra | Yes | |
Cassandra By Node | No | |
Cassandra Overview | No | |
Ceph | Yes | |
Consul | Yes | |
Consul | No | |
Consul Envoy | Yes | |
Couchbase | No | |
Docker Engine | Yes | |
ElasticSearch Cluster | Yes | |
ElasticSearch Infra | Yes | |
Elasticsearch | No | |
Fluentd | Yes | |
Fluentd | No | |
Gearman | No | |
Go | No | |
Go Internals | Yes | |
HAProxy | No | |
HAProxy Ingress Overview | Yes | |
HAProxy Ingress Service Details | Yes | |
HDFS | No | |
HTTP | No | |
Harbor | Yes | |
Istio 1.0 Overview | No | |
Istio 1.0 Service | No | |
Istio 1.5 Overview | No | |
Istio 1.5 Service | No | |
Istio v1.5 Service | Yes | |
Istio v1.5 Workload | Yes | |
JVM | No | |
Keda | Yes | |
Kyoto Tycoon | No | |
Memcached | No | |
Memcached | Yes | |
Microsoft SqlServer Overview | Yes | |
MongoDB (Server) | No | |
MongoDB Database Details | Yes | |
MongoDB Instance Health | Yes | |
MySQL | Yes | |
MySQL Server | No | |
NTP | Yes | |
Nginx | Yes | |
Nginx (legacy) | No | |
Nginx Ingress | Yes | |
OPA Gatekeeper | Yes | |
Oracle DB | Yes | |
PHP-FPM | No | |
Percona TokuMX | No | |
PgBouncer | No | |
Php-fpm | Yes | |
Portworx Cluster | Yes | |
Portworx Volumes | Yes | |
Postfix | No | |
PostgreSQL Database Details | Yes | |
PostgreSQL Instance Health | Yes | |
PostgreSQL Server | No | |
RabbitMQ | No | |
Rabbitmq Overview | Yes | |
Rabbitmq Usage | Yes | |
Redis | Yes | |
Redis (legacy) | No | |
Riak | No | |
Riak CS | No | |
Solr Cluster | No | |
Solr Host | No | |
Tomcat | No | |
Varnish | No | |
VoltDB | No | |
Windows Node Overview | Yes | |
Windows Node Overview (Legacy) | Yes | |
etcd | No | |
lighttpd | No | |
Compliance & Security
Dashboards | PromQL | Notes |
---|
Docker Compliance Report | No | |
Kubernetes Compliance Report (v1.4) | No | |
Security Summary | No | |
Containers
Dashboards | PromQL | Notes |
---|---|---|
Container CPU & Memory Limits | No | |
Container Disk Usage & Performance | No | |
Container Network | No | |
Container Resource Usage | No | |
Docker
Dashboards | PromQL | Notes |
---|---|---|
Overview | No | |
Projects | No | |
Services | No | |
Swarm Overview | No | |
Swarm Services | No | |
Swarm Tasks | No | |
Host Infrastructure
Dashboards | PromQL | Notes |
---|
File System Usage & Performance | No | |
Host Resource Usage | No | |
Memory Usage | No | |
Network | No | |
Sysdig Agent Health & Status | Yes | |
K8s Control Plane
Dashboards | PromQL | Notes |
---|---|---|
Kubernetes API Server | Yes | |
Kubernetes Controller Manager | Yes | |
Kubernetes CoreDNS | Yes | |
Kubernetes Etcd | Yes | |
Kubernetes Kubelet | Yes | |
Kubernetes Proxy | Yes | |
Kubernetes Scheduler | Yes | |
Kubernetes
Dashboards | PromQL | Notes |
---|---|---|
CPU Allocation Optimization | No | Deprecated |
Cluster / Namespace Available Resources | Yes | |
Cluster Capacity Planning | Yes | |
Cluster Overview | No | Deprecated |
Cluster and Node Capacity | No | Deprecated |
Container Resource Usage & Troubleshooting | Yes | |
DaemonSet Overview | No | Deprecated |
Deployment Overview | No | Deprecated |
Health Overview | No | Deprecated |
Horizontal Pod Autoscaler | Yes | |
Horizontal Pod Autoscaler (legacy) | No | Deprecated |
Job Overview | No | |
KSM Cluster / Namespace Available Resources | Yes | |
KSM Container Resource Usage & Troubleshooting | Yes | |
KSM Pod Status & Performance | Yes | |
KSM Workload Status & Performance | Yes | |
Kubernetes Jobs | Yes | |
Memory Allocation Optimization | No | Deprecated |
Namespace Overview | No | Deprecated |
Node Overview | No | Deprecated |
Node Status & Performance | Yes | |
PVC and Storage | Yes | |
Pod Overview | No | Deprecated |
Pod Rightsizing & Workload Capacity Optimization | Yes | |
Pod Scheduling Troubleshooting | Yes | |
Pod Status & Performance | Yes | |
ReplicaSet Overview | No | Deprecated |
Resource Quota | No | Deprecated |
Service Golden Signals | No | Deprecated |
Service Health | No | Deprecated |
StatefulSet Overview | No | Deprecated |
Workload Status & Performance | Yes | |
Workloads CPU Usage and Allocation | No | Deprecated |
Workloads Memory Usage and Allocation | No | Deprecated |
Marathon
Dashboards | PromQL | Notes |
---|---|---|
Applications | No | |
Groups | No | |
Master Node | No | |
Overview | No | |
Mesos
Dashboards | PromQL | Notes |
---|---|---|
Frameworks | No | |
Master Node | No | |
Overview | No | |
Slave Node | No | |
Tasks | No | |
OpenShift
Dashboards | PromQL | Notes |
---|---|---|
OpenShift HAProxy Ingress Overview | Yes | |
OpenShift HAProxy Ingress Service Details | Yes | |
OpenShift v3 API Server | Yes | |
OpenShift v3 Controller Manager | Yes | |
OpenShift v3 Kubelet | Yes | |
OpenShift v4 API Server | Yes | |
OpenShift v4 Controller Manager | Yes | |
OpenShift v4 CoreDNS | Yes | |
OpenShift v4 Etcd | Yes | |
OpenShift v4 Kubelet | Yes | |
OpenShift v4 Scheduler | Yes | |
Rancher
Dashboards | PromQL | Notes |
---|---|---|
Rancher API Server | Yes | |
Rancher Controller Manager | Yes | |
Rancher CoreDNS | Yes | |
Rancher Etcd | Yes | |
Rancher Kubelet | Yes | |
Rancher Proxy | Yes | |
Rancher Scheduler | Yes | |
Sysdig Secure
Dashboards | PromQL | Notes |
---|---|---|
Serverless Agents Fargate Usage | No | |
Sysdig Admission Controller | Yes | |
Troubleshooting
Dashboards | PromQL | Notes |
---|---|---|
MongoDB Troubleshooting | No | |
Network Connections Table | No | |
Process Resource Usage | No | |
SQL Troubleshooting | No | |
Top Processes | No | |
6 -
Alerts
Alerts are the responsive component of Sysdig Monitor. They notify you
when an event/issue occurs that requires attention. Events and issues
are identified based on changes in the metric values collected by Sysdig
Monitor. The Alerts module displays out-of-the-box alerts and a wizard
for creating and editing alerts as needed.
About Sysdig Alert
Sysdig Monitor can generate notifications based on certain conditions or
events you configure. Using the alert feature, you can keep tabs on
your infrastructure and find out about problems as they happen, or even
before they happen, based on the alert conditions you define. In Sysdig
Monitor, metrics serve as the central configuration artifact for alerts:
an alert ties one or more conditions or events on a metric to the
actions to take when a condition is met or an event happens. Alerts work across
Sysdig modules including Explore, Dashboard, Events, and Overview.
Alert Types
The types of alerts available in Sysdig Monitor:
Downtime: Monitor any
type of entity, such as a host, a container, or a process, and alert
when the entity goes down.
Metric: Monitor
time-series metrics, and alert if they violate user-defined
thresholds.
PromQL: Monitor
metrics through a PromQL query.
Event: Monitor
occurrences of specific events, and alert if the total number of
occurrences violates a threshold. Useful for alerting on container,
orchestration, and service events like restarts and unauthorized
access.
Anomaly Detection:
Monitor hosts based on their historical behaviors, and alert when
they deviate from the expected pattern.
Group Outlier: Monitor
a group of hosts and be notified when one acts differently from the
rest. Group Outlier Alert is supported only on hosts.
The following tools help with alert creation:
Alert Library: Sysdig
Monitor provides a set of alerts by default. Use it as it is or as a
template to create your own.
Sysdig
API:
Use Sysdig’s Python client to create, list, delete, update and
restore alerts. See
examples.
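For instance, a minimal sketch of creating a metric alert with the Python client might look like the following. The parameter names and the returned (success, result) pair follow the client's published create_alert examples; treat them as assumptions and verify against the examples linked above for your client version.
from sdcclient import SdcClient

# Authenticate with your Sysdig Monitor API token.
sdclient = SdcClient("YOUR-SYSDIG-MONITOR-API-TOKEN")

# Alert when average CPU usage stays above 80% for at least 10 minutes,
# segmented per host and limited to hosts tagged env = "production".
ok, res = sdclient.create_alert(
    name="High CPU on production hosts",
    description="CPU above 80% for 10 minutes",
    severity=4,                      # numeric severity; see the API docs for the exact mapping
    for_atleast_s=600,
    condition="avg(cpu.used.percent) > 80",
    segmentby=["host.hostName"],
    segment_condition="ANY",
    user_filter='agent.tag.env = "production"',
    enabled=True)

if not ok:
    print("Error creating alert: {}".format(res))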
Guidelines for Creating Alerts
Decide what to monitor | Determine what type of problem you want to be alerted on. See Alert Types to choose a type of problem. |
Define how it will be monitored | Specify exactly what behavior triggers a violation. For example, Marathon App is down on the Kubernetes Cluster named Production for ten minutes. |
Decide where to monitor | Narrow down your environment to receive fine-tuned results. Use Scope to choose an entity that you want to keep a close watch on. Specify additional segments (entities) to give context to the problem. For example, in addition to specifying a Kubernetes cluster, add a namespace and deployment to refine your scope. |
Define when to notify | Define the threshold and time window for assessing the alert condition. Single Alert fires an alert for your entire scope, while Multiple Alert fires if any or every segment breaches the threshold at once. Multiple Alerts include all the segments you specified, uniquely identifying the location and thus fully qualifying where the problem occurred. The higher the number of segments, the easier it is to uniquely identify the affected entities. A good analogy for multiple alerts is alerting on cities: a multiple alert triggered for San Francisco would also include the country it is part of (the USA) and the continent (North America). Trigger gives you control over how notifications are created. For example, you may want to receive a notification for every violation, or only a single notification for a series of consecutive violations. |
Decide how notifications are sent | Alert supports customizable notification channels, including email, mobile push notifications, OpsGenie, Slack, and more. To see supported services, see Set Up Notification Channels. |
To create alerts, simply:
Choose an alert
type.
Configure alert
parameters.
Configure the notification
channels you want to
use for alert notification.
Sysdig sometimes deprecates outdated metrics. Alerts that use these
metrics will not be modified or disabled, but will no longer be updated.
See Deprecated
Metrics and Labels.
Use the Alert wizard to create or edit alerts.
Open the Alert Wizard
There are multiple ways to access the Alert wizard:
From Explore
Do one of the following:

From Dashboards
Click the More Options (three dots) icon for a panel, and
select Create Alert.

From Alerts
Do one of the following:
From Overview
From the Events panel on the Overview screen, select a custom or an
Infrastructure type event. From the event description screen, click
Create Alert from Event.

Create an Alert
Configure notification
channels before you begin,
so the channels are available to assign to the alert. Optionally, you
can add custom subject and body information to individual alert
notifications.
Configuration differs slightly for each alert type; see the respective
pages to learn more. This section covers general instructions to help
you get acquainted with and navigate the Alerts user interface.
To configure an alert, open the Alert wizard and set the following
parameters:
Create the alert:
Type: Select the desired Alert
Types.

Each type has different parameters, but they follow the same
pattern:
Name: Specify a meaningful name that can uniquely
represent the Alert that you are creating. For example, the
entity that an alert targets, such as
Production Cluster Failed Scheduling pods
.
Group (optional): Specify a meaningful group name for
the alert you are creating. Group name helps you narrow down
the problem area and focus on the infrastructure view that
needs your attention. For example, you can enter Redis
for alerts related to Redis services. When the alert
triggers you will know which service in your workload
requires inspection. Alerts that have no group name will be
added to the Default Group. Group name is editable. Edit
the alert to do so.
An alert can belong to only one group. An alert created from
an alert template will have the group already configured by
the Monitor
Integrations.
You can see the existing alert groups on the Alerts
details page.

See
Groupings
for more information on how Sysdig handles infrastructure
views.
Description (optional): Briefly expand on the alert name
or alert condition to give additional context for the
recipient.
Priority: Select a priority: High, Medium, Low, or Info. You
can later sort alerts by severity using the top navigation pane.

Specify the parameters in the Define, Notify, and
Act sections.
(1) Define
To alert on multiple metrics using boolean logic, click Create
multi-condition alerts. See Multi-Condition
Alerts.

Scope: Everywhere, or a more limited scope to filter a specific
component of the infrastructure monitored, such as a Kubernetes
deployment, a Sysdig Agent, or a specific service.
Trigger: Boundaries for assessing the alert condition, and
whether to send a single alert or multiple alerts. Supported time
scales are minute, hour, or day.
Single alert: Single Alert fires an alert for your entire
scope.
Multiple alerts: Multiple Alert fires if any or every
segment breaches the threshold at once.
Multiple alerts are triggered for each segment you specify. The
specified segments will be represented in the alert notification. The
higher the number of segments, the easier it is to uniquely identify
the affected entities.
For detailed description, see respective sections on Alert Types.
(2) Notify
Notification Channel: Select from the configured
notification channels in the list. Supported channels are:
Email
Slack
Amazon SNS Topic
Opsgenie
Pagerduty
VictorOps
Webhook
You can view the list of notification channels configured for
each alert on the Alerts page.

Notification Options: Set the time interval at which
notifications should be re-sent if the problem remains unresolved.
Format Message: If applicable, add message format details.
See Customize
Notifications.
(3) Act
Click Create.
Optional: Customize Notifications
You can optionally customize individual notifications to provide context
for the errors that triggered the alert. All the notification channels
support this added contextual information and customization flexibility.
Modify the subject, body, or both of the alert notification with the
following:
Plaintext: A custom message stating the problem. For example,
Stalled Deployment.
Hyperlink: For example, URL to a Dashboard.
Dynamic Variable: For example, a hostname. Note the conventions:
All variables that you insert must be enclosed in double curly
braces, such as {{file_mount}}
.
Variables are case sensitive.
The variables should correspond to the segment values you
created the alert for. For example, if an alert is segmented
by host.hostName and container.name, the corresponding
variables will be {{host.hostName}} and {{container.name}}
respectively. In addition to these segment variables,
__alert_name__ and __alert_status__ are supported. No other
segment variables are allowed in the notification subject and
body.
Notification subjects will not show up on the Event feed.
Using a variable that is not a part of the segment will trigger
an error.
The segment variables used in an alert are replaced with the current
system values when the alert notification is sent.
The body of the notification message contains a Default Alert Template.
It is the default alert notification generated by Sysdig Monitor. You
may add free text, variables, or hyperlinks before and after the
template.
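For example, a customized subject and body for an alert segmented by host.hostName and container.name (as above) might look like the following; the runbook link is a hypothetical placeholder:
Subject: {{__alert_name__}} is {{__alert_status__}} on {{host.hostName}}
Body (custom text added before the default template):
Disk usage is high for container {{container.name}} on host {{host.hostName}}.
Runbook: https://wiki.example.com/runbooks/disk-usage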
You can send a customized alert notification to the following channels:
Email
Slack
Amazon SNS Topic
Opsgenie
Pagerduty
VictorOps
Webhook
Multi-Condition Alerts
Multi-condition alerts are advanced alerts whose thresholds are built
from complex conditions. You define the alert threshold as a custom
boolean expression that can involve multiple conditions. Click Create
multi-condition alerts to enable adding conditions as boolean
expressions.

These advanced alerts require specific syntax, as described in the
examples below.
Each condition has five parts:
Metric Name: Use
the exact metric names. To avoid typos, click the HELP
link to
access the drop-down list of available metrics. Selecting a metric
from the list will automatically add the name to the threshold
expression being edited.
Group Aggregation
(optional): If no group aggregation type is selected, the
appropriate default for the metric will be applied (either sum or
average). Group aggregation functions must be applied outside of
time aggregation functions.
Time Aggregation:
Historical data rolled up over a selected period of time.
Operator: Both logical and relational operators are supported.
Value: A static numerical value against which a condition is
evaluated.
The table below displays supported time aggregation functions, group
aggregation functions, and relational operators:
Time Aggregation Function | Group Aggregation Function | Relational Operator |
---|---|---|
timeAvg() | avg() | = |
min() | min() | < |
max() | max() | > |
sum() | sum() | <= |
| | >= |
| | != |
The format is:
condition1 AND condition2
condition1 OR condition2
NOT condition1
The order of operations can also be altered via parenthesis:
NOT (condition1 AND (condition2 OR condition3))
Conditions take the following form:
groupAggregation(timeAggregation(metric.name)) operator value
Example Expressions
Several examples of advanced alerts are given below:
timeAvg(cpu.used.percent) > 50 AND timeAvg(memory.used.percent) > 75
timeAvg(cpu.used.percent) > 50 OR timeAvg(memory.used.percent) > 75
timeAvg(container.count) != 10
min(min(cpu.used.percent)) <= 30 OR max(max(cpu.used.percent)) >= 60
sum(file.bytes.total) > 0 OR sum(net.bytes.total) > 0
timeAvg(cpu.used.percent) > 50 AND (timeAvg(mysql.net.connections) > 20 OR timeAvg(memory.used.percent) > 75)
6.1 -
Manage Alerts
Alerts can be managed individually, or as a group, by using the
checkboxes on the left side of the Alert UI and the customization bar.
The columns of the table can also be configured, to provide you with the
necessary data for your use cases.

Select a group of alerts and perform several batch operations, such as
filtering, deleting, enabling, disabling, or exporting to a JSON object.
Select individual alerts to perform tasks such as creating a copy for a
different team.
View Alert Details
The bell button next to an alert indicates that you have not resolved
the corresponding events. The Activity Over Last Two Weeks column
visually notifies you with an event chart showing the number of events
that were triggered over the last two weeks. The color of the event
chart indicates the severity level of those events.
To view alert details, click the corresponding alert row. The slider
with the alert details will appear. Click an individual event to Take
Action. You can do one of the following:
Acknowledge: Mark that the event has been acknowledged by the
intended recipient.
Create Silence from Event: If you no longer want to be notified,
use this option. You can choose the scope for alert
silence. When silenced,
alerts will still be triggered but will not send you any
notifications.
Explore: Use this option to troubleshoot by using the PromQL
Query.
The event feed will be empty and the Activity Over Last Two Weeks
column will have no event chart if no events were reported in the past
two weeks.
Enable/Disable Alerts
Alerts can be enabled or disabled using the slider or the customization
bar. You can perform these operations on a single alert or on multiple
alerts as a batch operation.
From the Alerts module, check the boxes beside the relevant alerts.
Click Enable Selected or Disable Selected as necessary.
Use the slider beside the alert to disable or enable individual alerts.

Edit an Existing Alert
To edit an existing alert:
Do one of the following:
Click the Edit button beside the alert.

Click an alert to open the detail view, then click Edit on
the top right corner

Edit the alert, and click Save to confirm the changes.
Copy an Alert
Alerts can be copied within the current team to allow for similar alerts
to be created quickly, or copied to a different team to share alerts.
Copy an Alert to the Current Team
To copy an alert within the current team:
Highlight the alert to be copied.
The detail view is displayed.

Click Copy.
The Copy Alert screen is displayed.
Select Current from the drop-down.
Click Copy and Open.
The alert opens in edit mode.
Make necessary changes and save the alert.
Copy an Alert to a Different Team
Highlight the alert to be copied.
The detail view is displayed.
Click Copy.
The Copy Alert screen is displayed.
Select the teams that the alert should be copied to.

Click Send Copy.
Search for an Alert
Search Using Strings
The Alerts table can be searched using partial or full strings. For
example, the search below displays only alerts that contain
kubernetes:

Filter Alerts
The alert feed can be filtered in multiple ways to drill down into the
environment’s history and refine the alerts displayed. The feed can be
filtered by severity or status. Examples of each are shown below.
The example below shows only high and medium severity:

The example below shows the alerts that are invalid:

Export Alerts as JSON
A JSON file can be exported to a local machine, containing JSON snippets
for each selected alert:
Click the checkboxes beside the relevant alerts to be exported.
Click Export JSON.

Delete Alerts
Open the Alert page and use one of the following methods to delete
alerts:
Hover on a specific alert and click Delete.

Hover on one or more alerts, click the checkbox, then click
Delete on the bulk-action toolbar.

Click an alert to see the detailed view, then click Delete on
the top right corner.

6.2 -
Silence Alert Notifications
Sysdig Monitor allows you to silence alerts for a given scope for a
predefined amount of time. When silenced, alerts will still be triggered
but will not send any notifications. You can schedule silencing in
advance. This helps administrators to temporarily mute notifications
during planned downtime or maintenance and send downtime notifications
to selected channels.
With an active silence, the only notifications you will receive are
those indicating the start time and the end time of the silence. All
other notifications for events from that scope will be silenced. When a
silence is active, an alert created during the silence will still be
triggered, but no notification will be sent. Additionally, the triggered
event will state that the alert is silenced.
See Working with Alert
APIs for programmatically
silencing alert notifications.
When you create a new silence, it is by default enabled and scheduled.
When the start time arrives for a scheduled silence, it becomes active
and the list shows the time remaining. When the end time arrives, the
silence becomes completed and cannot be enabled again.
To configure a silence:
Click Alerts on the left navigation on the Monitor UI.
Click the Silence tab.
The page shows the list of all the existing silences.
Click Set a Silence.
The Silence for Scope window is displayed.

Specify the following:
Scope: Specify the entity you want to apply the scope as.
For example, a particular workload or namespace, from
environments that may include thousands of entities.
Begins: Specify one of the following: Today,
Tomorrow, Pick Another Day. Select the time from the
drop-down.
Duration: Specify how long notifications should be
suppressed.
Name: Specify a name to identify the silence.
Notify: Select a channel you want to notify about the
silence.
Click Save.
Silence Alert Notifications from Event Feed
You can also create and edit silences and view silenced alert events on
the Events feeds across the Monitor UI. When you create a silence, the
alert will still be triggered and posted on the Events feed and in the
graph overlays but will indicate that the alert has been silenced.
If you have an alert with no notification channel configured, events
generated from that alert won’t be marked as silenced. They won’t be
visually represented in the events feed as well with the crossed bell
icon and the option to silence events.
To do so,
On the event feed, select the alert event that you want to silence.
On the event details slider, click Take Action.

Click Create Silence from Event.
The Silence for Scope window is displayed.
Continue configuring the silence as described in step 4 above.
Manage Silences
Silences can be managed individually, or as a group, by using the
checkboxes on the left side of the Silence UI and the customization
bar. Select a group of silences and perform batch delete operations.
Select individual silences to perform tasks such as enabling, disabling,
duplicating, and editing.
Change States
You can enable or disable a silence by sliding the state bar next to the
silences. There are two kinds of silences that will show as enabled:
active (a running silence whose start date is in the past but whose end
date has not yet arrived) and scheduled (a silence that will start in
the future). A clock icon visually represents an active silence.

Completed silences cannot be re-enabled once the silenced period is
finished. However, you can duplicate a completed silence with all its
data; you only need to set a new silencing period.
A silence can be disabled only when:
Filter Silences
Use the search bar to filter silences. You can either perform a simple
auto-complete text search or use the categories. The feed can be
filtered by the following categories: Active, Scheduled,
Completed.
For example, the following shows the completed silences that start with
“cl”.

Duplicate a Silence
Do one of the following to duplicate a silence:
Click the Duplicate button that appears when you hover over the row.

Click the row to open the Silence for Scope window. Make any
necessary changes and click Duplicate.
Edit Silence
You can edit scheduled silences. For the active ones, you can only
extend the time. You cannot edit completed silences.
To edit a silence, do one of the following:
Click the row to open the Silence for Scope window. Make the
necessary changes and click Update.
Click the Edit button that appears when you hover over the row. The
Silence for Scope window will be displayed.

Make necessary changes and click Update.
Extend the Time Duration
For active silences, you can extend the duration by one of the
following:
1 Hour
2 Hours
6 Hours
12 Hours
24 Hours
To do so, click the extend time duration button on the menu and
choose the duration. You can also extend the time of an active silence
from the Silence for Scope window.

Extending the time duration will notify the configured notification
channels that the downtime is extended. You can also extend the time
from a Slack notification of a silence by clicking the given link. It
opens the Silence for Scope window of the running silence where you
can make necessary adjustments.
You cannot extend the duration of completed silences.
6.3 -
Alerts Library
To help you get started quickly, Sysdig provides a set of curated alert
templates called Alerts Library. Powered by Monitor Integrations
, Sysdig automatically
detects the applications and services running in your environment and
recommends alerts that you can enable.
Two types of alert templates are included in Alerts Library:

Recommended: Alert suggestions based on the services that are
detected running in your infrastructure.
All templates: You can browse templates for all the services.
For some templates, you might need to configure Monitor
Integrations.
Access Alerts Library
Log in to Sysdig Monitor.
Click Alerts from the left navigation pane.
On the Alerts tab, click Library.

Import an Alert
Locate the service that you want to configure an alert for.
To do so, either use the text search or identify from a list of
services.
For example, click Redis.

Eight template suggestions are displayed for 14 Redis services
running on the environment.
From a list of template suggestions, choose the desired template.
The Redis page shows the alerts that are already in use and that you
can enable.
Enable one or more alert templates. To do so, you can do one of the
following:
Click Enable Alert.
Bulk enable templates. Select the check box corresponding to the
alert templates and click Enable Alert on the top-right
corner.
Click on the alert template to display the slider. Click
Enable Alert on the slider.
On the Configure Redis Alert page, specify the Scope and select
the Notification channels.

Click Enable Alert.
You will see a message stating that the Redis Alert has been
successfully created.
Use Alerts Library
In addition to importing an alert, you can also do the following with
the Alerts Library:
Identify Alert templates associated with the services running in
your infrastructure.

Bulk import Alert templates. See Import an
Alert.
View alerts that are already configured.
Filter Alert templates. Enter the search string to display the
matching results.

Discover the workloads where a service is running. To do so, click
on the Alert template to display the slider. On the slider, click
Workloads.
View the alerts in use. To do so, click on an Alert template to
display the slider. On the slider, click Alerts in use.

Configure an alert.
Additional alert configuration, such as changing the alert name,
description, and severity can be done after the import.
6.4 -
Downtime Alert
Sysdig Monitor continuously surveils any type of entity in your
infrastructure, such as a host, a container, a process, or a service,
and sends notifications when the monitored entity is not available or
responding. Downtime alert focuses mainly on unscheduled downtime of
your infrastructure.

In this example, a Kubernetes cluster is monitored and the alert is
segmented on both cluster and namespace. When a Kubernetes cluster in
the selected availability zone goes down, notifications will be sent
with necessary information on both cluster and affected namespace.
The lines shown in the preview chart represent the values for the
segments selected to monitor. The popup is a color-coded legend to show
which segment (or combination of segments if there is more than one) the
lines represent. You can also deselect some segment lines to prevent
them from showing in the chart. Note that there is a limit of 10 lines
that Sysdig Monitor ever shows in the preview chart. For downtime
alerts, segments are actually what you select for the “Select entity
to monitor” option.
Define a Downtime Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The
Priority—High, Medium, Low, and Info—are reflected
in the Alert list, where you can sort by the severity of the Alert.
You can use severity as a criterion when creating alerts, for
example: if there are more than 10 high severity events, notify.
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have the
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Entity
Select an entity whose downtime you want to monitor.
In this example, you are monitoring the unscheduled downtime of a
host.
Specify additional segments:

The specified entities are segmented on and notified with the
default notification template as well as on the Preview. In this
example, data is segmented on Kubernetes cluster name and namespace
name. When a cluster is affected, the notification will not only
include the affected cluster details but also the associated
namespaces.
Filter the environment on which this alert will apply. An alert will
fire when a host goes down in the availability zone, us-east-1b.

Use in or contain operators to match multiple different possible
values to apply scope.
The contain and not contain operators help you retrieve values
if you know part of the values. For example, us retrieves values
that contain strings that start with “us”, such as “us-east-1b”,
“us-west-2b”, and so on.
The in and not in operators help you filter multiple values.
You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 10 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, if you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 5 minutes in the selected
time window of 10 minutes.
Use Cases
Your e-commerce website is down during the peak hours of Black
Friday, Christmas, or New Year season.
Production servers of your data center experience a critical outage
MySQL database is unreachable
File upload does not work on your marketing website.
6.5 -
PromQL Alerts
Sysdig Monitor enables you to use PromQL to define metric expressions
that you can alert on. You define the alert conditions using the
PromQL-based metric expression. This way, you can combine different
metrics and warn on cases like service-level agreement breach, running
out of disk space in a day, and so on.
Examples
For PromQL alerts, you can use any metric that is available in PromQL,
including Sysdig native
metrics. For more details
see the various integrations available on
promcat.io.
Low Disk Space Alert
Warn if disk space is predicted to fall below a specified quantity. For
example, alert if free disk space will drop below 10 GB within the next
24 hours:
predict_linear(sysdig_fs_free_bytes{fstype!~"tmpfs"}[1h], 24*3600) < 10000000000
Slow Etcd Requests
Notify if etcd
requests are slow. This example uses the
promcat.io integration.
histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
High Heap Usage
Warn when the heap usage in ElasticSearch is more than 80%. This example
uses the promcat.io
integration.
(elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
Guidelines
Sysdig Monitor does not currently support the following:
Interacting with the Prometheus Alertmanager or importing Alertmanager
configuration.
Providing the ability to use, copy, paste, and import predefined alert
rules.
Converting alert rules to map to the Sysdig alert editor.
Create a PromQL Alert
Set a meaningful name and description that help recipients easily
identify the alert.
Set a Priority
Select a priority for the alert that you are creating. The supported
priorities are High, Medium, Low, and Info. You can also
view events in the Dashboard and Explore UI and sort
them by severity.
Define a PromQL Alert
PromQL: Enter a valid PromQL expression. The query will be executed
every minute. However, the alert will be triggered only if the query
returns data for the specified duration.

In this example, you will be alerted when the rate of HTTP requests has
doubled over the last 5 minutes.
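A sketch of such an expression, assuming a Prometheus-style request counter named http_requests_total is available in your environment (substitute whatever request counter your services actually expose), could be:
sum(rate(http_requests_total[5m])) > 2 * sum(rate(http_requests_total[5m] offset 5m))
The left-hand side is the request rate over the last 5 minutes; the right-hand side is twice the rate over the 5 minutes before that, so the condition holds when the rate has at least doubled.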
Duration: Specify the time window for evaluating the alert condition
in minutes, hours, or days. The alert will be triggered if the query
returns data for the specified duration.
Define Notification
Notification Channels: Select from the configured notification
channels in the list.
Re-notification Options: Set the time interval at which multiple
alerts should be sent if the problem remains unresolved.
Notification Message & Events: Enter a subject and body. Optionally,
you can choose an existing template for the body. Modify the subject,
body, or both for the alert notification with a hyperlink, plain text,
or dynamic variables.
Import Prometheus Alert Rules
Sysdig Alert allows you to import Prometheus rules or create new rules
on the fly and add them to the existing list of alerts. Click the
Upload Prometheus Rules option and enter the rules as YAML in the
Upload Prometheus Rules YAML editor. Importing your Prometheus alert
rules will convert them to PromQL-based Sysdig alerts. Ensure that the
alert rules are valid YAML.

You can upload one or more alert rules in a single YAML and create
multiple alerts simultaneously.

Once the rules are imported to Sysdig Monitor, the alert list will be
automatically sorted by last modified date.

Besides the pre-populated template, each rule specified in the Upload
Prometheus Rules YAML editor requires the following fields:
See the following examples to understand the format of Prometheus Rules
YAML. Ensure that the alert rules are valid YAML to pass validation.
Example: Alert Prometheus Crash Looping
To alert on potential Prometheus crash looping, create a rule that
fires when Prometheus restarts more than twice in the last 10 minutes.
groups:
- name: crashlooping
rules:
- alert: PrometheusTooManyRestarts
expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[10m]) > 2
for: 0m
labels:
severity: warning
annotations:
summary: Prometheus too many restarts (instance {{ $labels.instance }})
description: Prometheus has restarted more than twice in the last 10 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n
Example: Alert HTTP Error Rate
To alert on HTTP requests with status 5xx (> 5%) or on high latency:
groups:
- name: default
rules:
# Paste your rules here
- alert: NginxHighHttp5xxErrorRate
expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
for: 1m
labels:
severity: critical
annotations:
summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
description: Too many HTTP requests with status 5xx
- alert: NginxLatencyHigh
expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node)) > 3
for: 2m
labels:
severity: warning
annotations:
summary: Nginx latency high (instance {{ $labels.instance }})
description: Nginx p99 latency is higher than 3 seconds
Learn More
6.6 -
Metric Alerts
Sysdig Monitor keeps a watch on time-series metrics and alerts if they
violate user-defined thresholds.

The lines shown in the preview chart represent the values for the
segments selected to monitor. The popup is a color-coded legend to show
which segment (or combination of segments if there is more than one) the
lines represent. You can also deselect some segment lines to prevent
them from showing in the chart. Note that there is a limit of 10 lines
that Sysdig Monitor ever shows in the preview chart.
Defining a Metric Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have the
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Metrics
Select a metric that this alert will monitor. You can also define how
data is aggregated, such
as avg, max, min or sum. To alert on multiple metrics using boolean
logic, switch to multi-condition
alert.
Filter the environment on which this alert will apply. An alert will
fire when a host goes down in the availability zone, us-east-1b.

Use advanced operators to include, exclude, or pattern-match groups,
tags, and entities. See Multi-Condition
Alerts.
You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Define the threshold and time window for assessing the alert condition.
Single Alert fires an alert for your entire scope, while Multiple Alert
fires if any or every segment breaches the threshold at once.
Metric alerts can be triggered to notify you of different aggregations:
on average | The average of the retrieved metric values across the time period. The actual number of samples retrieved is used to calculate the value. For example, if new data is retrieved in the 7th minute of a 10-minute sample window and the alert is defined as on average, the alert will be calculated by summing the 3 recorded values and dividing by 3. |
as a rate | The average value of the metric across the time period evaluated. The expected number of values is used to calculate the rate that triggers the alert. For example, if new data is retrieved in the 7th minute of a 10-minute sample window and the alert is defined as as a rate, the alert will be calculated by summing the 3 recorded values and dividing by 10 (10 x 1-minute samples). |
in sum | The combined sum of the metric across the time period evaluated. |
at least once | The trigger value is met for at least one sample in the evaluated period. |
for the entire time | The trigger value is met for every sample in the evaluated period. |
as a rate of change | The trigger value is met by the change in value over the evaluated period. |
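To make the difference concrete: if only 3 one-minute samples with values 30, 60, and 90 arrive during a 10-minute window, on average evaluates (30 + 60 + 90) / 3 = 60, while as a rate evaluates (30 + 60 + 90) / 10 = 18, because the expected number of samples (10) is used as the divisor.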
For example, if the file system used percentage goes above 75 for the
last 5 minutes on average, multiple alerts will be triggered. The MAC
address of the host and the mount directory of the file system will be
represented in the alert notification.

Use Cases
6.7 -
Event Alerts
Monitor occurrences of specific events, and alert if the total number of
occurrences violates a threshold. Useful for alerting on container,
orchestration, and service events like restarts and deployments.
Alerts on events support only one segmentation label. An alert is
generated for each segment.

Defining an Event Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert.
Severity: Set a severity level for your alert. The priorities
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Source Tag: Supported source tags are Kubernetes,
Docker, and Containerd.
Trigger: Specify the trigger condition in terms of the number of
events for a given duration.
Event alerts support only one segmentation label. If you choose
Multiple Alerts, Sysdig generates only one alert for a selected
segment.
Specify Event
Specify the name, tag, or description of an event.

Specify a Source Tag.
Filter the environment on which this alert will apply. Use advanced
operators to include, exclude, or pattern-match groups, tags, and
entities. You can also create alerts directly from Explore and
Dashboards for automatically populating this scope.

In this example, failing a liveness probe in the
agent-process-whitelist-cluster cluster triggers an alert.
Define the threshold and time window for assessing the alert condition.
Single Alert fires an alert for your entire scope, while Multiple Alert
fires if any or every segment breaches the threshold at once.

If the number of events triggered in the monitored entity is greater
than 5 for the last 10 minutes, recipients will be notified through the
selected channel.
6.8 -
Anomaly Detection Alerts
An anomaly is an outlier in a given data set polled from an
environment: a deviation from an established pattern. Anomaly
detection is about identifying these anomalous observations, whether
they appear as a set of data points collectively, a single instance of
data, or context-specific abnormalities. Examples include unauthorized
copying of a directory from a container, or unusually high CPU or
memory consumption.

Define an Anomaly Detection Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert
Severity: Set a severity level for your alert. The priorities
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Specify multiple segments: Selecting a single segment might not
always supply enough information to troubleshoot. Enrich the
selected entity with related information by adding additional
related segments. Enter hierarchical entities so you have the
top-down picture of what went wrong and where. For example,
specifying a Kubernetes Cluster alone does not provide the context
necessary to troubleshoot. In order to narrow down the issue, add
further contextual information, such as Kubernetes Namespace,
Kubernetes Deployment, and so on.
Specify Entity
Select one or more metrics whose behavior you want to monitor.
Filter the environment on which this alert will apply. An alert will
fire when the value returned by one of the selected metrics does not
follow the pattern in the availability zone, us-east-1b.

You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Trigger gives you control over how notifications are created and helps
prevent flooding your notification channel with notifications. For
example, you may want to receive a notification for every violation, or
only want a single notification for a series of consecutive violations.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 5 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, if you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 2.5 minutes in the
selected time window of 5 minutes.
6.9 -
Group Outlier Alerts
Sysdig Monitor observes a group of hosts and notifies you when one acts
differently from the rest.

Define a Group Outlier Alert
Guidelines
Set a unique name and description: Set a meaningful name and
description that help recipients easily identify the alert
Severity: Set a severity level for your alert. The priorities
High, Medium, Low, and Info are reflected in the Alert list,
where you can sort by severity using the top navigation pane.
You can use severity as a criterion when creating events and alerts,
for example: if there are more than 10 high severity events, notify.
Specify Entity
Select one or more metrics whose behavior you want to monitor.
Filter the environment on which this alert will apply. An alert will
fire when the value returned by one of the selected metrics does not
follow the pattern in the availability zone, us-east-1b.

You can also create alerts directly from Explore and Dashboards for
automatically populating this scope.
Trigger gives you control over how notifications are created and helps
prevent flooding your notification channel with notifications. For
example, you may want to receive a notification for every violation, or
only want a single notification for a series of consecutive violations.
Define the threshold and time window for assessing the alert condition.
Supported time scales are minute, hour, or day.

If the monitored host or Kubernetes cluster is not available or not
responding for the last 5 minutes, recipients will be notified.
You can set any value for % and a value greater than 1 for the time
window. For example, if you choose 50% instead of 100%, a notification
will be triggered when the entity is down for 2.5 minutes in the
selected time window of 5 minutes.
Use Cases
Load balancer servers have uneven workloads
Changes in applications or instances deployed in different
availability zones.
Network hogging hosts in a cluster
7 -
Events
The Sysdig Monitor Events module displays a comprehensive and unified
list of events, both monitoring and security, that have occurred within
the environment, as a live events feed. The feed displays events created
by triggered alerts, pulled from infrastructure services, initiated by
Sysdig Security such as policy and image scanning, or defined by users,
and allows users to review, track, and resolve issues. Each event is
enriched with metadata, and the full set of relationships within the
monitored system is built up when you search for events. With a unified
Event stream, Sysdig Monitor eliminates the need for standalone tools
for security and monitoring alerts.
Learn more about Sysdig Monitor Events in the following sections:
7.1 -
Event Types
There are three primary types of events displayed in the Sysdig Monitor
Events feed: alert events, infrastructure events, and custom events.
Note that image scanning and security events are displayed in the
Sysdig Secure interface.
Alert Events
Alert events are triggered by user-configured alerts. For more
information on configuring alerts, refer to the Sysdig Monitor
Alerts documentation.
Infrastructure Events
Events can be collected from supported services within the production
environment. The Sysdig agent automatically discovers these services and
is configured to collect event data for a select group of events by
default. Additional events can be added to the list by configuring the
dragent.yaml
file.
Sysdig currently supports event monitoring for the following
infrastructure services: Docker and Kubernetes.
Events marked with * are enabled by default. For more information on
configuring additional infrastructure events, refer to
Enable/Disable Event Data.
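For example, to have the agent collect additional Docker container events beyond the defaults, you could add an events section to dragent.yaml. The following is a minimal sketch based on the agent documentation's events configuration format; confirm the exact keys for your agent version:
events:
  docker:
    container:
      - oom
      - kill
      - die
      - restart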
Docker Events
The following Docker events are supported.
docker:
container:
- attach # Container Attached (information)
- commit # Container Committed (information)
- copy # Container Copied (information)
- create # Container Created (information)
- destroy # Container Destroyed (warning)
- die # Container Died (warning)
- exec_create # Container Exec Created (information)
- exec_start # Container Exec Started (information)
- export # Container Exported (information)
- kill # Container Killed (warning)*
- oom # Container Out of Memory (warning)*
- pause # Container Paused (information)
- rename # Container Renamed (information)
- resize # Container Resized (information)
- restart # Container Restarted (warning)
- start # Container Started (information)
- stop # Container Stopped (information)
- top # Container Top (information)
- unpause # Container Unpaused (information)
- update # Container Updated (information)
image:
- delete # Image Deleted (information)
- import # Image Imported (information)
- pull # Image Pulled (information)
- push # Image Pushed (information)
- tag # Image Tagged (information)
- untag # Image Untagged (information)
volume:
- create # Volume Created (information)
- mount # Volume Mounted (information)
- unmount # Volume Unmounted (information)
- destroy # Volume Destroyed (information)
network:
- create # Network Created (information)
- connect # Network Connected (information)
- disconnect # Network Disconnected (information)
- destroy # Network Destroyed (information)
Kubernetes Events
The following Kubernetes events are supported.
kubernetes:
node:
- TerminatedAllPods # Terminated All Pods (information)
- RegisteredNode # Node Registered (information)*
- RemovingNode # Removing Node (information)*
- DeletingNode # Deleting Node (information)*
- DeletingAllPods # Deleting All Pods (information)
- TerminatingEvictedPod # Terminating Evicted Pod (information)*
- NodeReady # Node Ready (information)*
- NodeNotReady # Node not Ready (information)*
- NodeSchedulable # Node is Schedulable (information)*
- NodeNotSchedulable # Node is not Schedulable (information)*
- CIDRNotAvailable # CIDR not Available (information)*
- CIDRAssignmentFailed # CIDR Assignment Failed (information)*
- Starting # Starting Kubelet (information)*
- KubeletSetupFailed # Kubelet Setup Failed (warning)*
- FailedMount # Volume Mount Failed (warning)*
- NodeSelectorMismatching # Node Selector Mismatch (warning)*
- InsufficientFreeCPU # Insufficient Free CPU (warning)*
- InsufficientFreeMemory # Insufficient Free Mem (warning)*
- OutOfDisk # Out of Disk (information)*
- HostNetworkNotSupported # Host Ntw not Supported (warning)*
- NilShaper # Undefined Shaper (warning)*
- Rebooted # Node Rebooted (warning)*
- NodeHasSufficientDisk # Node Has Sufficient Disk (information)*
- NodeOutOfDisk # Node Out of Disk Space (information)*
- InvalidDiskCapacity # Invalid Disk Capacity (warning)*
- FreeDiskSpaceFailed # Free Disk Space Failed (warning)*
pod:
- Pulling # Pulling Container Image (information)
- Pulled # Ctr Img Pulled (information)
- Failed # Ctr Img Pull/Create/Start Fail (warning)*
- InspectFailed # Ctr Img Inspect Failed (warning)*
- ErrImageNeverPull # Ctr Img NeverPull Policy Violate (warning)*
- BackOff # Back Off Ctr Start, Image Pull (warning)
- Created # Container Created (information)
- Started # Container Started (information)
- Killing # Killing Container (information)*
- Unhealthy # Container Unhealthy (warning)
- FailedSync # Pod Sync Failed (warning)
- FailedValidation # Failed Pod Config Validation (warning)
- OutOfDisk # Out of Disk (information)*
- HostPortConflict # Host/Port Conflict (warning)*
replicationController:
- SuccessfulCreate # Pod Created (information)*
- FailedCreate # Pod Create Failed (warning)*
- SuccessfulDelete # Pod Deleted (information)*
- FailedDelete # Pod Delete Failed (warning)*
Custom Events
Additional events can be collected by the Sysdig agent and displayed in
the Events module, but require more comprehensive configuration steps.
These custom events can be integrated via the Sysdig Monitor Slackbot,
the prebuilt Python script, the Python client, or the REST API.
For brief sample scripts regarding configuring other custom events,
refer to the Custom
Events. For more
information, contact Sysdig Support.
LogDNA Events
Sysdig provides the ability to view LogDNA alerts as Sysdig events.
If you are both a LogDNA and Sysdig Monitor user, you can send alerts
from the LogDNA platform to Sysdig Monitor as Sysdig events. These
events will provide a link redirecting you to LogDNA for further
investigation. Similar to other types of Sysdig Events, you can create
alerts based on the LogDNA events.

The log data provided by LogDNA carries additional details about system
health. The ability to view relevant LogDNA events in Sysdig helps you
debug and monitor the health of a system efficiently.
For example, if the number of logs generated during a deployment is
higher than expected, you get notified through your Sysdig Events feed.
There is no configuration required on the Sysdig Monitor side. For
information on configuring LogDNA to send alerts to Sysdig Monitor, see
Sysdig Alert
Integration.
7.2 -
Custom Events
Sysdig Monitor can ingest any custom event created, including code
deploys, auto-scaling activities, and business level actions. These
events will be automatically overlaid on charts and graphs for easy
correlation of all performance data. The sections below outline the
different ways custom events can be sent to Sysdig Monitor.
Application Integrations
Sysdig Monitor supports event integrations with certain applications by
default. The Sysdig agent will automatically discover these services and
begin collecting event data from them. For more information, refer to
the Events documentation.
Sysdig Monitor Slackbot
Sysdigbot,
the Sysdig Monitor Slackbot, allows users to post custom events directly
to the Sysdig Cloud through chats with a Slack bot.
Prebuilt Python Script
The Sysdig python script provides a way to send events to Sysdig Monitor
directly from the command line, using the following command structure:
python post_event.py SYSDIG_TOKEN NAME [-d DESCRIPTION] [-s SEVERITY] [-c SCOPE] [-t TAGS] [-h]
For more information, refer to the Sysdig Github
repository.
Python Sample Client
The Sysdig Monitor python client acts as a wrapper around the Sysdig
Monitor REST API, exposing most of the REST API functionality to provide
an easy to use and install python interface. The post_event()
function
can be used to send events to Sysdig Monitor from any custom script. An
example script is shown below:
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(os.path.realpath(sys.argv[0])), '..'))
from sdcclient import SdcClient
# Parse arguments
sdc_token = sys.argv[1]
name = sys.argv[2]
# Instantiate the SDC client
sdclient = SdcClient(sdc_token)
# Post the event using post_event(self, name, description=None, severity=None, event_filter=None, tags=None)
res = sdclient.post_event(name)
Curl Sample Client
The Sysdig Monitor REST API offers the full functionality of the Sysdig
Monitor app over API, allowing custom events to be sent directly to the
Sysdig Cloud over the REST API. The example below is a curl request:
#!/bin/bash
SDC_ACCESS_TOKEN='626abc7-YOUR-TOKEN-HERE-3a3ghj432'
ENDPOINT='app.sysdigcloud.com'
curl -X POST -s "https://${ENDPOINT}/api/v2/events" -H 'Content-Type: application/json; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H "Authorization: Bearer ${SDC_ACCESS_TOKEN}" --data-binary '{"event": {"name": "Jenkins - start wordpress deploy", "description": "deploy", "severity": "MEDIUM", "scope": "host.hostName = \"ip-10-1-1-1\" and build = \"89\""}}'
sleep 5s
See also Enable/Disable Event
Data.
7.3 -
Severity and Status
Event Severity
Event severity is broken down into four categories in the Sysdig Monitor
UI, to better visualize issue priority, and allow for easier filtering
practices.
Scripts that used the former severity values (0-7) will continue to work
as expected, as the new categories are simplified groupings of those
values.
The image below outlines the severity value breakdown:

Event Status
There are two primary event states: triggered, and resolved. In
addition, there are two additional statuses available to improve
filtering practices.
Triggered | The circumstances that triggered the event remain in place (for example, the node remains down). |
Resolved | The circumstances that triggered the event are no longer in place (for example, the metric value has returned to within a normal range). |
Acknowledged | Manual label to assist in further filtering the events feed. The acknowledged label is a purely visual marker. It does not reflect the current state (triggered/resolved) of the event. Custom events cannot be marked as acknowledged. |
Unacknowledged | Manual label to assist in further filtering the events feed. All events are marked as unacknowledged by default. |
Silenced | List of silenced event alerts. For more information, see Silence Alert Notifications. |
For more information on filtering the Events feed, refer to Filtering
and Searching Events.
7.4 -
Event Scope
By default, the Events feed displays events from the entire environment.
However, the feed can be configured to only show events from a
particular scope within that environment. The scope of the event feed
can be configured using labels.
Labels refer to a set of meaningful key-value pairs (a whitelist)
defined by Sysdig Monitor. As a user, you can configure
the whitelist. For example, if you are using ECS and have defined
custom container labels, you can configure the
whitelist and add the labels you need. Once done, all the infrastructure
events related to containers are enriched with these labels and the
event scope will display the associated metadata.
For more information on scoping, refer to the Grouping, Scoping, and
Segmenting Metrics
documentation.
To configure the events feed scope:
From the Events
module, click the Edit Scope
link.

Open the top-level drop-down menu.
Select the desired label, either by scrolling through the list, or
by typing the name/partial name into the search bar, and selecting
it.

Open the Operator
drop-down menu, and select the relevant option.
Open the Value
drop-down menu, and select the relevant options.
Optional: Open the next level drop-down menu, and repeat steps
3-5.

Optional: Repeat step 6 for each additional layer of scope
required.
Individual layers of the scope can be removed if necessary, by
clicking the Delete
(x) icon beside the relevant layer.
Click the Apply
button to save the new scope.
Filter Events by Scope
Events are by default filtered by scope in Dashboards and
Explore to show the most relevant events associated with the
selected scope. This capability enables you to quickly narrow down the
potential problems in the area under purview. However, you can turn the
filtering off and see Events from the complete scope. To do so in
Explore:
On the Explore module, click the Options (three dots) icon
and select Events.

The Events panel appears. You can do the following:
Determine whether to show events or not.
Determine the maximum number of events to be displayed in the
Explore table.
Filter events by
Type: The types of events supported are custom events
and alerts. See Event
Types for more
information.
State: The supported states are triggered and
resolved. See Severity and
Status for more
information.
Severity: The supported filters are all severity
levels, high severity only, and both high and medium severity. See
Severity and
Status for more
information.
Resolution: The supported resolutions are both
acknowledged and unacknowledged, acknowledged only, and
unacknowledged only. See Severity and
Status for more
information.
Determine whether to show events by scope. Use the toggle button
to turn off filtering by scope.
If you disable this option, the Explore table will show the feed
for all events in the infrastructure, including those that
are irrelevant to the selected scope. Leave the Filter events
by selected scope option enabled to see only the relevant
events.
Click Save.
Similarly, you can turn off filtering events by scope in
Dashboards.
Reset the Environment Scope
To reset the scope to the entire environment:
From the Events module, click the Edit Scope
link.
Click the Clear All
link.

Click the Apply
button to save the changes.
7.5 -
Event alerts can be created (for custom events) and configured (for
alert events, and custom events with a previously created alert) from
the Event Details
panel:
From the Events module, select the event from the feed to open the
Event Details
panel.
Open the Configure Alert
panel:
For existing alerts, click the Edit Alert
link.
For new alerts, click the Create Alert from Event
button.
Configure the alert as necessary. For more information on
configuring alerts, refer to the
Alerts documentation.
New alerts will be auto-filled with information from the custom
event.
Click the Create button
for new alerts, or the Save
button for
existing alerts.
7.6 -
Filtering and Searching Events
Filter Events
The events feed can be filtered in multiple ways, to drill down into the
environment’s history and refine the events displayed. The feed can be
filtered by severity, type, and/or status. Examples of each are shown
below.
The example below shows only high and medium severity events:

The example below shows only Kubernetes events:

The example below shows only events that are Unacknowledged:
The Acknowledged label is a purely visual marker, and does not reflect
the current state (triggered/resolved) of the event. By default, all
events are Unacknowledged.

The example below shows medium severity Alert events that remain
Triggered, but have been acknowledged:

Search for an Event
The event feeds can be searched by using the search icon in the top bar:

7.7 -
Review Events
Events can be reviewed in detail by clicking on the event listing in the
feed:

To review the environment at the time of the event in detail, click the
Explore
button to navigate to the Explore module. The Explore module
will automatically drill-down to the impacted environment objects.
The Event Details Panel
The Event Details panel contains detailed information about the event.
This information is different, depending on whether the event is an
Alert event or a Custom event.
Alert Events
The example below is of an Alert event:

Metadata | Description |
---|
Event ID | The unique ID of the event. |
Severity | The severity of the event (High, Medium, Low, Info). |
State | The current state of the event (Triggered, Resolved) |
Duration | The length of time the event lasted. |
Acknowledged | Whether the event has been acknowledged or not. |
Trigger | The cause of the event (for example, the metric that exceeded the defined range, and the value it reached). |
Entity | The entity on which the event occurred. |
Start Time | The date and time the event started. |
End Time | The date and time the event ended. |
Alert Name | The name of the alert that was triggered. |
Type | The type of alert. |
Metrics | The metric/s that were affected. |
Trigger Condition | The condition that was met to trigger the alert. |
Scope | The scope of the alert. |
Segment | The segmentation applied to the alert. |
To configure the alert that created the event, click the Edit Alert
link in the Event Details panel. For more information about alerts,
refer to the Alerts
documentation.
Security Events
Policy
The example shows an event reporting a potentially unauthorized terminal
shell in a container. For more information on Policy alerts, see Secure
Events.

Metadata | Description |
---|
Event ID | The unique ID of the event. |
Severity | The severity of the event (High, Medium, Low, Info). |
Date / Time | The date and time the event occurred. |
Host | The hostname and physical (MAC) address. |
Container | The container name, unique identifier, and image. |
Summary | A detailed description of what occurred. |
Scanning
The example is a high-severity event alerting on a change in the scan
result of an Elasticsearch image on Quay. For more information on
Scanning, see Scanning
Alerts.

Metadata | Description |
---|
Event ID | The unique ID of the event. |
Severity | The severity of the event (High, Medium, Low, Info). |
Date / Time | The date and time the event occurred. |
Image Registry | The repository where the image resides (for example, Quay). |
Tag | The tag associated with the image. |
Image ID | The unique identifier of the image. |
Digest | A content-addressable identifier which contains the SHA256 hash of the image’s JSON configuration object. |
Infrastructure and Custom Events
Infrastructure and custom events display the same set of information in
the Event Details
panel. The example below is a Docker event:

Metadata | Description |
---|
Event ID | The unique ID of the event. |
Severity | The severity of the event (High, Medium, Low, Info). |
Date / Time | The date and time the event occurred. |
Source | The source of the event (for example, Docker). |
Scope | The scope of the event. |
Description | A detailed description of what occurred. |
8 -
Monitoring Integrations
Integrations for Sysdig Monitor include a number of platforms, orchestrators, and a wide range of applications designed to extend Monitor capabilities and collect metrics from these systems. Sysdig collects metrics from Prometheus, JMX, StatsD, Kubernetes, and a number of applications to provide a 360-degree view of your infrastructure. Many metrics are collected out of the box; you can also extend the integration or create custom metrics to receive curated insights into your infrastructure stack.
Key Benefits
Collects the richest data set for cloud-native visibility and security.
Polls data and auto-discovers context in order to provide operational and security insights.
Simplifies deploying your monitoring integrations by providing guided configuration, a curated list of enterprise-grade images, and integration with CI/CD workflows.
Extends the power of Prometheus metrics with additional insight from other metric types and the infrastructure stack.
Employs Prometheus alerts and events, and provides ready-to-use dashboards for Kubernetes monitoring needs.
Exposes application metrics using Java JMX and MBeans monitoring.
Key Integrations
Inbound
Monitoring Integrations
Describes how to configure Monitoring Integration in your infrastructure and receive deeper insight into the health and performance of your services across platforms and the cloud.
Prometheus Metrics
Describes how the Sysdig agent automatically collects metrics from services that expose native Prometheus metrics as well as from applications with Prometheus exporters, how to set up your environment, and how to scrape Prometheus metrics seamlessly.
Agent Installation
Learn how to install Sysdig agents on supported platforms.
AWS CloudWatch
Illustrates how to configure Sysdig to collect various types of CloudWatch metrics.
Java Management Extensions (JMX) Metrics
Describes how to configure your Java virtual machines so the Sysdig
agent can collect JMX metrics using the JMX protocol.
StatsD Metrics
Describes how the Sysdig agent collects custom StatsD metrics with
an embedded StatsD server.
Node.js Metrics
Illustrates how Sysdig monitors Node.js applications by linking a library to the Node.js codebase.
Monitor Log Files
Learn how to search for a string in log files by using the chisel script called logwatcher.
(legacy) Integrate Applications
Describes the monitoring capabilities of Sysdig agent with application check scripts or ‘app checks’.
Outbound
Notification Channels
Learn how to add, edit, or delete a variety of notification channel types, and how to disable or delete notifications when they are not needed, for example, during scheduled downtime.
S3 Capture Storage
Learn how to configure Sysdig to use an AWS S3 bucket or custom S3 storage for storing Capture files.
For Sysdig instances deployed on IBM Cloud Monitoring with
Sysdig, an additional form of metrics
collection is offered: Platform metrics. When enabled, Platform metrics
are reported to Sysdig directly by the IBM Cloud infrastructure rather
than being collected by the Sysdig agent.
Enable this feature by logging into the IBM Cloud console and selecting
“Enable” for IBM Platform metrics under the Configure your resource
section when creating a new IBM Cloud Monitoring with Sysdig instance,
as described
here.
8.1 -
Monitoring Integration provides an at-a-glance summary of workloads
running in your infrastructure and a deeper insight into the health and
performance of your services across platforms and the cloud. You can
easily identify the workloads in your team scope and the services discovered
(such as etcd) within each workload, and configure the Prometheus
exporter integration to collect and visualize time series metrics.
Monitoring Integration also powers Alerts
Library.

The following statuses indicate the state of each service
integration:
Reporting Metrics: The integration is configured correctly and
is reporting metrics.
Needs Attention: An integration has stopped working and is no
longer reporting metrics or requires some other type of attention.
Pending Metrics: An integration has recently been configured and
is waiting to receive metrics.
Configure Integration: The integration needs to be configured,
and therefore no metrics are reported.
Ensure that you meet the prerequisites given in Guidelines for Monitoring Integrations to make the best use of this feature.
Access Monitoring Integrations
Log in to Sysdig Monitor.
Select Integration > Monitoring Integration in the management section of the
left-hand sidebar.

The Integrations page is displayed. Continue with the steps below to
configure an integration.
Locate the service that you want to configure an integration for. To
do so, identify the workload and drill down to the grouping where
the service is running.
To locate the service, you can use one of the following:
- Text search
- Type filtering
- Left navigation to filter the workload and then use text search
or type filtering
- Use the Configure Integration option on the top, and locate the service using text search or type filtering
Click Configure Integration.
- Click Start Installation.

- Review the prerequisites.
- Do one of the following:
- Dry Run: Use kubectl command to install the service. Follow the on-screen instructions to complete the tasks successfully.
- Patch: Install directly on your workload. Follow the on-screen instructions to complete the tasks successfully.
- Manual: Use an exporter and install the service manually. Click Documentation to learn more about the service exporter and how to integrate it with Sysdig Monitor.
Click Validate to validate the installation.
Make sure that the wizard shows the Installation Complete screen.

Click Close to close the window.
Show Unidentified Workloads
The services that Sysdig Monitor cannot discover can technically still
be monitored through the Unidentified Workloads option. You can view
the workloads with these unidentified services or applications and see
their status. To do so, use the Unidentified Workloads slider at the
top right corner of the Integration page.
Learn More
8.1.1 -
Guidelines for Monitoring Integrations
If you are directed to this page from the Sysdig Monitor app, your agent deployment might include a configuration that does one of the following:
- Prohibits the use of Monitoring Integrations
- Affects the metrics you are already collecting
Ensure that you meet the prerequisites to successfully use Monitoring Integrations. For technical assistance, contact Sysdig Support.
Prerequisites
Upgrade Sysdig agent to v12.0.0
If you have clusters with more than 50 nodes and you don’t have the prom_service_discovery
option enabled:
- Enabling the latest Prometheus features might create an additional connection to the Kubernetes API server from each Sysdig agent in your environment. The surge in agent connections can increase the CPU and memory load in your API servers. Therefore, ensure that your API servers are suitably sized to handle the increased load in large clusters.
- If you encounter any problems contact Sysdig Support.
Remove the following manual configurations in the dragent.yaml
file because they might interfere with those provided by Sysdig:
use_promscrape
promscrape_fastproto
prom_service_discovery
prometheus.max_metrics
prometheus.ingest_raw
prometheus.ingest_calculated
The sysdig_sd_configs
configuration is no longer supported. Remove the existing prometheus.yaml
if it includes the sysdig_sd_configs
configuration.
If you are not currently using Prometheus metrics in Sysdig Monitor, you can skip the following steps:
If you are using a custom Prometheus process_filter
in dragent.yaml
to trigger scraping, see Migrating from Promscrape V1 to V2.
If you are using service annotations or container labels to find scrape targets, you may need to create new scrape_configs
in prometheus.yaml
, preferably based on Kubernetes pods service discovery. This configuration can be complicated in certain environments and therefore we recommend that you contact Sysdig support for help.
Learn More
8.1.2 -
Each Monitoring Integration holds a specific job that scrapes its metrics and sends them to Sysdig Monitor. To optimize metrics scraping for building dashboards and alerts in Sysdig Monitor, Sysdig offers default jobs for these integrations. Periodically, the Sysdig agent connects with Sysdig Monitor, retrieves the default jobs, and makes the Monitoring Integrations available for use. See the list of the available integrations and corresponding jobs.
You can find all the jobs in the /opt/draios/etc/promscrape.yaml
file in the sysdig-agent
container in your cluster.
Supported Monitoring Integrations
Integration | Out of the Box | Enabled by default | Job name in config file |
---|
Apache | | ✔ | apache-exporter-default, apache-grok-default |
Ceph | ✔ | ✔ | ceph-default |
Consul | ✔ | ✔ | consul-server-default, consul-envoy-default |
ElasticSearch | | ✔ | elasticsearch-default |
Fluentd | ✔ | ✔ | fluentd-default |
HaProxy | ✔ | ✔ | haproxy-default |
Harbor | ✔ | ✔ | harbor-exporter-default, harbor-core-default, harbor-registry-default, harbor-jobservice-default |
Kubernetes API Server | ✔ | | kubernetes-apiservers-default |
Kubernetes Control Plane | ✔ | ✔ | kube-dns-default, kube-scheduler-default, kube-controller-manager-default |
Kubernetes Etcd | ✔ | ✔ | etcd-default |
Kubelet | ✔ | | k8s-kubelet-default |
Kube-Proxy | ✔ | | kubernetes-kube-proxy-default |
Kubernetes Persistent Volume Claim | ✔ | | k8s-pvc-default |
Kubernetes Storage | ✔ | | k8s-storage-default |
Keda | ✔ | ✔ | keda-default |
Memcached | | ✔ | memcached-default |
MongoDB | | ✔ | mongodb-default |
MySQL | | ✔ | mysql-default |
Nginx | | ✔ | nginx-default |
Nginx Ingress | ✔ | ✔ | nginx-ingress-default |
NTP | | ✔ | ntp-default |
Open Policy Agent - Gatekeeper | ✔ | ✔ | opa-default |
Php-fpm | | ✔ | php-fpm-default |
Portworx | ✔ | ✔ | portworx-default, portworx-openshift-default |
PostgreSQL | | ✔ | postgres-default |
Prometheus Default Job | ✔ | ✔ | k8s-pods |
RabbitMQ | ✔ | ✔ | rabbitmq-default |
Redis | | ✔ | redis-default |
Sysdig Admission Controller | ✔ | ✔ | sysdig-admission-controller-default |
Enable and Disable Integrations
Some integrations are disabled by default due to the potential high cardinality of their metrics. To enable them, contact Sysdig Support. The same applies if you want integrations disabled by default in all your clusters.
Customize a Default Job
The default jobs offered by Sysdig for integrations are optimized to scrape the metrics for building dashboards and alerts in Sysdig Monitor. Instead of processing all the metrics available, you can determine which metrics to include or exclude for your requirements. To do so, you can overwrite the default configuration in the prometheus.yaml
file. The prometheus.yaml
file is located in the sysdig-agent
ConfigMap in the sysdig-agent
namespace.
You can overwrite the default job for a specific integration by adding a new job to the prometheus.yaml
file with the same name as the default job that you want to replace. For example, if you want to create a new job for the Apache integration, create a new job with the name apache-default
. Jobs defined by the user take precedence over the default ones.
See Supported Monitoring Integrations for the complete list of integrations and corresponding job names.
Use Sysdig Annotations in Exporters
Sysdig provides a set of Helm charts that help you configure the exporters for the integrations. For more information on installing Monitoring Integrations, see the Monitoring Integrations
option in Sysdig Monitor. Additionally, the Helm charts are publicly available in the Sysdig Helm repository.
If exporters are already installed in your cluster, you can use the standard Prometheus annotations and the Sysdig agent will automatically scrape them.
For example, if you use the annotations given below, the incoming metrics will carry information about the pod that generated them.
spec:
template:
metadata:
annotations:
prometheus.io/path: /metrics
prometheus.io/port: '9100'
prometheus.io/scrape: 'true'
If you use an exporter, the incoming metrics will be associated with the exporter pod, not the application pod.
To change this behavior, you can use the Sysdig-provided annotations and configure the exporter on the agent.
Annotate the Exporter
Use the following annotations to configure the exporter:
spec:
template:
metadata:
annotations:
promcat.sysdig.com/port: '9187'
promcat.sysdig.com/target_ns: my-namespace
promcat.sysdig.com/target_workload_type: deployment
promcat.sysdig.com/target_workload_name: my-workload
promcat.sysdig.com/integration_type: my-integration
- port: The port to scrape for metrics on the exporter.
- target_ns: The namespace of the workload corresponding to the application (not the exporter).
- target_workload_type: The type of the workload of the application (not the exporter). The possible values are deployment, statefulset, and daemonset.
- target_workload_name: The name of the workload corresponding to the application (not the exporter).
- integration_type: The type of the integration. The job created in the Sysdig agent uses this value to find the exporter.
Edit the prometheus.yaml
file to configure a new job in Sysdig agent. The file is located in the sysdig-agent
ConfigMap in the sysdig-agent
namespace.
You can use the following example template:
- job_name: my-integration
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: 'my-integration' # Use here the integration type that you defined in your annotations
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
Exclude a Deployment from Being Scraped
If you want the agent to exclude a deployment from being scraped, use the following annotation:
spec:
template:
metadata:
annotations:
promcat.sysdig.com/omit: 'true'
Learn More
8.2 -
Sysdig Monitor Integrations
8.2.1 -

Apache
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Apache] No Instance Up | No instances up | Prometheus |
[Apache] Up Time Less Than One Hour | Instance with UpTime less than one hour | Prometheus |
[Apache] Time Since Last OK Request More Than One Hour | Time since last OK request higher than one hour | Prometheus |
[Apache] High Error Rate | High error rate | Prometheus |
[Apache] High Rate Of Busy Workers In Instance | Low workers in open_slot state | Prometheus |
List of dashboards:
List of metrics:
- apache_accesses_total
- apache_connections
- apache_cpuload
- apache_duration_ms_total
- apache_http_last_request_seconds
- apache_http_response_codes_total
- apache_scoreboard
- apache_sent_kilobytes_total
- apache_up
- apache_uptime_seconds_total
- apache_workers
8.2.2 -

Ceph
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Ceph] Ceph Manager is absent | Ceph Manager has disappeared from Prometheus target discovery. | Prometheus |
[Ceph] Ceph Manager is missing replicas | Ceph Manager is missing replicas. | Prometheus |
[Ceph] Ceph quorum at risk | Storage cluster quorum is low. Contact Support. | Prometheus |
[Ceph] High number of leader changes | Ceph Monitor has seen a lot of leader changes per minute recently. | Prometheus |
List of dashboards:
List of metrics:
- ceph_cluster_total_bytes
- ceph_cluster_total_used_bytes
- ceph_health_status
- ceph_mgr_status
- ceph_mon_metadata
- ceph_mon_num_elections
- ceph_mon_quorum_status
- ceph_osd_apply_latency_ms
- ceph_osd_commit_latency_ms
- ceph_osd_in
- ceph_osd_metadata
- ceph_osd_numpg
- ceph_osd_op_r
- ceph_osd_op_r_latency_count
- ceph_osd_op_r_latency_sum
- ceph_osd_op_r_out_bytes
- ceph_osd_op_w
- ceph_osd_op_w_in_bytes
- ceph_osd_op_w_latency_count
- ceph_osd_op_w_latency_sum
- ceph_osd_recovery_bytes
- ceph_osd_recovery_ops
- ceph_osd_up
- ceph_pool_max_avail
Related blog posts:
8.2.3 -

Consul
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Consul] KV Store update time anomaly | KV Store update time anomaly | Prometheus |
[Consul] Transaction time anomaly | Transaction time anomaly | Prometheus |
[Consul] Raft transactions count anomaly | Raft transactions count anomaly | Prometheus |
[Consul] Raft commit time anomaly | Raft commit time anomaly | Prometheus |
[Consul] Leader time to contact followers too high | Leader time to contact followers too high | Prometheus |
[Consul] Flapping leadership | Flapping leadership | Prometheus |
[Consul] Too many elections | Too many elections | Prometheus |
[Consul] Server cluster unhealthy | Server cluster unhealthy | Prometheus |
[Consul] Zero failure tolerance | Zero failure tolerance | Prometheus |
[Consul] Client RPC requests anomaly | Consul client RPC requests anomaly | Prometheus |
[Consul] Client RPC requests rate limit exceeded | Consul client RPC requests rate limit exceeded | Prometheus |
[Consul] Client RPC requests failed | Consul client RPC requests failed | Prometheus |
[Consul] License Expiry | Consul License Expiry | Prometheus |
[Consul] Garbage Collection pause high | Consul Garbage Collection pause high | Prometheus |
[Consul] Garbage Collection pause too high | Consul Garbage Collection pause too high | Prometheus |
[Consul] Raft restore duration too high | Consul Raft restore duration too high | Prometheus |
[Consul] RPC requests error rate is high | Consul RPC requests error rate is high | Prometheus |
[Consul] Cache hit rate is low | Consul Cache hit rate is low | Prometheus |
[Consul] High 4xx RequestError Rate | High 4xx RequestError Rate | Prometheus |
[Consul] High Request Latency | Envoy High Request Latency | Prometheus |
[Consul] High Response Latency | Envoy High Response Latency | Prometheus |
[Consul] Certificate close to expire | Certificate close to expire | Prometheus |
List of dashboards:
List of metrics:
- consul_autopilot_failure_tolerance
- consul_autopilot_healthy
- consul_client_rpc
- consul_client_rpc_exceeded
- consul_client_rpc_failed
- consul_consul_cache_bypass
- consul_consul_cache_entries_count
- consul_consul_cache_evict_expired
- consul_consul_cache_fetch_error
- consul_consul_cache_fetch_success
- consul_kvs_apply_sum
- consul_raft_apply
- consul_raft_commitTime_sum
- consul_raft_fsm_lastRestoreDuration
- consul_raft_leader_lastContact
- consul_raft_leader_oldestLogAge
- consul_raft_rpc_installSnapshot
- consul_raft_state_candidate
- consul_raft_state_leader
- consul_rpc_cross_dc
- consul_rpc_queries_blocking
- consul_rpc_query
- consul_rpc_request
- consul_rpc_request_error
- consul_runtime_gc_pause_ns
- consul_runtime_gc_pause_ns_sum
- consul_system_licenseExpiration
- consul_txn_apply_sum
- envoy_cluster_membership_change
- envoy_cluster_membership_healthy
- envoy_cluster_membership_total
- envoy_cluster_upstream_cx_active
- envoy_cluster_upstream_cx_connect_ms_bucket
- envoy_cluster_upstream_rq_active
- envoy_cluster_upstream_rq_pending_active
- envoy_cluster_upstream_rq_time_bucket
- envoy_cluster_upstream_rq_xx
- envoy_server_days_until_first_cert_expiring
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
8.2.4 -

Elasticsearch
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Elasticsearch] Heap Usage Too High | The heap usage is over 90% | Prometheus |
[Elasticsearch] Heap Usage Warning | The heap usage is over 80% | Prometheus |
[Elasticsearch] Disk Space Low | Disk available less than 20% | Prometheus |
[Elasticsearch] Disk Out Of Space | Disk available less than 10% | Prometheus |
[Elasticsearch] Cluster Red | Cluster in Red status | Prometheus |
[Elasticsearch] Cluster Yellow | Cluster in Yellow status | Prometheus |
[Elasticsearch] Relocation Shards | Relocating shards for too long | Prometheus |
[Elasticsearch] Initializing Shards | Initializing shards takes too long | Prometheus |
[Elasticsearch] Unassigned Shards | Unassigned shards for long time | Prometheus |
[Elasticsearch] Pending Tasks | Elasticsearch has a high number of pending tasks | Prometheus |
[Elasticsearch] No New Documents | Elasticsearch has no new documents for a period of time | Prometheus |
List of dashboards:
- ElasticSearch_Cluster
- ElasticSearch_Infra
List of metrics:
- elasticsearch_cluster_health_active_primary_shards
- elasticsearch_cluster_health_active_shards
- elasticsearch_cluster_health_initializing_shards
- elasticsearch_cluster_health_number_of_data_nodes
- elasticsearch_cluster_health_number_of_nodes
- elasticsearch_cluster_health_number_of_pending_tasks
- elasticsearch_cluster_health_relocating_shards
- elasticsearch_cluster_health_status
- elasticsearch_cluster_health_unassigned_shards
- elasticsearch_filesystem_data_available_bytes
- elasticsearch_filesystem_data_size_bytes
- elasticsearch_indices_docs
- elasticsearch_indices_indexing_index_time_seconds_total
- elasticsearch_indices_indexing_index_total
- elasticsearch_indices_merges_total_time_seconds_total
- elasticsearch_indices_search_query_time_seconds
- elasticsearch_indices_store_throttle_time_seconds_total
- elasticsearch_jvm_gc_collection_seconds_count
- elasticsearch_jvm_gc_collection_seconds_sum
- elasticsearch_jvm_memory_committed_bytes
- elasticsearch_jvm_memory_max_bytes
- elasticsearch_jvm_memory_pool_peak_used_bytes
- elasticsearch_jvm_memory_used_bytes
- elasticsearch_os_load1
- elasticsearch_os_load15
- elasticsearch_os_load5
- elasticsearch_process_cpu_percent
- elasticsearch_transport_rx_size_bytes_total
- elasticsearch_transport_tx_size_bytes_total
8.2.5 -

Fluentd
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Fluentd] No Input From Container | No Input From Container. | Prometheus |
[Fluentd] High Error Ratio | High Error Ratio. | Prometheus |
[Fluentd] High Retry Ratio | High Retry Ratio. | Prometheus |
[Fluentd] High Retry Wait | High Retry Wait. | Prometheus |
[Fluentd] Low Buffer Available Space | Low Buffer Available Space. | Prometheus |
[Fluentd] Buffer Queue Length Increasing | Buffer Queue Length Increasing. | Prometheus |
[Fluentd] Buffer Total Bytes Increasing | Buffer Total Bytes Increasing. | Prometheus |
[Fluentd] High Slow Flush Ratio | High Slow Flush Ratio. | Prometheus |
[Fluentd] No Output Records From Plugin | No Output Records From Plugin. | Prometheus |
List of dashboards:
List of metrics:
- fluentd_input_status_num_records_total
- fluentd_output_status_buffer_available_space_ratio
- fluentd_output_status_buffer_queue_length
- fluentd_output_status_buffer_total_bytes
- fluentd_output_status_emit_count
- fluentd_output_status_emit_records
- fluentd_output_status_flush_time_count
- fluentd_output_status_num_errors
- fluentd_output_status_retry_count
- fluentd_output_status_retry_wait
- fluentd_output_status_rollback_count
- fluentd_output_status_slow_flush_count
8.2.6 -

Haproxy-ingress
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Haproxy-Ingress] Uptime less than 1 hour | This alert detects when all of the instances of the ingress controller have an uptime of less than 1 hour. | Prometheus |
[Haproxy-Ingress] Frontend Down | This alert detects when a frontend has all of its instances down for more than 10 minutes. | Prometheus |
[Haproxy-Ingress] Backend Down | This alert detects when a backend has all of its instances down for more than 10 minutes. | Prometheus |
[Haproxy-Ingress] High Sessions Usage | This alert triggers when the backend sessions exceed 85% of the session capacity for 10 minutes. | Prometheus |
[Haproxy-Ingress] High Error Rate | This alert triggers when there is an error rate over 15% for over 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Request Denied Rate | These alerts detect when there is a denied rate of requests over 10% for over 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Response Denied Rate | These alerts detect when there is a denied rate of responses over 10% for over 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Response Rate | This alert triggers when a proxy has a mean response time higher than 250ms for over 10 minutes. | Prometheus |
List of dashboards:
- HAProxy_Ingress_Overview
- HAProxy_Ingress_Service_Details
List of metrics:
- haproxy_backend_bytes_in_total
- haproxy_backend_bytes_out_total
- haproxy_backend_client_aborts_total
- haproxy_backend_connect_time_average_seconds
- haproxy_backend_current_queue
- haproxy_backend_http_requests_total
- haproxy_backend_http_responses_total
- haproxy_backend_limit_sessions
- haproxy_backend_queue_time_average_seconds
- haproxy_backend_requests_denied_total
- haproxy_backend_response_time_average_seconds
- haproxy_backend_responses_denied_total
- haproxy_backend_sessions_total
- haproxy_backend_status
- haproxy_frontend_bytes_in_total
- haproxy_frontend_bytes_out_total
- haproxy_frontend_connections_total
- haproxy_frontend_denied_connections_total
- haproxy_frontend_denied_sessions_total
- haproxy_frontend_request_errors_total
- haproxy_frontend_requests_denied_total
- haproxy_frontend_responses_denied_total
- haproxy_frontend_status
- haproxy_process_active_peers
- haproxy_process_current_connection_rate
- haproxy_process_current_run_queue
- haproxy_process_current_session_rate
- haproxy_process_current_tasks
- haproxy_process_jobs
- haproxy_process_ssl_connections_total
- haproxy_process_start_time_seconds
8.2.7 -

Harbor
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Harbor] Harbor Core Is Down | Harbor Core Is Down | Prometheus |
[Harbor] Harbor Database Is Down | Harbor Database Is Down | Prometheus |
[Harbor] Harbor Registry Is Down | Harbor Registry Is Down | Prometheus |
[Harbor] Harbor Redis Is Down | Harbor Redis Is Down | Prometheus |
[Harbor] Harbor Trivy Is Down | Harbor Trivy Is Down | Prometheus |
[Harbor] Harbor JobService Is Down | Harbor JobService Is Down | Prometheus |
[Harbor] Project Quota Is Raising The Limit | Project Quota Is Raising The Limit | Prometheus |
[Harbor] Harbor p99 latency is higher than 10 seconds | Harbor p99 latency is higher than 10 seconds | Prometheus |
[Harbor] Harbor Error Rate is High | Harbor Error Rate is High | Prometheus |
List of dashboards:
List of metrics:
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- harbor_artifact_pulled
- harbor_core_http_request_duration_seconds
- harbor_jobservice_task_process_time_seconds
- harbor_project_member_total
- harbor_project_quota_byte
- harbor_project_quota_usage_byte
- harbor_project_repo_total
- harbor_project_total
- harbor_quotas_size_bytes
- harbor_task_concurrency
- harbor_task_queue_latency
- harbor_task_queue_size
- harbor_up
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
- registry_http_request_duration_seconds_bucket
- registry_http_request_size_bytes_bucket
- registry_http_requests_total
- registry_http_response_size_bytes_bucket
- registry_storage_action_seconds_bucket
8.2.8 -

K8s-etcd
This integration is enabled by default.
List of dashboards:
List of metrics:
- etcd_debugging_mvcc_db_total_size_in_bytes
- etcd_disk_backend_commit_duration_seconds_bucket
- etcd_disk_wal_fsync_duration_seconds_bucket
- etcd_grpc_proxy_cache_hits_total
- etcd_grpc_proxy_cache_misses_total
- etcd_network_client_grpc_received_bytes_total
- etcd_network_client_grpc_sent_bytes_total
- etcd_network_peer_received_bytes_total
- etcd_network_peer_received_failures_total
- etcd_network_peer_round_trip_time_seconds_bucket
- etcd_network_peer_sent_bytes_total
- etcd_network_peer_sent_failures_total
- etcd_server_has_leader
- etcd_server_id
- etcd_server_leader_changes_seen_total
- etcd_server_proposals_applied_total
- etcd_server_proposals_committed_total
- etcd_server_proposals_failed_total
- etcd_server_proposals_pending
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- grpc_server_handled_total
- grpc_server_started_total
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
- sysdig_container_cpu_cores_used
- sysdig_container_memory_used_bytes
8.2.9 -

Keda
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Keda] Errors in Scaled Object | Errors detected in scaled object | Prometheus |
List of dashboards:
List of metrics:
- keda_metrics_adapter_scaled_object_errors
- keda_metrics_adapter_scaler_metrics_value
- kubernetes.hpa.replicas.current
- kubernetes.hpa.replicas.desired
- kubernetes.hpa.replicas.max
- kubernetes.hpa.replicas.min
8.2.10 -

Memcached
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Memcached] Instance Down | Instance is not reachable | Prometheus |
[Memcached] Low UpTime | Uptime of less than 1 hour in a Memcached instance | Prometheus |
[Memcached] Connection Throttled | Connection throttled because max number of requests per event process reached | Prometheus |
[Memcached] Connections Close To The Limit 85% | The number of connections is close to the limit | Prometheus |
[Memcached] Connections Limit Reached | Reached the number of maximum connections and caused a connection error | Prometheus |
List of dashboards:
List of metrics:
- memcached_commands_total
- memcached_connections_listener_disabled_total
- memcached_connections_yielded_total
- memcached_current_bytes
- memcached_current_connections
- memcached_current_items
- memcached_items_evicted_total
- memcached_items_reclaimed_total
- memcached_items_total
- memcached_limit_bytes
- memcached_max_connections
- memcached_up
- memcached_uptime_seconds
8.2.11 -

Mongodb
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[MongoDB] Instance Down | Mongo server detected down by instance | Prometheus |
[MongoDB] Uptime less than one hour | Uptime of less than one hour in instance | Prometheus |
[MongoDB] Asserts detected | Asserts detected in instance | Prometheus |
[MongoDB] High Latency | High latency in instance | Prometheus |
[MongoDB] High Ticket Utilization | Ticket usage over 75% in instance | Prometheus |
[MongoDB] Recurrent Cursor Timeout | Recurrent cursors timeout in instance | Prometheus |
[MongoDB] Recurrent Memory Page Faults | Recurrent memory page faults in instance | Prometheus |
List of dashboards:
- MongoDB_Database_Details
- MongoDB_Instance_Health
List of metrics:
- mongodb_asserts_total
- mongodb_connections
- mongodb_extra_info_page_faults_total
- mongodb_instance_uptime_seconds
- mongodb_memory
- mongodb_mongod_db_collections_total
- mongodb_mongod_db_data_size_bytes
- mongodb_mongod_db_index_size_bytes
- mongodb_mongod_db_indexes_total
- mongodb_mongod_db_objects_total
- mongodb_mongod_global_lock_client
- mongodb_mongod_global_lock_current_queue
- mongodb_mongod_global_lock_ratio
- mongodb_mongod_metrics_cursor_open
- mongodb_mongod_metrics_cursor_timed_out_total
- mongodb_mongod_op_latencies_latency_total
- mongodb_mongod_op_latencies_ops_total
- mongodb_mongod_wiredtiger_cache_bytes
- mongodb_mongod_wiredtiger_cache_bytes_total
- mongodb_mongod_wiredtiger_cache_evicted_total
- mongodb_mongod_wiredtiger_cache_pages
- mongodb_mongod_wiredtiger_concurrent_transactions_out_tickets
- mongodb_mongod_wiredtiger_concurrent_transactions_total_tickets
- mongodb_network_bytes_total
- mongodb_network_metrics_num_requests_total
- mongodb_op_counters_total
- mongodb_up
- net.error.count
8.2.12 -

Mysql
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[MySQL] Mysql Down | MySQL instance is down | Prometheus |
[MySQL] Mysql Restarted | MySQL has just been restarted, less than one minute ago | Prometheus |
[MySQL] Mysql Too Many Connections (>80%) | More than 80% of MySQL connections are in use | Prometheus |
[MySQL] Mysql High Threads Running | More than 60% of MySQL connections are in running state | Prometheus |
[MySQL] Mysql High Open Files | More than 80% of MySQL files open | Prometheus |
[MySQL] Mysql Slow Queries | MySQL server has new slow queries | Prometheus |
[MySQL] Mysql Innodb Log Waits | MySQL innodb log writes stalling | Prometheus |
[MySQL] Mysql Slave Io Thread Not Running | MySQL Slave IO thread not running | Prometheus |
[MySQL] Mysql Slave Sql Thread Not Running | MySQL Slave SQL thread not running | Prometheus |
[MySQL] Mysql Slave Replication Lag | MySQL Slave replication lag | Prometheus |
List of dashboards:
List of metrics:
- mysql_global_status_aborted_clients
- mysql_global_status_aborted_connects
- mysql_global_status_buffer_pool_pages
- mysql_global_status_bytes_received
- mysql_global_status_bytes_sent
- mysql_global_status_commands_total
- mysql_global_status_connection_errors_total
- mysql_global_status_innodb_buffer_pool_read_requests
- mysql_global_status_innodb_buffer_pool_reads
- mysql_global_status_innodb_log_waits
- mysql_global_status_innodb_mem_adaptive_hash
- mysql_global_status_innodb_mem_dictionary
- mysql_global_status_innodb_page_size
- mysql_global_status_questions
- mysql_global_status_select_full_join
- mysql_global_status_select_full_range_join
- mysql_global_status_select_range_check
- mysql_global_status_select_scan
- mysql_global_status_slow_queries
- mysql_global_status_sort_merge_passes
- mysql_global_status_sort_range
- mysql_global_status_sort_rows
- mysql_global_status_sort_scan
- mysql_global_status_table_locks_immediate
- mysql_global_status_table_locks_waited
- mysql_global_status_table_open_cache_hits
- mysql_global_status_table_open_cache_misses
- mysql_global_status_threads_cached
- mysql_global_status_threads_connected
- mysql_global_status_threads_created
- mysql_global_status_threads_running
- mysql_global_status_uptime
- mysql_global_variables_innodb_additional_mem_pool_size
- mysql_global_variables_innodb_log_buffer_size
- mysql_global_variables_innodb_open_files
- mysql_global_variables_key_buffer_size
- mysql_global_variables_max_connections
- mysql_global_variables_open_files_limit
- mysql_global_variables_query_cache_size
- mysql_global_variables_thread_cache_size
- mysql_global_variables_tokudb_cache_size
- mysql_slave_status_master_server_id
- mysql_slave_status_seconds_behind_master
- mysql_slave_status_slave_io_running
- mysql_slave_status_slave_sql_running
- mysql_slave_status_sql_delay
- mysql_up
Related blog posts:
8.2.13 -

Nginx
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Nginx] No Instances Up | No Nginx instances up | Prometheus |
List of dashboards:
List of metrics:
- net.bytes.in
- net.bytes.out
- net.http.error.count
- net.http.request.count
- net.http.request.time
- nginx_connections_accepted
- nginx_connections_active
- nginx_connections_handled
- nginx_connections_reading
- nginx_connections_waiting
- nginx_connections_writing
- nginx_up
8.2.14 -

Nginx-ingress
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Nginx-Ingress] High Http 4xx Error Rate | Too many HTTP requests with status 4xx (> 5%) | Prometheus |
[Nginx-Ingress] High Http 5xx Error Rate | Too many HTTP requests with status 5xx (> 5%) | Prometheus |
[Nginx-Ingress] High Latency | Nginx p99 latency is higher than 10 seconds | Prometheus |
[Nginx-Ingress] Ingress Certificate Expiry | Nginx Ingress Certificate will expire in less than 14 days | Prometheus |
List of dashboards:
- Nginx_Kubernetes_Ingress_Controller
List of metrics:
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- nginx_ingress_controller_config_last_reload_successful
- nginx_ingress_controller_config_last_reload_successful_timestamp_seconds
- nginx_ingress_controller_ingress_upstream_latency_seconds_count
- nginx_ingress_controller_ingress_upstream_latency_seconds_sum
- nginx_ingress_controller_nginx_process_connections
- nginx_ingress_controller_nginx_process_cpu_seconds_total
- nginx_ingress_controller_nginx_process_resident_memory_bytes
- nginx_ingress_controller_request_duration_seconds_bucket
- nginx_ingress_controller_request_duration_seconds_count
- nginx_ingress_controller_request_duration_seconds_sum
- nginx_ingress_controller_request_size_sum
- nginx_ingress_controller_requests
- nginx_ingress_controller_response_duration_seconds_count
- nginx_ingress_controller_response_duration_seconds_sum
- nginx_ingress_controller_response_size_sum
- nginx_ingress_controller_ssl_expire_time_seconds
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
8.2.15 -

Ntp
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Ntp] Drift is too high | Drift is too high | Prometheus |
List of dashboards:
List of metrics:
8.2.16 -

Opa
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Opa gatekeeper] Too much time since the last audit | More than 120 seconds have passed since the last audit | Prometheus |
[Opa gatekeeper] Spike of violations | There were more than 30 violations | Prometheus |
List of dashboards:
List of metrics:
- gatekeeper_audit_duration_seconds_bucket
- gatekeeper_audit_last_run_time
- gatekeeper_constraint_template_ingestion_count
- gatekeeper_constraint_template_ingestion_duration_seconds_bucket
- gatekeeper_constraint_templates
- gatekeeper_constraints
- gatekeeper_request_count
- gatekeeper_request_duration_seconds_bucket
- gatekeeper_request_duration_seconds_count
- gatekeeper_violations
8.2.17 -

Php-fpm
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Php-Fpm] Percentage of instances low | Less than 75% of instances are up | Prometheus |
[Php-Fpm] Recently reboot | Instances have recently been rebooted | Prometheus |
[Php-Fpm] Limit of child process exceeded | The number of child processes has been exceeded | Prometheus |
[Php-Fpm] Reaching limit of queue process | The queue of pending requests is reaching its limit | Prometheus |
[Php-Fpm] Too slow requests processing | Requests are taking too much time to be processed | Prometheus |
List of dashboards:
List of metrics:
- kube_workload_status_desired
- phpfpm_accepted_connections
- phpfpm_active_processes
- phpfpm_idle_processes
- phpfpm_listen_queue
- phpfpm_listen_queue_length
- phpfpm_max_children_reached
- phpfpm_process_requests
- phpfpm_slow_requests
- phpfpm_start_since
- phpfpm_total_processes
- phpfpm_up
8.2.18 -

Portworx
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Portworx] No Quorum | Portworx No Quorum. | Prometheus |
[Portworx] Node Status Not OK | Portworx Node Status Not OK. | Prometheus |
[Portworx] Offline Nodes | Portworx Offline Nodes. | Prometheus |
[Portworx] Nodes Storage Full or Down | Portworx Nodes Storage Full or Down. | Prometheus |
[Portworx] Offline Storage Nodes | Portworx Offline Storage Nodes. | Prometheus |
[Portworx] Unhealthy Node KVDB | Portworx Unhealthy Node KVDB. | Prometheus |
[Portworx] Cache read hit rate is low | Portworx Cache read hit rate is low. | Prometheus |
[Portworx] Cache write hit rate is low | Portworx Cache write hit rate is low. | Prometheus |
[Portworx] High Read Latency In Disk | Portworx High Read Latency In Disk. | Prometheus |
[Portworx] High Write Latency In Disk | Portworx High Write Latency In Disk. | Prometheus |
[Portworx] Low Cluster Capacity | Portworx Low Cluster Capacity. | Prometheus |
[Portworx] Disk Full In 48H | Portworx Disk Full In 48H. | Prometheus |
[Portworx] Disk Full In 12H | Portworx Disk Full In 12H. | Prometheus |
[Portworx] Pool Status Not Online | Portworx Node Status Not Online. | Prometheus |
[Portworx] High Write Latency In Pool | Portworx High Write Latency In Pool. | Prometheus |
[Portworx] Pool Full In 48H | Portworx Pool Full In 48H. | Prometheus |
[Portworx] Pool Full In 12H | Portworx Pool Full In 12H. | Prometheus |
[Portworx] High Write Latency In Volume | Portworx High Write Latency In Volume. | Prometheus |
[Portworx] High Read Latency In Volume | Portworx High Read Latency In Volume. | Prometheus |
[Portworx] License Expiry | Portworx License Expiry. | Prometheus |
List of dashboards:
- Portworx Cluster
- Portworx Volumes
List of metrics:
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
- px_cluster_disk_available_bytes
- px_cluster_disk_total_bytes
- px_cluster_status_nodes_offline
- px_cluster_status_nodes_online
- px_cluster_status_nodes_storage_down
- px_cluster_status_quorum
- px_cluster_status_size
- px_cluster_status_storage_nodes_decommissioned
- px_cluster_status_storage_nodes_offline
- px_cluster_status_storage_nodes_online
- px_disk_stats_num_reads_total
- px_disk_stats_num_writes_total
- px_disk_stats_read_bytes_total
- px_disk_stats_read_latency_seconds
- px_disk_stats_used_bytes
- px_disk_stats_write_latency_seconds
- px_disk_stats_written_bytes_total
- px_kvdb_health_state_node_view
- px_network_io_received_bytes_total
- px_network_io_sent_bytes_total
- px_node_status_license_expiry
- px_node_status_node_status
- px_pool_stats_available_bytes
- px_pool_stats_flushed_bytes_total
- px_pool_stats_num_flushes_total
- px_pool_stats_num_writes
- px_pool_stats_status
- px_pool_stats_total_bytes
- px_pool_stats_write_latency_seconds
- px_pool_stats_written_bytes
- px_px_cache_read_hits
- px_px_cache_read_miss
- px_px_cache_write_hits
- px_px_cache_write_miss
- px_volume_attached
- px_volume_attached_state
- px_volume_capacity_bytes
- px_volume_currhalevel
- px_volume_halevel
- px_volume_read_bytes_total
- px_volume_read_latency_seconds
- px_volume_reads_total
- px_volume_replication_status
- px_volume_state
- px_volume_status
- px_volume_usage_bytes
- px_volume_write_latency_seconds
- px_volume_writes_total
- px_volume_written_bytes_total
8.2.19 -

Postgresql
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[PostgreSQL] Instance Down | PostgreSQL instance is unavailable | Prometheus |
[PostgreSQL] Low UpTime | The PostgreSQL instance has an uptime of less than 1 hour | Prometheus |
[PostgreSQL] Max Write Buffer Reached | Background writer stops because it reached the maximum write buffers | Prometheus |
[PostgreSQL] High WAL Files Archive Error Rate | High error rate in WAL files archiver | Prometheus |
[PostgreSQL] Low Available Connections | Low available network connections | Prometheus |
[PostgreSQL] High Response Time | High response time in at least one of the databases | Prometheus |
[PostgreSQL] Low Cache Hit Rate | Low cache hit rate | Prometheus |
[PostgreSQL] DeadLocks In Database | Deadlocks detected in database | Prometheus |
List of dashboards:
- Postgresql_DB_Golden_Signals
- Postgresql_Instance_Health
List of metrics:
- pg_database_size_bytes
- pg_locks_count
- pg_postmaster_start_time_seconds
- pg_replication_lag
- pg_settings_max_connections
- pg_settings_superuser_reserved_connections
- pg_stat_activity_count
- pg_stat_activity_max_tx_duration
- pg_stat_archiver_archived_count
- pg_stat_archiver_failed_count
- pg_stat_bgwriter_buffers_alloc
- pg_stat_bgwriter_buffers_backend
- pg_stat_bgwriter_buffers_checkpoint
- pg_stat_bgwriter_buffers_clean
- pg_stat_bgwriter_checkpoint_sync_time
- pg_stat_bgwriter_checkpoint_write_time
- pg_stat_bgwriter_checkpoints_req
- pg_stat_bgwriter_checkpoints_timed
- pg_stat_bgwriter_maxwritten_clean
- pg_stat_database_blk_read_time
- pg_stat_database_blks_hit
- pg_stat_database_blks_read
- pg_stat_database_conflicts_confl_deadlock
- pg_stat_database_conflicts_confl_lock
- pg_stat_database_deadlocks
- pg_stat_database_numbackends
- pg_stat_database_temp_bytes
- pg_stat_database_tup_deleted
- pg_stat_database_tup_fetched
- pg_stat_database_tup_inserted
- pg_stat_database_tup_returned
- pg_stat_database_tup_updated
- pg_stat_database_xact_commit
- pg_stat_database_xact_rollback
- pg_stat_user_tables_idx_scan
- pg_stat_user_tables_n_tup_hot_upd
- pg_stat_user_tables_seq_scan
- pg_up
Related blog posts:
8.2.20 -

Rabbitmq
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[RabbitMQ] Cluster Operator Unavailable Replicas | There are pods that are either running but not yet available, or pods that have not yet been created. | Prometheus |
[RabbitMQ] Insufficient Established Erlang Distribution Links | Insufficient established Erlang distribution links | Prometheus |
[RabbitMQ] Low Disk Watermark Predicted | The predicted free disk space in 24 hours from now is low | Prometheus |
[RabbitMQ] High Connection Churn | There is high connection churn | Prometheus |
[RabbitMQ] No MajorityOfNodesReady | Too many nodes are not ready | Prometheus |
[RabbitMQ] Persistent Volume Missing | There is at least one PVC not bound | Prometheus |
[RabbitMQ] Unroutable Messages | There were unroutable messages within the last 5 minutes in the RabbitMQ cluster | Prometheus |
[RabbitMQ] File Descriptors Near Limit | The file descriptors are near the limit | Prometheus |
[RabbitMQ] Container Restarts | Over the last 10 minutes a RabbitMQ container was restarted | Prometheus |
[RabbitMQ] TCP Sockets Near Limit | The TCP sockets are near the limit | Prometheus |
List of dashboards:
- Rabbitmq_Usage
- Rabbitmq_Overview
List of metrics:
- erlang_vm_dist_node_state
- kube_deployment_status_replicas_unavailable
- kube_pod_container_status_restarts_total
- kube_persistentvolumeclaim_status_phase
- kube_statefulset_replicas
- kube_statefulset_status_replicas_ready
- rabbitmq_build_info
- rabbitmq_channel_consumers
- rabbitmq_channel_get_ack_total
- rabbitmq_channel_get_empty_total
- rabbitmq_channel_get_total
- rabbitmq_channel_messages_acked_total
- rabbitmq_channel_messages_confirmed_total
- rabbitmq_channel_messages_delivered_ack_total
- rabbitmq_channel_messages_delivered_total
- rabbitmq_channel_messages_published_total
- rabbitmq_channel_messages_redelivered_total
- rabbitmq_channel_messages_unconfirmed
- rabbitmq_channel_messages_unroutable_dropped_total
- rabbitmq_channel_messages_unroutable_returned_total
- rabbitmq_channels
- rabbitmq_channels_closed_total
- rabbitmq_channels_opened_total
- rabbitmq_connections
- rabbitmq_connections_closed_total
- rabbitmq_connections_opened_total
- rabbitmq_disk_space_available_bytes
- rabbitmq_disk_space_available_limit_bytes
- rabbitmq_process_max_fds
- rabbitmq_process_max_tcp_sockets
- rabbitmq_process_open_fds
- rabbitmq_process_open_tcp_sockets
- rabbitmq_process_resident_memory_bytes
- rabbitmq_queue_messages_published_total
- rabbitmq_queue_messages_ready
- rabbitmq_queue_messages_unacked
- rabbitmq_queues
- rabbitmq_queues_created_total
- rabbitmq_queues_declared_total
- rabbitmq_queues_deleted_total
- rabbitmq_resident_memory_limit_bytes
8.2.21 -

Redis
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Redis] Low UpTime | Uptime of less than 1 hour in a redis instance | Prometheus |
[Redis] High Memory Usage | High memory usage | Prometheus |
[Redis] High Clients Usage | High client connections usage | Prometheus |
[Redis] High Response Time | Response time over 250ms | Prometheus |
[Redis] High Fragmentation Ratio | High fragmentation ratio | Prometheus |
[Redis] High Keys Eviction Ratio | High keys eviction ratio | Prometheus |
[Redis] Recurrent Rejected Connections | Recurrent rejected connections | Prometheus |
[Redis] Low Hit Ratio | Low keyspace hit ratio | Prometheus |
List of dashboards:
List of metrics:
- redis_blocked_clients
- redis_commands_duration_seconds_total
- redis_commands_processed_total
- redis_commands_total
- redis_config_maxclients
- redis_connected_clients
- redis_connected_slaves
- redis_connections_received_total
- redis_cpu_sys_children_seconds_total
- redis_cpu_sys_seconds_total
- redis_cpu_user_children_seconds_total
- redis_cpu_user_seconds_total
- redis_db_avg_ttl_seconds
- redis_db_keys
- redis_evicted_keys_total
- redis_expired_keys_total
- redis_keyspace_hits_total
- redis_keyspace_misses_total
- redis_mem_fragmentation_ratio
- redis_memory_max_bytes
- redis_memory_used_bytes
- redis_memory_used_dataset_bytes
- redis_memory_used_lua_bytes
- redis_memory_used_overhead_bytes
- redis_memory_used_scripts_bytes
- redis_net_input_bytes_total
- redis_net_output_bytes_total
- redis_pubsub_channels
- redis_pubsub_patterns
- redis_rdb_changes_since_last_save
- redis_rdb_last_save_timestamp_seconds
- redis_rejected_connections_total
- redis_slowlog_length
- redis_uptime_in_seconds
Related blog posts:
8.2.22 -

Sysdig-admission-controller
This integration is enabled by default.
List of alerts
Alert | Description | Format |
---|
[Sysdig Admission Controller] No K8s Audit Events Received | The Admission Controller is not receiving Kubernetes Audit events | Prometheus |
[Sysdig Admission Controller] K8s Audit Events Throttling | Kubernetes Audit events are being throttled | Prometheus |
[Sysdig Admission Controller] Scanning Events Throttling | Scanning events are being throttled | Prometheus |
[Sysdig Admission Controller] Inline Scanning Throttling | The inline scanning queue has not been empty for a long time | Prometheus |
[Sysdig Admission Controller] High Error Rate In Scan Status From Backend | High error rate in scan status responses from the backend | Prometheus |
[Sysdig Admission Controller] High Error Rate In Scan Report From Backend | High error rate in scan report responses from the backend | Prometheus |
[Sysdig Admission Controller] High Error Rate In Image Scan | High Error Rate In Image Scan | Prometheus |
List of dashboards:
- Sysdig_Admission_Controller
List of metrics:
- go_build_info
- go_gc_duration_seconds
- go_gc_duration_seconds_count
- go_gc_duration_seconds_sum
- go_goroutines
- go_memstats_buck_hash_sys_bytes
- go_memstats_gc_sys_bytes
- go_memstats_heap_alloc_bytes
- go_memstats_heap_idle_bytes
- go_memstats_heap_inuse_bytes
- go_memstats_heap_released_bytes
- go_memstats_heap_sys_bytes
- go_memstats_lookups_total
- go_memstats_mallocs_total
- go_memstats_mcache_inuse_bytes
- go_memstats_mcache_sys_bytes
- go_memstats_mspan_inuse_bytes
- go_memstats_mspan_sys_bytes
- go_memstats_next_gc_bytes
- go_memstats_stack_inuse_bytes
- go_memstats_stack_sys_bytes
- go_memstats_sys_bytes
- go_threads
- k8s_audit_ac_alerts_total
- k8s_audit_ac_events_processed_total
- k8s_audit_ac_events_received_total
- process_cpu_seconds_total
- process_max_fds
- process_open_fds
- queue_length
- scan_report_cache_hits
- scan_report_cache_misses
- scan_status_cache_hits
- scan_status_cache_misses
- scanner_scan_errors
- scanner_scan_report_error_from_backend_count
- scanner_scan_report_retrieved_from_backend_count
- scanner_scan_requests_already_queued
- scanner_scan_requests_error
- scanner_scan_requests_queued
- scanner_scan_status_error_from_backend_count
- scanner_scan_status_retrieved_from_backend_count
- scanner_scan_success
- scanning_ac_admission_responses_total
- scanning_ac_containers_processed_total
- scanning_ac_http_scanning_handler_requests_total
8.3 -
Custom Integrations for Sysdig Monitor
Prometheus Metrics
Describes how the Sysdig agent automatically collects metrics from services that expose native Prometheus metrics, as well as from applications with Prometheus exporters, how to set up your environment, and how to scrape Prometheus metrics seamlessly.
Java Management Extensions (JMX) Metrics
Describes how to configure your Java virtual machines so Sysdig
Agent can collect JMX metrics using the JMX protocol.
StatsD Metrics
Describes how the Sysdig agent collects custom StatsD metrics with
an embedded StatsD server.
Node.JS Metrics
Illustrates how Sysdig is able to monitor node.js applications by
linking a library to the node.js codebase.
8.3.1 -
Collect Prometheus Metrics
Sysdig supports collecting, storing, and querying Prometheus native metrics and labels. You can use Sysdig in the same way that you use
Prometheus and leverage Prometheus Query Language (PromQL) to create dashboards and alerts.
Sysdig is compatible with the Prometheus HTTP API, so you can query your monitoring data programmatically using PromQL and extend Sysdig to other platforms such as Grafana.
From a metric collection standpoint, a lightweight Prometheus server is embedded directly into the Sysdig agent to facilitate metric collection. It supports targets, instances, and jobs with filtering and relabeling using Prometheus syntax. You can configure the agent to identify the processes that expose Prometheus metric endpoints on its own host and send the metrics to the Sysdig collector for storage and further processing.

The Prometheus product itself does not necessarily have to be installed for Prometheus metrics
collection.
Agent Compatibility
See the Sysdig agent versions and compatibility with Prometheus features:
Sysdig Agent v12.2.0 and Above
The following features are enabled by default:
- Automatically scrape any Kubernetes pods with the following annotation set:
prometheus.io/scrape=true
- Automatically scrape applications supported by Monitoring Integrations.
For more information, see Set up the Environment.
Sysdig Agent Prior to v12.0.0
Manually enable Prometheus in the dragent.yaml file:
prometheus:
enabled: true
For more information, see Enable Promscrape V2 on Older Versions of Sysdig Agent .
Learn More
The following topics describe in detail how to set up the environment for service discovery, metrics collection, and further processing.
See the following blog posts for additional context on Prometheus metrics and how such metrics are typically used.
8.3.1.1 -
Set Up the Environment
If you are already leveraging Kubernetes Service Discovery, specifically the approach given in prometheus-kubernetes.yml, you might already have annotations attached to the pods that mark them as eligible for scraping. Such environments can quickly begin scraping the same metrics by using the Sysdig agent in a single step.
If you are not using Kubernetes Service Discovery, follow the instructions given below:
Annotation
Ensure that the Kubernetes pods that contain your Prometheus exporters have been deployed with the following annotations to enable scraping, substituting the listening exporter-TCP-port
:
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "exporter-TCP-port"
The configuration above assumes your exporters use the typical endpoint called /metrics
. If your exporter is using a different
endpoint, specify by adding the following additional annotation, substituting the exporter-endpoint-name
:
prometheus.io/path: "/exporter-endpoint-name"
Sample Exporter
Use the Sample Exporter to test your environment. You will quickly see auto-discovered Prometheus metrics being displayed on Sysdig Monitor. You can use this working example as a basis to similarly annotate your own exporters.
8.3.1.2 -
Enable Prometheus Native Service Discovery
Prometheus service discovery is a standard method of finding endpoints to scrape for metrics. You configure prometheus.yaml
and custom jobs to prepare for scraping endpoints in the same way you do for native Prometheus.
For metric collection, a lightweight Prometheus server, named promscrape, is embedded directly into the Sysdig agent. Promscrape supports filtering and relabeling targets, instances, and jobs, and identifies them using the custom jobs configured in the prometheus.yaml file. The latest versions of the Sysdig agent (above v12.0.0) by default identify the processes that expose Prometheus metric endpoints on their own host and send the metrics to the Sysdig collector for storage and further processing. On older versions of the Sysdig agent, you enable these features by configuring dragent.yaml.
Working with Promscrape
Promscrape is a lightweight Prometheus server that is embedded with the Sysdig agent. Promscrape scrapes metrics from Prometheus endpoints and sends them for storing and processing.
Promscrape has two versions: Promscrape V1 and Promscrape V2.
Promscrape V2
Promscrape itself discovers targets by using the standard Prometheus configuration (native Prometheus service discovery), allowing the use of relabel_configs
to find or modify targets. An instance of promscrape runs on every node that is running a Sysdig agent and is intended to collect metrics from local as well as remote targets specified in the prometheus.yaml
file. The prometheus.yaml
file you create is shared across all such nodes.
Promscrape V2 is enabled by default on Sysdig agent v12.5.0 and above. On older versions of Sysdig agent, you need to manually enable Promscrape V2, which allows for native Prometheus service discovery, by setting the prom_service_discovery
parameter to true
in dragent.yaml
.
Promscrape V1
Sysdig agent discovers scrape targets through the Sysdig process_filter
rules. For more information, see Process Filter.
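For illustration, here is a minimal sketch of a V1 process_filter rule in dragent.yaml; the port and path values are hypothetical, and the exact keys supported can vary by agent version, so check the Process Filter documentation before using it:
prometheus:
  enabled: true
  process_filter:
    # Hypothetical rule: scrape any process listening on port 8080,
    # reading metrics from the /metrics path on that port.
    - include:
        port: 8080
        conf:
          path: "/metrics"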
About Promscrape V2
Supported Features
Promscrape V2 supports the following native Prometheus capabilities:
Relabeling: Promscrape V2 supports the native Prometheus relabel_configs and metric_relabel_configs configurations.
Sample format: In addition to the regular sample format (metrics name, labels, and metrics reading), Promscrape V2 includes metrics type (counter, gauge, histogram, summary) to every sample sent to the agent.
Scraping configuration: Promscrape V2 supports all types of scraping configuration, such as federation, blackbox-exporter, and so on.
Label mapping: The metrics can be mapped to their source (pod, process) by using the source labels which in turn map certain Prometheus label names to the known agent tags.
Unsupported Features
Promscrape V2 does not support calculated metrics.
Promscrape V2 does not support cluster-wide features such as
recording rules and alert management.
Service discovery configurations in Promscrape V1 (process_filter
) and Promscrape V2 (prometheus.yaml
) are incompatible and non-translatable.
Promscrape V2 collects metrics from both local and remote targets specified in the prometheus.yaml file; configuring promscrape to scrape remote targets is therefore not recommended, because it results in duplicated metrics.
Promscrape V2 does not have a cluster-wide view, so it ignores the configuration of recording rules and alerts used in cluster-wide metrics collection; these Prometheus configurations are not supported.
Sysdig uses __HOSTNAME__
, which is not a standard Prometheus
keyword.
Enable Promscrape V2 on Older Versions of Sysdig Agent
To enable Prometheus native service discovery on agent versions prior to
11.2:
Open dragent.yaml
file.
Set the following Prometheus Service Discovery parameter to true:
prometheus:
prom_service_discovery: true
If true, promscrape.v2
is used. Otherwise, promscrape.v1
is
used to scrape the targets.
Restart the agent.
Create Custom Jobs
Prerequisites
Ensure the following features are enabled:
- Monitoring Integration
- Promscrape V2
If you are using Sysdig agent v12.0.0 or above, these features are enabled by default.
Prepare Custom Job
You set up custom jobs in the Prometheus configuration file to identify endpoints that expose Prometheus metrics. Sysdig agent uses these custom jobs to scrape endpoints by using promscrape, the lightweight Prometheus server embedded in it.
Guidelines
Ensure that targets are scraped only by the agent running on the same node as the target. You do this by adding the host-selection relabeling rules.
Use the Sysdig-specific relabeling rules to automatically apply the right workload labels.
Example Prometheus Configuration file
The prometheus.yaml
file comes with a default configuration for
scraping the pods running on the local node. This configuration also
includes the rules to preserve pod UID and container name labels for
further correlation with Kubernetes State Metrics or Sysdig native
metrics.
Here is an example prometheus.yaml
file that you can use to set up custom jobs.
global:
scrape_interval: 10s
scrape_configs:
- job_name: 'my_pod_job'
sample_limit: 40000
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Look for pod name starting with "my_pod_prefix" in namespace "my_namespace"
- action: keep
source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_name]
separator: /
regex: my_namespace/my_pod_prefix.+
# In those pods try to scrape from port 9876
- source_labels: [__address__]
action: replace
target_label: __address__
regex: (.+?)(\\:\\d)?
replacement: $1:9876
# Trying to ensure we only scrape local targets
# __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
# of all the active network interfaces on the host
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
Default Scrape Job
If Monitoring Integration is not enabled for you and you still want to automatically collect metrics from pods with the Prometheus annotations set (prometheus.io/scrape=true
), add the following default scrape job to your prometheus.yaml
file:
- job_name: 'k8s-pods'
sample_limit: 40000
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Trying to ensure we only scrape local targets
# __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
# of all the active network interfaces on the host
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
target_label: __metrics_path__
regex: (.+)
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
Default Prometheus Configuration File
Here is the default prometheus.yaml
file.
global:
scrape_interval: 10s
scrape_configs:
- job_name: 'k8s-pods'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Trying to ensure we only scrape local targets
# __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
# of all the active network interfaces on the host
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
target_label: __metrics_path__
regex: (.+)
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
Understand the Prometheus Settings
Scrape Interval
The default scrape interval is 10 seconds. However, the value can be
overridden per scraping job. The scrape interval configured in the
prometheus.yaml
is independent of the agent configuration.
Promscrape V2 reads prometheus.yaml
and initiates scraping jobs.
The metrics from targets are collected per scrape interval for each
target and immediately forwarded to the agent. The agent sends the
metrics every 10 seconds to the Sysdig collector. Only those metrics
that have been received since the last transmission are sent to the
collector. If a scraping job has a scrape interval longer than 10 seconds, the agent transmissions might not include all the metrics from that job.
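As a sketch, you can override the global interval for a single job in prometheus.yaml; the job name below is a hypothetical example:
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: 'slow-exporter'   # hypothetical job name
    scrape_interval: 30s        # per-job override of the 10s global default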
Hostname Selection
__HOSTIPS__
is replaced by the host IP addresses. Selection by the
host IP address is preferred because of its reliability.
__HOSTNAME__
is replaced with the actual hostname before promscrape
starts scraping the targets. This allows promscrape
to ignore targets
running on other hosts.
Relabeling Configuration
The default Prometheus configuration file contains the following two
relabeling configurations:
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
These rules add two
labels, sysdig_k8s_pod_uid
and sysdig_k8s_pod_container_name
to
every metric gathered from the local targets, containing pod ID and
container name respectively. These labels will be dropped from the
metrics before sending them to the Sysdig collector for further
processing.
Here is an example for setting up the prometheus.yaml
file using the agent configmap:
apiVersion: v1
data:
dragent.yaml: |
new_k8s: true
k8s_cluster_name: your-cluster-name
metrics_excess_log: true
10s_flush_enable: true
app_checks_enabled: false
use_promscrape: true
new_k8s: true
promscrape_fastproto: true
prometheus:
enabled: true
prom_service_discovery: true
log_errors: true
max_metrics: 200000
max_metrics_per_process: 200000
max_tags_per_metric: 100
ingest_raw: true
ingest_calculated: false
snaplen: 512
tags: role:cluster
prometheus.yaml: |
global:
scrape_interval: 10s
scrape_configs:
- job_name: 'haproxy-router'
basic_auth:
username: USER
password: PASSWORD
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Trying to ensure we only scrape local targets
# We need the wildcard at the end because in AWS the node name is the FQDN,
# whereas in Azure the node name is the base host name
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'default/router-1-.+'
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
kind: ConfigMap
metadata:
labels:
app: sysdig-agent
name: sysdig-agent
namespace: sysdig-agent
8.3.1.3 -
Migrating from Promscrape V1 to V2
Promscrape is the lightweight Prometheus server in the Sysdig agent. An updated version of promscrape, named Promscrape V2, is available. This behavior is controlled by the prom_service_discovery parameter in the dragent.yaml file. To use the latest features, such as Service Discovery and Monitoring Integrations, you need to have this option enabled in your environment.
Compare Promscrape V1 and V2
The main difference between V1 and V2 is how scrape targets are
determined.
In V1, targets are found through process-filtering rules configured in dragent.yaml or dragent.default.yaml (if no rules are given in dragent.yaml). The process-filtering rules are applied to all the
running processes on the host. Matches are made based on process
attributes, such as process name or TCP ports being listened to, as well
as associated contexts from docker or Kubernetes, such as container
labels or Kubernetes annotations.
With Promscrape V2, scrape targets are determined by scrape_configs
fields in a prometheus.yaml
file (or the prometheus-v2.default.yaml
file if no prometheus.yaml
exists). Because promscrape
is adapted
from the open-source Prometheus server, the scrape_config
settings are
compatible with the normal Prometheus configuration. Here is an example:
global:
scrape_interval: 10s
scrape_configs:
- job_name: 'my_pod_job'
sample_limit: 40000
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Look for pod name starting with "my_pod_prefix" in namespace "my_namespace"
- action: keep
  source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_name]
  separator: /
  regex: my_namespace/my_pod_prefix.+
- action: keep
source_labels: [__meta_kubernetes_pod_label_app]
regex: my_app_metrics
# In those pods try to scrape from port 9876
- source_labels: [__address__]
action: replace
target_label: __address__
regex: (.+?)(\\:\\d)?
replacement: $1:9876
# Trying to ensure we only scrape local targets
# __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
# of all the active network interfaces on the host
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
Migrate Using Default Configuration
The default configuration for Promscrape v1 triggers the scraping based
on standard Kubernetes pod annotations and container labels. The default
configuration for v2 currently triggers scraping only based on the
standard Kubernetes pod annotations leveraging the Prometheus native
service discovery.
Example Pod Annotations
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: ""
Annotation | Value | Description |
---|
prometheus.io/scrape | true | Required field. |
prometheus.io/port | The port number to scrape | Optional. It will scrape all pod-registered ports if omitted. |
prometheus.io/scheme | http or https | The default is http. |
prometheus.io/path | The URL path to scrape | The default is /metrics. |
Example Static Job
- job_name: 'static10'
static_configs:
- targets: ['localhost:5010']
Guidelines
Users running Kubernetes with Promscrape v1 default rules and
triggering scraping based on pod annotations need not take any
action to migrate to v2. The migration happens automatically.
Users operating non-Kubernetes environments might need to continue
using v1 for now, depending on how scraping is triggered. As of
today promscrape.v2
doesn’t support leveraging container and
Docker labels to discover Prometheus metrics endpoints. If your environment is one of these, define static jobs with the IP:port pairs to be scraped.
Migrate Using Custom Rules
If you are relying on custom process_filter rules to collect metrics, use standard Prometheus configuration syntax to scrape the endpoints. We recommend one of the following:
- Adopt the standard approach of adding the standard Prometheus annotations to your pods. For more information, see Migrate Using Default Configuration.
- Write a Prometheus scrape_config by using Kubernetes pod service discovery and use the appropriate pod metadata to trigger the scrapes.
See the examples below for converting your process_filter rules to Prometheus terminology.
process_filter rule (Promscrape V1) | relabel_config (Promscrape V2) |
---|
- include: kubernetes.pod.annotation.sysdig.com/test: true | - action: keep, source_labels: [__meta_kubernetes_pod_annotation_sysdig_com_test], regex: true |
- include: kubernetes.pod.label.app: sysdig | - action: keep, source_labels: [__meta_kubernetes_pod_label_app], regex: 'sysdig' |
- include: container.label.com.sysdig.test: true | Not supported. |
- include: process.name: test | Not supported. |
- include: process.cmdline: sysdig-agent | Not supported. |
- include: port: 8080 | - action: keep, source_labels: [__meta_kubernetes_pod_container_port_number], regex: '8080' |
- include: container.image: sysdig-agent | Not supported. |
- include: container.name: sysdig-agent | - action: keep, source_labels: [__meta_kubernetes_pod_container_name], regex: 'sysdig-agent' |
- include: appcheck.match: sysdig | Appchecks are not compatible with Promscrape V2. See Configure Monitoring Integrations for supported integrations. |
If you have any queries related to promscrape migration, contact Sysdig Support.
8.3.2 -
Integrate JMX Metrics from Java Virtual Machines
The Sysdig agent retrieves data from your Java virtual machines using
the JMX protocol. The agent is configured to automatically discover
active Java virtual machines and poll them for basic JVM metrics, such as heap memory and garbage collection, as well as application-specific metrics. Currently, the following applications are supported by default:
- ActiveMQ
- Cassandra
- Elasticsearch
- HBase
- Kafka
- Tomcat
- Zookeeper
The agent can also be easily configured to extract custom JMX metrics
coming from your own Java processes. Metrics extracted are shown in the
pre-defined Application views or under the Metrics > JVM
and JMX
menus.
The module java.management
must be loaded for the Sysdig agent to
collect both JVM and JMX metrics.
The default JMX metrics configuration is found in the
/opt/draios/etc/dragent.default.yaml
file. When customizing
existing entries, copy the complete application’s bean listing from that
defaults yaml file into the user settings file
/opt/draios/etc/dragent.yaml
. The Sysdig agent will merge
configurations of both files.
Java versions 7 - 10 are currently supported by the Sysdig agents.
For Java 11-14 you must be running minimum agent version 10.1.0 and must
run the app with the JMX
Remote
option.
Here is what your dragent.yaml file might look like for a customized
entry for the Spark application:
customerid: 07c948-your-key-here-006f3b
tags: local:nyc,service:db3
jmx:
per_process_beans:
spark:
pattern: "spark"
beans:
- query: "metrics:name=Spark shell.BlockManager.disk.diskSpaceUsed_MB"
attributes:
- name: VALUE
alias: spark.metric
Include the jmx:
and per_process_beans:
section headers at the
beginning of your application/bean list. For more information on adding
parameters to a container agent’s configuration file, see Understanding
the Agent Config Files.
Bean Configuration
Basic JVM metrics are pre-defined inside the default_beans:
section. This section is defined in the agent’s default settings file
and contains beans and attributes that are going to be polled for every
Java process, like memory and garbage collector usage:
jmx:
default_beans:
- query: "java.lang:type=Memory"
attributes:
- HeapMemoryUsage
- NonHeapMemoryUsage
- query: "java.lang:type=GarbageCollector,*"
attributes:
- name: "CollectionCount"
type: "counter"
- name: "CollectionTime"
type: "counter"
Metrics specific for each application are specified in sections named
after the applications. For example, this is the Tomcat section:
per_process_beans:
tomcat:
pattern: "catalina"
beans:
- query: "Catalina:type=Cache,*"
attributes:
- accessCount
- cacheSize
- hitsCount
- . . .
The key name, tomcat
in this case, will be displayed as a process name
in the Sysdig Monitor user interface instead of just java
. The
pattern:
parameter specifies a string that is used to match a java
process name and arguments with this set of JMX metrics. If the process
main class full name contains the given text, the process is tagged and
the metrics specified in the section will be fetched.
The class names are matched against the process argument list. If you
implement JMX metrics in a custom manner that does not expose the class
names on the command line, you will need to find a pattern which
conveniently matches your java invocation command line.
The beans:
section contains the list of beans to be queried, based
on JMX patterns. JMX patterns are explained in detail in the Oracle
documentation,
but in practice, the format of the query line is pretty simple: you can
specify the full name of the bean like java.lang:type=Memory
, or
you can fetch multiple beans in a single line using the wildcard *
as in: java.lang:type=GarbageCollector,*
(note that this is just a wildcard, not a regex).
To get the list of all the beans and attributes that your application exports, you can
use JVisualVM,
Jmxterm, JConsole or other
similar tools. Here is a screenshot from JConsole showing where to find
the namespace, bean and attribute (metric) information (JConsole is
available when you install the Java Development Kit):

For each query, you have to specify the attributes that you want to
retrieve; a new metric will be created for each of them. We support the following JMX attribute types (for these attributes, all the subattributes will be retrieved):
Attributes may be absolute values or rates. For absolute values, the agent calculates a per-second rate before sending them; in this case, specify type: counter. The default is rate, which can be omitted, so usually you can simply write the attribute name.
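For example, here is a hedged sketch mixing both attribute styles; the Threading bean and its ThreadCount and TotalStartedThreadCount attributes are standard JVM MBeans used purely for illustration:
beans:
  - query: "java.lang:type=Threading"
    attributes:
      # Point-in-time value; default type (rate), so only the name is needed
      - ThreadCount
      # Ever-increasing absolute value; the agent converts it to a per-second rate
      - name: TotalStartedThreadCount
        type: counter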
Limits
The total number of JMX metrics polled per host is limited to 500. The
maximum number of beans queried per process is limited to 300. If more
metrics are needed please contact your sales representative with your
use case.
In agents 0.46 and earlier, the limit was 100 beans for each process.
Aliases
JMX beans and attributes can have very long names. To avoid cluttering the interface, aliasing is supported: you can specify an alias in the attribute configuration. For example:
cassandra:
pattern: "cassandra"
beans:
- query: "org.apache.cassandra.db:type=StorageProxy
attributes:
- name: RecentWriteLatencyMicros
alias: cassandra.write.latency
- name: RecentReadLatencyMicros
alias: cassandra.read.latency
In this way the alias will be used in Sysdig Monitor instead of the raw
bean name. Aliases can be dynamic as well, getting data from the bean
name - useful where you use pattern bean queries. For example, here we
are using the attribute name
to create different metrics:
- query: "java.lang:type=GarbageCollector,*"
attributes:
- name: CollectionCount
type: counter
alias: jvm.gc.{name}.count
- name: CollectionTime
type: counter
alias: jvm.gc.{name}.time
This query will match multiple beans (All Garbage collectors) and the
metric name will reflect the name of the Garbage Collector. For example:
jvm.gc.ConcurrentMarkSweep.count
. The general syntax is {<bean_property_key>}. To list all bean properties, you can use a JMX explorer such as JVisualVM or Jmxterm.
To use these metrics in PromQL queries, add the jmx_ prefix and replace the dots (.) in the metric name with underscores (_). For example, the metric jvm.gc.ConcurrentMarkSweep.count becomes jmx_jvm_gc_ConcurrentMarkSweep_count in PromQL.
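As a minimal sketch, the renamed metric can then be used directly in a PromQL query, for example averaged across all reporting processes:
avg(jmx_jvm_gc_ConcurrentMarkSweep_count)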
Troubleshooting: Why Can’t I See Java (JMX) Metrics?
The Sysdig agent normally auto-discovers Java processes running on your
host and enables the JMX extensions for polling them.
JMX Remote
If your Java application is not discovered automatically by the agent,
try adding the following parameter on your application’s command line:
-Dcom.sun.management.jmxremote
For more information, see Oracle’s web page on monitoring using JMX
technology.
Java Versions
Java versions 7 - 10 are currently supported by the Sysdig agents.
For Java 11-14 you must be running minimum agent version 10.1.0 and must
run the app with the JMX
Remote
option.
Java-Based Applications and JMX Authentication
For Java-based applications (Cassandra, Elasticsearch, Kafka, Tomcat,
Zookeeper, etc.), the Sysdig agent requires the Java runtime
environment (JRE) to be installed to poll for metrics (beans).
The Sysdig agent does not support JMX authentication.
If the Docker-container-based Sysdig agent is installed, the JRE is
installed alongside the agent binaries and no further dependencies
exist. However, if you are installing the service-based agent
(non-container) and you do not see the JVM/JMX metrics reporting, your
host may not have the JRE installed or it may not be installed in the
expected location: /usr/bin/java.
To confirm if the Sysdig agent is able to find the JRE, restart the
agent with service dragent restart
and check the agent’s
/opt/draios/logs/draios.log
file for the two Java detection and
location log entries recorded during agent startup.
Example if Java is missing or not found:
2017-09-08 23:19:27.944, Information, java detected: false
2017-09-08 23:19:27.944, Information, java_binary:
Example if Java is found:
2017-09-08 23:19:27.944, Information, java detected: true
2017-09-08 23:19:27.944, Information, java_binary: /usr/bin/java
If Java is not installed, the resolution is to install the Java Runtime
Environment.
If your host has Java installed but not in the expected location (
/usr/bin/java
) you can install a symlink from /usr/bin/java
to the
actual binary OR set the java_home:
variable in the Sysdig agent’s
configuration file: /opt/draios/etc/dragent.yaml
java_home: /usr/my_java_location/
Disabling JMX Polling
If you do not need it or otherwise want to disable JMX metrics
reporting, you can add the following two lines to the agent’s user
settings configuration file /opt/draios/etc/dragent.yaml
:
jmx:
enabled: false
After editing the file, restart the native Linux agent via
service dragent restart
or restart the container agent to make the
change take effect.
If using our containerized agent, instead of editing the dragent.yaml
file, you can add this extra parameter in the docker run
command when
starting the agent:
-e ADDITIONAL_CONF="jmx:\n enabled: false\n"
8.3.3 -
Integrate StatsD Metrics
StatsD is an open-source project built
by Etsy. Using a StatsD library specific to your application’s language,
it allows for the easy generation and transmission of custom application
metrics to a collection server.
The Sysdig agent contains an embedded StatsD server, so your custom
metrics can now be sent to our collector and be relayed to the Sysdig
Monitor backend for aggregation. Your application metrics and the rich
set of metrics collected by our agent already can all be visualized in
the same simple and intuitive graphical interface. Configuring alert
notifications is also exactly the same.
Installation and Configuration
The StatsD server, embedded in the Sysdig agent beginning with version 0.1.136, is pre-configured and starts by default, so no additional user configuration is necessary. Install the agent in a supported
distribution directly or install the Docker containerized version in
your container server and you’re done.
Sending StatsD Metrics
Active Collection
By default, the Sysdig agent’s embedded StatsD collector listens on the standard StatsD port, 8125, both on TCP and UDP. StatsD is a text-based protocol where samples are separated by a newline (\n).
Sending metrics from your application to the collector is as simple as:
echo "hello_statsd:1|c" > /dev/udp/127.0.0.1/8125
The example transmits the counter metric "hello_statsd" with a value of 1 to the StatsD collector listening on UDP port 8125. Here is a
second example sending the output of a more complex shell command giving
the number of established network connections:
echo "EstablishedConnections:`netstat -a | grep ESTAB | wc -l`|c" > /dev/udp/127.0.0.1/8125
The protocol format is as follows:
METRIC_NAME:METRIC_VALUE|TYPE[|@SAMPLING_RATIO]
Metric names can be any string except the reserved characters: |#:@. The value is a number and depends on the metric type. The type can be any of: c, ms, g, s. The sampling ratio is a value between 0 (exclusive) and 1, used to handle subsampling. Once sent, metrics are available in the same display menus and subviews as the built-in metrics.
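For example, this hedged sketch sends a timer sample that is emitted only 50% of the time; the metric name is hypothetical:
echo "api.request_time:320|ms|@0.5" > /dev/udp/127.0.0.1/8125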
Passive Collection
In infrastructures already containing a third party StatsD collection
server, StatsD metrics can be collected “out of band”. A passive
collection technique is automatically performed by our agent by
intercepting system calls - as is done for all the Sysdig Monitor
metrics normally collected. This method does not require changing your
current StatsD configuration and is an excellent way to ’test drive’ the
Sysdig Monitor application without having to perform any modifications
other than agent installation.
The passive mode of collection is especially suitable for containerized
environments where simplicity and efficiency are essential. With the
containerized version of the Sysdig Monitor agent running on the host,
all other container applications can continue to transmit to any
currently implemented collector. In the case where no collector exists,
container applications can simply be configured to send StatsD metrics
to the localhost interface (127.0.0.1) as demonstrated above - no actual
StatsD server needs to be listening at that address.
Effectively, each network transmission made from inside the application
container, including StatsD messages sent to a nonexistent destination,
generates a system call. The Sysdig agent captures these system calls
from its own container, where the statsd collector is listening. In
practice, the Sysdig agent acts as a transparent proxy between the
application and the StatsD collector, even if they are in different
containers. The agent correlates which container a system call is coming
from, and uses that information to transparently label the StatsD
messages.

The above graphic demonstrates the components of the Sysdig agent and
where metrics are actively or passively collected. Regardless of the
method of collection, the number of StatsD metrics the agent can
transmit is limited by your payment plan.
Note 1: When using the passive technique, ICMP port unreachable
events may be generated on the host network.
Note 2: Some clients may use IPv6 addressing (::1) for the
“localhost” address string. Metrics collection over IPv6 is not
supported at this time. If your StatsD metrics are not visible in the
Sysdig Monitor interface, please use “127.0.0.1” instead of “localhost”
string to force IPv4. Another solution that may be required is adding
the JVM option: java.net.preferIPv4Stack=true.
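For example, assuming a hypothetical application jar, the option can be passed on the Java command line as follows:
java -Djava.net.preferIPv4Stack=true -jar myapp.jar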
Note 3: When StatsD metrics are not continuously transmitted by your
application (once per second as in the case of all agent created
metrics), the charts will render a ‘zero’ or null value. Any alert
conditions will only look at those StatsD values actually transmitted and ignore the nulls.
Supported Metric Types
Counter
A counter metric is updated with the value sent by the application, sent
to the Sysdig Monitor backend, and then reset to zero. You can use it to
count, for example, how many calls have been made to an API:
api.login:1|c
You can specify negative values to decrement a counter.
Gauge
A gauge is a single value that will be sent as is:
table_size:10000|g
These are plotted as received; in other words, they are point-in-time metrics. You can achieve relative increments or decrements on a gauge by prepending the value with a + or a - respectively. As an example,
these three samples will cause table_size
to be 950:
table_size:1000|g
table_size:-100|g
table_size:+50|g
In Sysdig Monitor, the gauge value is only rendered on the various
charts when it is actually transmitted by your application. When not
transmitted, a null is plotted on the charts which is not used in any
calculations or alerts.
Set
A set is like a counter, but it counts unique elements. For example:
active_users:user1|s active_users:user2|s active_users:user1|s
Will cause the value of active_users to be 2.
Metric Labels
Labels are an extension of the StatsD specification offered by Sysdig Monitor to provide better flexibility in the way metrics are grouped,
filtered and visualized. Labeling can be achieved by using the following
syntax:
enqueued_messages#az=eu-west-3,country=italy:10|c
In general, this is the syntax you can use for labeling:
METRIC_NAME#LABEL_NAME=LABEL_VALUE,LABEL_NAME ...
Labels can be simple strings or key/value pairs, separated by an =
sign. Simple labels can be used for filtering in the Sysdig Monitor web
interface. Key/value labels can be used for both filtering and
segmentation.
Label names prefixed with ‘agent.label’ are reserved for Sysdig agent
use only and any custom labels starting with that prefix will be
ignored.
Limits
The number of StatsD metrics the agent can transmit is limited to 1000
for the host and 1000 for all running containers combined. If more
metrics are needed please contact your sales representative with your
use case.
Collect StatsD Metrics Under Load
The Sysdig agent can reliably receive StatsD metrics from containers,
even while the agent is under load. This setting is controlled by the
use_forwarder
configuration parameter.
The Sysdig agent automatically parses and records StatsD metrics.
Historically, the agent parsed the system call stream from the kernel in
order to read and record StatsD metrics from containers. For performance
reasons, the agent may not be able to collect all StatsD metrics using
this mechanism if the load is high. For example, if the StatsD client
writes more than 2kB worth of StatsD metrics in a single system call,
the agent will truncate the StatsD message, resulting in loss of StatsD
metrics.
With the introduction of the togglable use_forwarder
option, the agent
can collect StatsD metrics even under high load.
This feature is introduced in Sysdig agent v0.90.1. As of agent v10.4.0,
the configuration is enabled by default.
statsd:
use_forwarder: true
To disable, set it to false:
statsd:
use_forwarder: false
When enabled, rather than use the system call stream for container
StatsD messages, the agent listens for UDP datagrams on the configured
StatsD port on the localhost within the container’s network namespace.
This enables the agent to reliably receive StatsD metrics from
containers, even while the agent is under load.
This option introduces a behavior change in the agent, both in the destination address and in the port settings:
- When the option is disabled, the agent reads StatsD metrics destined to any remote address; when it is enabled, the agent receives only the metrics addressed to localhost.
- When the option is disabled, the agent reads only the container StatsD messages destined to port 8125; when it is enabled, the agent uses the configured StatsD port.
StatsD Server Running in a Monitored Container
Using the forwarder is not a valid use case when a StatsD server is
running in the container that you are monitoring.
A StatsD server running in a container will already have a process bound
to port 8125 or a configured StatsD port, so you can’t use that port to
collect the metrics with the forwarder. A 10-second startup delay exists
in the detection logic to allow any custom StatsD process to bind to
that particular port before the forwarder. This ensures that the
forwarder does not interrupt the operation.
Therefore, for this particular use case, you will need to use the
traditional method. Disable the forwarder and capture the metrics via
the system call stream.
Compatible Clients
Every StatsD-compliant client works with our implementation. Client libraries are mentioned only as a reference; we do not support the clients themselves, only compliance with the protocol specification. A full list can be found on the StatsD GitHub page.
Turning Off StatsD Reporting
To disable Sysdig agent’s embedded StatsD server, append the following
lines to the /opt/draios/etc/dragent.yaml configuration file in each
installed host:
statsd:
enabled: false
Note that if Sysdig Secure is used, a compliance check is enabled by
default and it sends metrics via StatsD. When disabling StatsD, you need
to disable the compliance check as well.
security:
default_compliance_schedule: ""
After modifying the configuration file, you will need to restart the
agent with:
service dragent restart
Changing the StatsD Listener Port and Transport Protocol
To modify the port that the agent’s embedded StatsD server listens on,
append the following lines to the /opt/draios/etc/dragent.yaml
configuration file in each installed host (replace #### with your
port):
statsd:
tcp_port: ####
udp_port: ####
Characters Allowed For StatsD Metric Names
Use standard ASCII characters. We also suggest using dot-separated (.) namespaces, as we do for all our metrics.
Allowed characters: a-z A-Z 0-9 _ .
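For example, a dot-namespaced counter built only from the allowed characters (the name is hypothetical):
echo "myapp.checkout.orders_created:1|c" > /dev/udp/127.0.0.1/8125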
For more information on adding parameters to a container agent’s
configuration file, see /en/docs/installation/sysdig-agent/agent-configuration/understand-the-agent-configuration/.
8.3.4 -
Integrate Node.js Application Metrics
Sysdig is able to monitor node.js applications by linking a library to
the node.js code, which then creates a server in the code to export the
StatsD metrics.
The example below shows a node.js application that exports metrics using
the Prometheus protocol:
{
"name": "node-example",
"version": "1.0.0",
"description": "Node example exporting metrics via Prometheus",
"main": "index.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"license": "BSD-2-Clause",
"dependencies": {
"express": "^4.14.0",
"gc-stats": "^1.0.0",
"prom-client": "^6.3.0",
"prometheus-gc-stats": "^0.3.1"
}
}
The index.js
library function is shown below:
// Use express as HTTP middleware
// Feel free to use your own
var express = require('express')
var app = express()
// Initialize Prometheus exporter
const prom = require('prom-client')
const prom_gc = require('prometheus-gc-stats')
prom_gc()
// Sample HTTP route
app.get('/', function (req, res) {
res.send('Hello World!')
})
// Export Prometheus metrics from /metrics endpoint
app.get('/metrics', function(req, res) {
res.end(prom.register.metrics());
});
app.listen(3000, function () {
console.log('Example app listening on port 3000!')
})
To integrate an application:
Add an app check in the Dockerfile:
FROM node:latest
WORKDIR /app
ADD package.json ./
RUN npm install
ENV SYSDIG_AGENT_CONF 'app_checks: [{name: node, check_module: prometheus, pattern: {comm: node}, conf: { url: "http://localhost:{port}/metrics" }}]'
ADD index.js ./
ENTRYPOINT [ "node", "index.js" ]
Run the application:
user@host:~$ docker build -t node-example .
user@host:~$ docker run -d node-example
Once the Sysdig agent is deployed, node.js metrics will be automatically
retrieved. The image below shows an example of key node.js metrics
visible on the Sysdig Monitor UI:

8.4 -
Advanced Configuration for Monitoring Integrations
8.4.1 -
You can use dashboards and alerts for PersistentVolumeClaim (PVC) metrics in the regions where PVC metrics are supported.

To see data on PVC dashboards and alerts, ensure that the prerequisites are met.
Prerequisites
Apply Rules
If you are upgrading the Sysdig agent, either download sysdig-agent-clusterrole.yaml or apply the following rule to the ClusterRole associated with your Sysdig agent.
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
These rules are required to scrape the kubelet containers. With these rules enabled, you will also have the kubelet metrics and can access kubelet templates for both dashboards and alerts.
This configuration change is only required for agent upgrades because the sysdig-agent-clusterrole.yaml
associated with fresh installations will already have this configuration. See Steps for Kubernetes (Vanilla) for information on Sysdig agent installation.
Sysdig Agent v12.3.0 or Above
PVC metrics are enabled by default for Sysdig agent v12.3.0 or above. To disable collecting PVC metrics, add the following to the dragent.yaml
file:
k8s_extra_resources:
include:
- services
- resourcequotas
Sysdig Agent Prior to v12.3.0
Contact your Sysdig representative or Sysdig Support for technical assistance with enabling PVC metrics in your environment.
Upgrade Sysdig agent to v12.2.0 or above
If you are an existing Sysdig user, include the following configuration in the dragent.yaml
file:
k8s_extra_resources:
include:
- persistentvolumes
- persistentvolumeclaims
- storageclasses
Access PVC Dashboard Template
Log in to Sysdig Monitor and click Dashboards.
On the Dashboards slider, scroll down to locate Dashboard Templates.
Click Kubernetes to expand the Kubernetes dashboard templates.
Select the PVC and Storage dashboard.
Access PVC Alert Template
Log in to Sysdig Monitor and click Alerts.
On the Alerts page, click Library.
On the Library page, click All Templates.
Select the Kubernetes PVC alert templates.
PVC Metrics
Metrics | Metric Type | Labels | Metric Source |
---|
kube_persistentvolume_status_phase | Gauge | persistentvolume, phase | Kubernetes API |
kube_persistentvolume_claim_ref | Gauge | persistentvolume, name | Kubernetes API |
kube_storageclass_created | Gauge | storageclass | Kubernetes API |
kube_storageclass_info | Gauge | storageclass, provisioner, reclaim_policy, volume_binding_mode | Kubernetes API |
kube_storageclass_labels | Gauge | storageclass | Kubernetes API |
kube_pod_spec_volumes_persistentvolumeclaims_info | Gauge | namespace, pod, uid, volume, persistentvolumeclaim | Kubernetes API |
kube_pod_spec_volumes_persistentvolumeclaims_readonly | Gauge | namespace, pod, uid, volume, persistentvolumeclaim | Kubernetes API |
kube_persistentvolumeclaim_status_condition | Gauge | namespace, persistentvolumeclaim, type, status | Kubernetes API |
kube_persistentvolumeclaim_status_phase | Gauge | namespace, persistentvolumeclaim, phase | Kubernetes API |
kube_persistentvolumeclaim_access_mode | Gauge | namespace, persistentvolumeclaim, access_mode | Kubernetes API |
kubelet_volume_stats_inodes | Gauge | namespace, persistentvolumeclaim | Kubelet |
kubelet_volume_stats_inodes_free | Gauge | namespace, persistentvolumeclaim | Kubelet |
kubelet_volume_stats_inodes_used | Gauge | namespace, persistentvolumeclaim | Kubelet |
kubelet_volume_stats_used_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
kubelet_volume_stats_available_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
kubelet_volume_stats_capacity_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
storage_operation_duration_seconds_bucket | Gauge | operation_name, volume_plugin,le | Kubelet |
storage_operation_duration_seconds_sum | Gauge | operation_name, volume_plugin | Kubelet |
storage_operation_duration_seconds_count | Gauge | operation_name, volume_plugin | Kubelet |
storage_operation_errors_total | Gauge | operation_name, volume_plugin | Kubelet |
storage_operation_status_count | Gauge | operation_name, status, volume_plugin | Kubelet |
8.4.2 -
Integrate Keda for HPA
Sysdig supports Keda to deploy a Kubernetes Horizontal Pod Autoscaler (HPA) using custom metrics exposed by Sysdig Monitor. You do this by configuring Prometheus queries and endpoints in Keda. Keda uses that information to query your Prometheus server and create the HPA. The HPA will take care of scaling pods based on your usage of resources, such as CPU and memory.
This option replaces Sysdig’s existing custom metric server for HPA.
Install Keda
Requirements:
- Helm
- Keda v2.3 or above (Endpoint authentication)
Install Keda with helm by running the following command:
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace \
--set image.metricsApiServer.tag=2.4.0 --set image.keda.tag=2.4.0 \
--set prometheus.metricServer.enabled=true
Create Authentication for Sysdig Prometheus Endpoint
Do the following in each namespace where you want to use Keda. This example uses the keda namespace.
Create the secret with the API key as the bearer token:
kubectl create secret generic keda-prom-secret --from-literal=bearerToken=<API_KEY> -n keda
Create the triggerAuthentication.yaml
file:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: keda-prom-creds
spec:
secretTargetRef:
- parameter: bearerToken
name: keda-prom-secret
key: bearerToken
Apply the configuration in the triggerAuthentication.yaml file:
kubectl apply -n keda -f triggerAuthentication.yaml
You can configure HPA for a Deployment, StatefulSet, or CRD. Keda uses a CRD to configure the HPA. You create a ScaledObject
and it automatically sets up the metrics server and the HPA object under the hood.
To create a ScaledObject, specify the following:
- spec.scaleTargetRef.name: The unique name of the Deployment.
- spec.scaleTargetRef.kind: The kind of object to be scaled: Deployment, StatefulSet, or CustomResource.
- spec.minReplicaCount: The minimum number of replicas that the Deployment should have.
- spec.maxReplicaCount: The maximum number of replicas that the Deployment should have.
In the ScaledObject, use a trigger of type prometheus
to get the metrics from your Sysdig Monitor account. To do so, specify the following:
- triggers.metadata.serverAddress: The address of the Prometheus endpoint. It is the Sysdig Monitor URL with the /prometheus path appended. For example: https://app.sysdigcloud.com/prometheus.
- triggers.metadata.query: The PromQL query that will return a value. Ensure that the query returns a single-element vector or scalar response.
- triggers.metadata.metricName: The name of the metric that will be created in the Kubernetes API endpoint, /apis/external.metrics.k8s.io/v1beta1.
- triggers.metadata.threshold: The threshold that will be used to scale the Deployment.
Ensure that you add the authModes
and authenticationRef
to the trigger.
Check the ScaledObject
. Here is an example of a ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: keda-web
spec:
scaleTargetRef:
kind: Deployment
name: web
minReplicaCount: 1
maxReplicaCount: 4
triggers:
- type: prometheus
metadata:
serverAddress: https://app.sysdigcloud.com/prometheus
metricName: sysdig_container_cpu_cores_used
query: sum(sysdig_container_cpu_cores_used{kube_cluster_name="my-cluster-name", kube_namespace_name="keda", kube_workload_name = "web"} * 10)
threshold: "5"
authModes: "bearer"
authenticationRef:
name: keda-prom-creds
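Assuming you saved the example above as scaledobject.yaml (the file name is illustrative), you can apply it in the keda namespace and confirm that Keda created the underlying HPA object:
kubectl apply -f scaledobject.yaml -n keda
kubectl get scaledobject,hpa -n keda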
The HPA divides the value of the metric by the number of current replicas; therefore, avoid using the AVERAGE aggregation. Use SUM instead to aggregate the metrics by workload. For example, if the sum of the values across all the pods is 100 and there are 5 replicas, the HPA will calculate the value of the metric as 20.
Advanced Configurations
The ScaledObject
permits additional options:
spec.pollingInterval
:
Specify the interval to check each trigger on. By default KEDA will check each trigger source on every ScaledObject every 30 seconds.
Warning: setting this to a low value will cause Keda to make frequent API calls to the Prometheus endpoint. The minimum value for pollingInterval
is 10 seconds. The scraping frequency of the Sysdig Agent is 10 seconds.
spec.cooldownPeriod
:
The wait period between the last active trigger reported and scaling the resource back to 0. By default the value is 5 minutes (300 seconds).
spec.idleReplicaCount
:
Enabling this property allows KEDA to scale the resource down to the specified number of replicas. If some activity exists on the target triggers, KEDA will scale the target resource immediately to the value of minReplicaCount
and scaling is handed over to HPA. When there is no activity, the target resource is again scaled down to the value specified by idleReplicaCount
. This setting must be less than minReplicaCount
.
spec.fallback
:
This property allows you to define the number of replicas to apply if consecutive connection errors happen with the Prometheus endpoint of your Sysdig account.
- spec.fallback.failureThreshold: The number of consecutive errors after which the fallback is applied.
- spec.fallback.replicas: The number of replicas to apply in case of connection errors.
spec.advanced.horizontalPodAutoscalerConfig.behavior
:
This property allows you to define the behavior of the Kubernetes HPA Object. See the Kubernetes documentation for more information.
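As an illustration only, the following sketch combines several of these advanced options in a single ScaledObject; the numeric values are examples, not recommendations:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: keda-web
spec:
  scaleTargetRef:
    kind: Deployment
    name: web
  minReplicaCount: 1
  maxReplicaCount: 4
  pollingInterval: 30      # check the trigger every 30 seconds (minimum 10)
  cooldownPeriod: 300      # wait 5 minutes before scaling back down
  fallback:
    failureThreshold: 3    # after 3 consecutive query errors...
    replicas: 2            # ...fall back to 2 replicas
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 120
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://app.sysdigcloud.com/prometheus
      metricName: sysdig_container_cpu_cores_used
      query: sum(sysdig_container_cpu_cores_used{kube_namespace_name="keda", kube_workload_name="web"} * 10)
      threshold: "5"
      authModes: "bearer"
    authenticationRef:
      name: keda-prom-creds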
Learn More
8.4.3 -
Prometheus Recording Rules
Sysdig now supports Prometheus recording rules for metric aggregation and querying.
You can configure recording rules by using the Sysdig API. Ensure that you define them in a Prometheus compatible way. The mandatory parameters are:
- record: The unique name of the time series. It must be a valid metric name.
- expr: The PromQL expression to evaluate. In each evaluation cycle, the given expression is evaluated and the result is recorded as a new set of time series with the metric name specified in record.
- labels: The labels to add or overwrite before storing the result.
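For reference, a rule built from these parameters follows the shape of a standard Prometheus recording rule. The sketch below is illustrative only; the metric name and labels are examples, and the exact payload accepted by the Sysdig API may differ:
groups:
- name: example-recording-rules
  rules:
  - record: kube_workload_name:sysdig_container_cpu_cores_used:sum
    expr: sum by (kube_workload_name) (sysdig_container_cpu_cores_used)
    labels:
      team: platform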
To enable this feature in your environment, contact Sysdig Support.
8.4.4 -
Integrate with Grafana
Sysdig enables Grafana users to query metrics from Sysdig and visualize
them in Grafana dashboards. In order to integrate Sysdig with Grafana,
you configure a data source. There are two types of data sources
supported:
Prometheus
Prometheus data source comes with Grafana and is natively compatible
with PromQL. Sysdig provides a Prometheus-compatible API to achieve
API-only integration with Grafana.
Sysdig
The Sysdig data source requires additional settings and uses a simpler, “form-based” configuration. It uses the Sysdig native API instead of the Prometheus API. See Sysdig Grafana datasource for more information.
Using the Prometheus API on Grafana v6.7 and Above
You use the Sysdig Prometheus API to set up the data source for Grafana. Before Grafana can consume Sysdig metrics, Grafana must authenticate itself to Sysdig. To do so, set up HTTP authentication by passing the Sysdig API token in a custom header, because no dedicated UI support is currently available in Grafana.
If you are not already running Grafana, spin up a Grafana container as follows:
$ docker run --rm -p 3000:3000 --name grafana grafana/grafana
Login to Grafana as administrator and create a new datasource by
using the following information:

URL: https://<Monitor URL for Your Region>/prometheus
See SaaS Regions and IP
Ranges and identify
the correct URLs associated with your Sysdig application and
region.
Authentication: Do not select any authentication mechanisms.
Access: Server (default)
Custom HTTP Headers:
Header: Enter Authorization
Value: Enter Bearer, followed by a space and <Your Sysdig API Token>
The API token is available through Settings > User Profile > Sysdig Monitor API.
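To verify the URL and token before configuring Grafana, you can issue a standard Prometheus API query against the endpoint. The region URL and metric below are examples; substitute your own Monitor URL and API token:
curl -s -H "Authorization: Bearer <Your Sysdig API Token>" \
  "https://app.sysdigcloud.com/prometheus/api/v1/query?query=sysdig_container_cpu_cores_used"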
Using the Grafana API on Grafana v6.6 and Below
The feature requires Grafana v5.3.0 or above.
You use the Grafana API to set up the Sysdig datasource.
Download and run Grafana in a container.
docker run --rm -p 3000:3000 --name grafana grafana/grafana
Create a JSON file.
cat grafana-stg-ds.json
{
"name": "Sysdig staging PromQL",
"orgId": 1,
"type": "prometheus",
"access": "proxy",
"url": "https://app-staging.sysdigcloud.com/prometheus",
"basicAuth": false,
"withCredentials": false,
"isDefault": false,
"editable": true,
"jsonData": {
"httpHeaderName1": "Authorization",
"tlsSkipVerify": true
},
"secureJsonData": {
"httpHeaderValue1": "Bearer your-Sysdig-API-token"
}
}
Get your Sysdig API token and plug it into the JSON file above.
"httpHeaderValue1": "Bearer your_Sysdig_API_Token"
Add the datasource to Grafana.
curl -u admin:admin -H "Content-Type: application/json" http://localhost:3000/api/datasources -XPOST -d @grafana-stg-ds.json
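Optionally, confirm that the data source was registered by listing the data sources through the same Grafana API, again with the default admin credentials:
curl -u admin:admin http://localhost:3000/api/datasources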
Open Grafana at http://localhost:3000.
Use the default credentials, admin:admin, to sign in to Grafana.
Open the Data Source tab under Configuration in Grafana and confirm that the data source you added is listed on the page.

8.5 -
Troubleshoot Monitoring Integrations
Review the common troubleshooting scenarios you might encounter while getting a Monitor integration working, and see what you can do if an integration does not report metrics after installation.
Check Prerequisites
Some integrations require secrets and other resources to be available in the correct namespace in order to work. Integrations such as database exporters might require you to create a user and grant it special permissions in the database so the exporter can connect to the endpoint and generate metrics.
Ensure that the prerequisites of the integration are met before proceeding with installation.
Verify Exporter Is Running
If the integration is an exporter, ensure that the pods corresponding to the exporter are running correctly. You can check this after installing the integration. If the exporter is installed as a sidecar of the application (such as Nginx), verify that the exporter container is added to the pod.
You can check the status of the pods with the Kubernetes dashboard Pods Status and Performance
or with the following command:
kubectl get pods --namespace=<namespace>
Additionally, if the container has problems and cannot start, check the description of the pod for error messages:
kubectl describe pod <pod-name> --namespace=<namespace>
Verify Metrics Are Generated
Check whether a running exporter is generating metrics by accessing the metrics endpoint:
kubectl port-forward <pod-name> <pod-port> <local-port> --namespace=<namespace>
curl http://localhost:<local-port>/metrics
This is also valid for applications that don’t need an exporter to generate their own metrics.
If the exporter is not generating metrics, there could be problems accessing or authenticating with the application. Check the logs associated with the pods:
kubectl logs <pod-name> --namespace=<namespace>
If the application is instrumented and is not generating metrics, check if the Prometheus metrics option or the module is activated.
Verify Sysdig Agent Is Scraping Metrics
If an application doesn’t need an exporter to generate metrics, check if it has the default Prometheus annotations.
Additionally, you can check if the Sysdig agent can access the metrics endpoint. To do so, use the following command:
kubectl exec <sysdig-agent-pod-name> --namespace=sysdig-agent -- /bin/sh -c "curl http://<exporter-pod-ip>:<pod-port>/metrics"
Select the Sysdig agent pod running on the same node as the pod to be scraped.
8.5.1 -
Monitor Log Files
You can search for particular strings within a given log file, and
create a metric that is displayed in Sysdig Monitor’s Explore page. The
metrics appear under the StatsD section:

Sysdig provides this functionality via a “chisel” script called
“logwatcher”, written in Lua. You call the script by adding a
logwatcher
parameter in the chisels
section of the agent
configuration file (dragent.yaml
). You define the log file name and
the precise string to be searched. The results are displayed as metrics
in the Monitor UI.
Caveats
The logwatcher chisel adds to Sysdig’s monitoring capability but is not
a fully featured log monitor. Note the following limitations:
No regex support: Sysdig does not offer regex support; you must
define the precise log file and string to be searched.
(If you were to supply a string with spaces, forward-slashes, or
back-slashes in it, the metric generated would also have these
characters and so could not be used to create an alert.)
Limit of 12 string searches/host: Logwatcher is implemented as a
Lua script and, due to the resources consumed by this chisel, it is not
recommended to have more than a dozen string searches configured per
agent/host.
Implementation
Edit the agent configuration file to enable the logwatcher chisel. See
Understanding the Agent
Config Files for editing
options.
Preparation
Determine the log file name(s) and string(s) you want to monitor.
To monitor the output of docker logs <container-name>,
find the
container’s docker log file with:
docker inspect <container-name> | grep LogPath
Edit dragent.yaml
Access dragent.yaml
directly at /opt/draios/etc/dragent.yaml
.
Add a chisels entry:
Format:
chisels:
- name: logwatcher
args:
filespattern: YOURFILENAME.log
term: YOURSTRING
Sample Entry:
customerid: 831f2-your-key-here-d69401
tags: tagname.tagvalue
chisels:
- name: logwatcher
args:
filespattern: draios.log
term: Sent
In this example, Sysdig’s own draios.log
is searched for the
Sent
string.
The output, in the Sysdig Monitor UI, would show the StatsD metric
logwatcher.draios_log.Sent
and the number of ‘Sent’ items
detected.
Optional: Add multiple - name: logwatcher entries in the chisels section of the config file to search for additional logs/strings.
Note the recommended 12-string/agent limit.
Restart the agent for changes to take effect.
For container agent:
docker restart sysdig-agent
For non-containerized (service) agent:
service dragent restart
Parameters
Name | Value | Description |
---|
name | logwatcher | The chisel used in the enterprise Sysdig platform to search log files. (Other chisels are available in Sysdig’s open-source product.) |
filespattern | YOURFILENAME.log | The log file to be searched. Do not specify a path with the file name. |
term | YOURSTRING | The string to be searched. |
View Log File Metrics in the Monitor UI
To view logwatcher results:
Log in to Sysdig Monitor and select Explore
.
Select Entire Infrastructure > Overview by Host.
In the resulting drop-down, either scroll to
Metrics > StatsD > logwatcher
or enter “logwatcher
” in the
search field.
Each string you configured in the agent config file will be listed
in the format logwatcher.YOURFILENAME_log.STRING.

The relevant metrics are displayed.
You can also Add an
Alert
on logwatcher metrics, to be notified when an important log entry
appears.
8.6 -
(Legacy) Integrations for Sysdig Monitor
Integrate metrics with Sysdig Monitor from a number of platforms,
orchestrators, and a wide range of applications. Sysdig collects metrics
from Prometheus, JMX, StatsD, Kubernetes, and many application stacks to
provide a 360-degree view of your infrastructure. Many metrics are
collected by default out of the box; you can also extend the integration
or create custom metrics.
Key Benefits
Collects the richest data set for cloud-native visibility and
security
Polls data and auto-discovers context in order to provide operational and security insights
Extends the power of Prometheus metrics with additional insights from other metric types and the infrastructure stack
Integrates Prometheus alerts and events for Kubernetes monitoring needs
Exposes application metrics using Java JMX and MBeans monitoring
Key Integrations
Inbound
Prometheus Metrics
Describes how Sysdig Agent enables automatically collecting metrics
from Prometheus exporters, how to set up your environment, and
scrape Prometheus metrics from local as well as remote hosts.
Java Management Extensions (JMX) Metrics
Describes how to configure your Java virtual machines so Sysdig
Agent can collect JMX metrics using the JMX protocol.
StatsD Metrics
Describes how the Sysdig agent collects custom StatsD metrics with
an embedded StatsD server.
Node.js Metrics
Illustrates how Sysdig monitors Node.js applications by linking a library to the Node.js codebase.
Integrate Applications
Describes the monitoring capabilities of Sysdig agent with
application check scripts or ‘app checks’.
Monitor Log Files
Learn how to search a string by using the chisel script called logwatcher.
AWS CloudWatch
Illustrates how to configure Sysdig to collect various types of CloudWatch metrics.
Agent Installation
Learn how to install Sysdig agents on supported platforms.
Outbound
Notification Channels
Learn how to add, edit, or delete a variety of notification channel types, and how to disable or delete notifications when they are not needed, for example, during scheduled downtime.
S3 Capture Storage
Learn how to configure Sysdig to use an AWS S3 bucket or custom S3 storage for storing Capture files.
For Sysdig instances deployed on IBM Cloud Monitoring with
Sysdig, an additional form of metrics
collection is offered: Platform metrics. Rather than being collected by
the Sysdig agent, when enabled, Platform metrics are reported to Sysdig
directly by the IBM Cloud infrastructure.
Enable this feature by logging in to the IBM Cloud console and selecting “Enable” for IBM Platform metrics under the Configure your resource section when creating a new IBM Cloud Monitoring with Sysdig instance, as described here.
8.6.1 -
(Legacy) Collect Prometheus Metrics
Sysdig supports collecting, storing, and querying Prometheus native
metrics and labels. You can use Sysdig in the same way that you use
Prometheus and leverage
Prometheus Query Language (PromQL) to create dashboards and alerts.
Sysdig is compatible with the Prometheus HTTP API, so you can query your monitoring data programmatically using PromQL and extend Sysdig to other platforms such as Grafana.
From a metric collection standpoint, a lightweight Prometheus server is
directly embedded into the Sysdig agent to facilitate metric collection.
This also supports targets, instances, and jobs with filtering and
relabeling using Prometheus syntax. You can configure the agent to identify processes that expose Prometheus metric endpoints on its own host and send the metrics to the Sysdig collector for storage and further processing.

This document uses metric and time series interchangeably. The
description of configuration parameters refers to “metric”, but in
strict Prometheus terms, those imply time series. That is, applying a
limit of 100 metrics implies applying a limit on time series, where all
the time series data might not have the same metric name.
The Prometheus product
itself does not necessarily have to be installed for Prometheus metrics
collection.
See the Sysdig agent versions and compatibility with Prometheus features:
Latest versions of the agent (v12.0.0 and above): The following features are enabled by default:
- Automatically scrape any Kubernetes pods with the following annotation set: prometheus.io/scrape=true
- Automatically scrape applications supported by Monitoring Integrations.
Sysdig agent prior to v12.0.0: Manually enable Prometheus in dragent.yaml
file:
prometheus:
enabled: true
Learn More
The following topics describe in detail how to configure the Sysdig agent for service discovery, metrics collection, and further processing.
See the following blog posts for additional context on Prometheus metrics and how they are typically used.
8.6.1.1 -
(Legacy) Working with Prometheus Metrics
The Sysdig agent uses its visibility into all running processes (at both the host and container levels) to find eligible targets for scraping Prometheus metrics. By default, no scraping is attempted. Once the feature is enabled, the agent assembles a list of eligible targets, applies filtering rules, and sends the resulting metrics back to the Sysdig collector.
Latest Prometheus Features
Sysdig agent v12.0 or above is required for the following capabilities:
Sysdig agent v10.0 or above is required for the following capabilities:
New capabilities of using Prometheus data:
Ability to visualize data using PromQL queries. See Using
PromQL.
Create alerts from PromQL-based Dashboards. See Create Panel
Alerts.
Backward compatibility for dashboards v2 and alerts.
The new PromQL data cannot be visualized by using the Dashboard
v2 Histogram. Use time-series based visualization for the
histogram metrics.
New metrics limit per agent
10-second data granularity
Higher retention rate on the new metric store.
Prerequisites and Guidelines
Sysdig agent v10.0.0 and above is required for the latest Prometheus features.
The Prometheus feature is enabled in the dragent.yaml file:
prometheus:
enabled: true
See Setting up the Environment for more information.
The endpoints of the target should be available on a TCP connection
to the agent. The agent scrapes a target, remote or local, specified
by the IP:Port
or the URL
in dragent.yaml
.
Service Discovery
To use native Prometheus service discovery, enable Promscrape V2 as described in Enable Prometheus Native Service Discovery. This section covers the Sysdig approach to service discovery, which involves configuring process filters in the Sysdig agent.
The way service discovery works in the Sysdig agent differs from that of
the Prometheus
server.
While the Prometheus server has built-in integration with several
service discovery mechanisms and the prometheus.yml
file to read the
configuration settings from, the Sysdig agent auto-discovers any process
(exporter or instrumented) that matches the specifications in the
dragent.yaml
file, and instructs the embedded lightweight Prometheus
server to retrieve the metrics from it.
The lightweight Prometheus server in the agent is named promscrape
and
is controlled by the flag of the same name in the dragent.yaml
file.
See Configuring Sysdig
Agent for more information.
Unlike the Prometheus server that can scrape processes running on all
the machines in a cluster, the agent can scrape only those processes
that are running on the host that it is installed on.
Within the set of eligible processes/ports/endpoints, the agent scrapes
only the ports that are exporting Prometheus metrics and will stop
attempting to scrape or retry on ports based on how they respond to
attempts to connect and scrape them. It is therefore strongly
recommended that you create a configuration that restricts the process
and ports for attempted scraping to the minimum expected range for your
exporters. This minimizes the potential for unintended side-effects in
both the Agent and your applications due to repeated failed connection
attempts.
The end-to-end metric collection process can be summarized as follows:
A process is determined to be eligible for possible scraping if it
positively matches against a series of Process Filter
include/exclude rules. See Process Filter
for more information.
The Agent will then attempt to scrape an eligible process at a
/metrics
endpoint on all of its listening TCP ports unless the
additional configuration is present to restrict scraping to a subset
of ports and/or another endpoint name.
Upon receiving the metrics, the agent applies its configured filtering and limit rules before sending them to the Sysdig collector.
The metrics ultimately appear in the Sysdig Monitor Explore interface in
the Prometheus section.

8.6.1.2 -
(Legacy) Set Up the Environment
Quick Start For Kubernetes Environments
Prometheus users who are already leveraging Kubernetes Service
Discovery
(specifically the approach in this sample
prometheus-kubernetes.yml)
may already have Annotations attached to the Pods that mark them as
eligible for scraping. Such environments can quickly begin scraping the
same metrics using the Sysdig Agent in a couple of easy steps.
Enable the Prometheus metrics feature in the Sysdig Agent. Assuming
you are deploying using
DaemonSets,
the needed config can be added to the Agent’s dragent.yaml
by
including the following in your DaemonSet YAML (placing it in the
env
section for the sysdig-agent
container):
- name: ADDITIONAL_CONF
value: "prometheus:\n enabled: true"
Ensure the Kubernetes Pods that contain your Prometheus exporters
have been deployed with the following Annotations to enable scraping
(substituting the listening exporter-TCP-port)
:
spec:
template:
metadata:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "exporter-TCP-port"
The configuration above assumes your exporters use the typical
endpoint called /metrics
. If an exporter is using a different
endpoint, this can also be specified by adding the following
additional optional Annotation, substituting the
exporter-endpoint-name
:
prometheus.io/path: "/exporter-endpoint-name"
If you try this Kubernetes Deployment of a simple
exporter,
you will quickly see auto-discovered Prometheus metrics being displayed
in Sysdig Monitor. You can use this working example as a basis to
similarly Annotate your own exporters.
If you have Prometheus exporters not deployed in annotated Kubernetes
Pods that you would like to scrape, the following sections describe the
full set of options to configure the Agent to find and scrape your
metrics.
Quick Start for Container Environments
In order for Prometheus scraping to work in a Docker-based container
environment, set the following labels to the application containers,
substituting <exporter-port
> and <exporter-path
> with
the correct port and path where metrics are exported by your
application:
io.prometheus.scrape=true
io.prometheus.port=<exporter-port>
io.prometheus.path=<exporter-path>
For example, if mysqld-exporter
is to be scraped, spin up the
container as follows:
docker run -d -l io.prometheus.scrape=true -l io.prometheus.port=9104 -l io.prometheus.path=/metrics mysqld-exporter
8.6.1.3 -
(Legacy) Configuring Sysdig Agent
This feature is not supported with Promscrape V2. For information on different versions of Promscrape and migrating to the latest version, see Migrating from Promscrape V1 to V2.
As is typical for the agent, the default configuration for the feature is specified in dragent.default.yaml
, and you can override the defaults by configuring parameters in the dragent.yaml
. For each parameter you do not set in dragent.yaml
, the defaults in dragent.default.yaml
will remain in effect.
Main Configuration Parameters
Parameter | Default | Description |
---|
prometheus | See below | Turns Prometheus scraping on and off. |
process_filter | See below | Specifies which processes may be eligible for scraping. See [Process Filter](/en/docs/sysdig-monitor/monitoring-integrations/legacy-integrations/legacycollect-prometheus-metrics/configuring-sysdig-agent/#process-filter). |
use_promscrape | See below | Determines whether to use promscrape for scraping Prometheus metrics. |
promscrape
Promscrape is a lightweight Prometheus server that is embedded with the
Sysdig agent. The use_promscrape
parameter controls whether to use it
to scrape Prometheus endpoints.
Parameter | Default | Description |
---|
use_promscrape | true | Promscrape has two versions: Promscrape V1 and Promscrape V2. With V1, the Sysdig agent discovers scrape targets through the process_filter rules. With V2, promscrape itself discovers targets by using the standard Prometheus configuration, allowing the use of relabel_configs to find or modify targets. |
prometheus
The prometheus
section defines the behavior related to Prometheus
metrics collection and analysis. It allows you to turn the feature on, set a limit on the number of metrics to be scraped on the agent side, and determine whether to report histogram metrics and log failed scrape attempts.
Parameter | Default | Description |
---|
enabled | false | Turns Prometheus scraping on and off. |
interval | 10 | How often (in seconds) the agent will scrape a port for Prometheus metrics. |
prom_service_discovery | true | Enables native Prometheus service discovery. If disabled, promscrape.v1 is used to scrape the targets. See Enable Prometheus Native Service Discovery. On agent versions prior to 11.2, the default is false. |
max_metrics | 1000 | The maximum number of total Prometheus metrics that will be scraped across all targets. This value of 1000 is the maximum per-agent, and is a separate limit from other custom metrics (for example, StatsD, JMX, and app checks). |
timeout | 1 | The amount of time the agent will wait while scraping a Prometheus endpoint before timing out. The default value is 1 second. As of agent v10.0, this parameter is only used when promscrape is disabled. Since promscrape is now the default, timeout can be considered deprecated; however, it is still used when you explicitly disable promscrape. |
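Putting these parameters together, a minimal dragent.yaml override might look like the following sketch; the values shown are illustrative, and any parameter you omit keeps its default from dragent.default.yaml:
prometheus:
  enabled: true
  interval: 10
  max_metrics: 1000
  prom_service_discovery: true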
Process Filter
The process_filter
section specifies which of the processes known
by an agent may be eligible for scraping.
Note that once you specify a process_filter
in your
dragent.yaml
, this replaces the entire Prometheus
process_filter
section (i.e. all the rules) shown in the
dragent.default.yaml
.
The Process Filter is specified in a series of include
and
exclude
rules that are evaluated top-to-bottom for each process
known by an Agent. If a process matches an include
rule, scraping
will be attempted via a /metrics
endpoint on each listening TCP
port for the process, unless a conf
section also appears within
the rule to further restrict how the process will be scraped. See
conf for more information.
Multiple patterns can be specified in a single rule, in which case all
patterns must match for the rule to be a match (AND logic).
Within a pattern value, simple “glob” wildcarding may be used, where
*
matches any number of characters (including none) and ?
matches any single character. Note that due to YAML syntax, when using
wildcards, be sure to enclose the value in quotes ("*"
).
The table below describes the supported patterns in Process Filter
rules. To provide realistic examples, we’ll use a simple sample
Prometheus
exporter (source
code
here)
which can be deployed as a container using the Docker command line
below. To help illustrate some of the configuration options, this sample
exporter presents Prometheus metrics on /prometheus
instead of the
more common /metrics
endpoint, which will be shown in the example
configurations further below.
# docker run -d -p 8080:8080 \
--label class="exporter" \
--name my-java-app \
luca3m/prometheus-java-app
# ps auxww | grep app.jar
root 11502 95.9 9.2 3745724 753632 ? Ssl 15:52 1:42 java -jar /app.jar --management.security.enabled=false
# curl http://localhost:8080/prometheus
...
random_bucket{le="0.005",} 6.0
random_bucket{le="0.01",} 17.0
random_bucket{le="0.025",} 51.0
...
container.image
| Matches if the process is running inside a container running the specified image | - include:
container.image: luca3m/prometheus-java-app
|
container.name
| Matches if the process is running inside a container with the specified name | - include:
container.name: my-java-app
|
container.label.*
| Matches if the process is running in a container that has a Label matching the given value | - include:
container.label.class: exporter
|
kubernetes.<object>.annotation.* kubernetes.<object>.label.*
| Matches if the process is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with the Annotation/Label matching the given value. Note: This pattern does not apply to the Docker-only command-line shown above, but would instead apply if the exporter were installed as a Kubernetes Deployment using this example YAML. Note: See Kubernetes Objects, below, for information on the full set of supported Annotations and Labels. | - include:
kubernetes.pod.annotation.prometheus.io/scrape: true
|
process.name
| Matches the name of the running process | - include:
process.name: java
|
process.cmdline
| Matches a command line argument | - include:
process.cmdline: "*app.jar*"
|
port
| Matches if the process is listening on one or more TCP ports. The pattern for a single rule can specify a single port as shown in this example, or a single range (e.g.8079-8081 ), but does not support comma-separated lists of ports/ranges. Note: This parameter is only used to confirm if a process is eligible for scraping based on the ports on which it is listening. For example, if a process is listening on one port for application traffic and has a second port open for exporting Prometheus metrics, it would be possible to specify the application port here (but not the exporting port), and the exporting port in the conf section (but not the application port), and the process would be matched as eligible and the exporting port would be scraped. | - include:
port: 8080
|
appcheck.match
| Matches if an Application Check with the specific name or pattern is scheduled to run for the process. | - exclude:
appcheck.match: "*"
|
Instead of the include examples shown above, each of which would have matched our process, the following very strict configuration, which combines multiple patterns in a single rule as previously described, would also have matched:
- include:
container.image: luca3m/prometheus-java-app
container.name: my-java-app
container.label.class: exporter
process.name: java
process.cmdline: "*app.jar*"
port: 8080
conf
Each include rule in the process_filter may include a
conf
portion that further describes how scraping will be attempted
on the eligible process. If a conf
portion is not included,
scraping will be attempted at a /metrics
endpoint on all listening
ports of the matching process. The possible settings:
port
| Either a static number for a single TCP port to be scraped, or a container/Kubernetes Label name or Kubernetes Annotation specified in curly braces. If the process is running in a container that is marked with this Label or is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with this Annotation/Label, scraping will be attempted only on the port specified as the value of the Label/Annotation. Note: The Label/Annotation name to match against does not include the container.label. or kubernetes.<object>.annotation. prefix shown in the example. Note: See Kubernetes Objects for information on the full set of supported Annotations and Labels. Note: If running the exporter inside a container, this should specify the port number that the exporter process in the container is listening on, not the port that the container exposes to the host. | port: 8080
- or - port: "{container.label.io.prometheus.port}"
- or - port: "{kubernetes.pod.annotation.prometheus.io/port}"
|
port_filter
| A set of include and exclude rules that define the ultimate set of listening TCP ports for an eligible process on which scraping may be attempted. Note that the syntax is different from the port pattern option from within the higher-level include rule in the process_filter . Here a given rule can include single ports, comma-separated lists of ports (enclosed in square brackets), or contiguous port ranges (without brackets). | port_filter:
- include: 8080 - exclude: [9092,9200,9300] - include: 9090-9100
|
path
| Either the static specification of an endpoint to be scraped, or a container/Kubernetes Label name or Kubernetes Annotation specified in curly braces. If the process is running in a container that is marked with this Label or is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with this Annotation/Label, scraping will be attempted via the endpoint specified as the value of the Label/Annotation. If path is not specified, or is specified but the Agent does not find the Label/Annotation attached to the process, the common Prometheus exporter default of /metrics will be used. Note: The Label/Annotation name to match against does not include the container.label. or kubernetes.<object>.annotation. prefix shown in the example. Note: See Kubernetes Objects for information on the full set of supported Annotations and Labels. | path: "/prometheus"
- or - path: "{container.label.io.prometheus.path}"
- or - path: "{kubernetes.pod.annotation.prometheus.io/path}"
|
host
| A hostname or IP address. The default is localhost. | host: 192.168.1.101
- or -
host: subdomain.example.com
- or -
host: localhost
|
use_https
| When set to true , connectivity to the exporter will only be attempted through HTTPS instead of HTTP. It is false by default. (Available in Agent version 0.79.0 and newer) | use_https: true
|
ssl_verify
| When set to true , verification will be performed for the server certificates for an HTTPS connection. It is false by default. Verification was enabled by default before 0.79.0. (Available in Agent version 0.79.0 and newer) | ssl_verify: true
|
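As a sketch of how these conf settings combine, the following rule restricts scraping of the sample Java exporter used on this page to one port and its custom path; the image, port, and path are the example values introduced above:
prometheus:
  enabled: true
  process_filter:
    - include:
        container.image: luca3m/prometheus-java-app
        conf:
          port: 8080
          path: "/prometheus"
          # use_https: true   # uncomment if the exporter serves metrics over TLS
          # ssl_verify: true  # verify the server certificate when using HTTPS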
Authentication Integration
As of agent version 0.89, Sysdig can collect Prometheus metrics from
endpoints requiring authentication. Use the parameters below to enable
this function.
For username/password authentication: username, password
For authentication using a token: auth_token_path
For certificate authentication with a certificate key: auth_cert_path, auth_key_path
Token substitution is also supported for all the authorization
parameters. For instance a username can be taken from a Kubernetes
annotation by specifying
username: "{kubernetes.service.annotation.prometheus.openshift.io/username}"
conf Authentication Example
Below is an example of the dragent.yaml
section showing all the
Prometheus authentication configuration options, on OpenShift,
Kubernetes, and etcd.
In this example:
The username/password
are taken from a default annotation used by
OpenShift.
The auth token
path is commonly available in Kubernetes
deployments.
The certificate
and key
used here for etcd may normally not be
as easily accessible to the agent. In this case they were extracted
from the host namespace, constructed into Kubernetes secrets, and
then mounted into the agent container.
prometheus:
enabled: true
process_filter:
- include:
port: 1936
conf:
username: "{kubernetes.service.annotation.prometheus.openshift.io/username}"
password: "{kubernetes.service.annotation.prometheus.openshift.io/password}"
- include:
process.name: kubelet
conf:
port: 10250
use_https: true
auth_token_path: "/run/secrets/kubernetes.io/serviceaccount/token"
- include:
process.name: etcd
conf:
port: 2379
use_https: true
auth_cert_path: "/run/secrets/etcd/client-cert"
auth_key_path: "/run/secrets/etcd/client-key"
Kubernetes Objects
As described above, there are multiple configuration options that can be
set based on auto-discovered values for Kubernetes Labels and/or
Annotations. The format in each case begins with
"kubernetes.OBJECT.annotation."
or "kubernetes.OBJECT.label."
where
OBJECT
can be any of the following supported Kubernetes object types:
daemonSet
deployment
namespace
node
pod
replicaSet
replicationController
service
statefulset
The configuration text you add after the final dot becomes the name of
the Kubernetes Label/Annotation that the Agent will look for. If the
Label/Annotation is discovered attached to the process, the value of
that Label/Annotation will be used for the configuration option.
Note that there are multiple ways for a Kubernetes Label/Annotation to
be attached to a particular process. One of the simplest examples of
this is the Pod-based approach shown in Quick Start For Kubernetes
Environments.
However, as an example alternative to marking at the Pod level, you
could attach Labels/Annotations at the Namespace level, in which case
auto-discovered configuration options would apply to all processes
running in that Namespace regardless of whether they’re in a Deployment,
DaemonSet, ReplicaSet, etc.
8.6.1.4 -
(Legacy) Filtering Prometheus Metrics
As of Sysdig agent v9.8.0, a lightweight Prometheus server named promscrape is embedded in the agent, and a prometheus.yaml file is included as part of the configuration files. Leveraging open source Prometheus capabilities, Sysdig allows you to filter Prometheus metrics at the source before ingestion. To do so, you will:
Ensure that the Prometheus scraping is enabled in the
dragent.yaml
file.
prometheus:
enabled: true
On agent v9.8.0 and above, enable the feature by setting the
use_promscrape
parameter to true in the dragent.yaml
. See
Enable Filtering at
Ingestion.
Edit the configuration in the prometheus.yaml
file. See Edit
Prometheus Configuration
File.
Sysdig-specific configuration is found in the prometheus.yaml
file.
Enable Filtering at Ingestion
On agent v9.8.0, in order for target filtering to work, the
use_promscrape
parameter in the dragent.yaml
must be set to true.
For more information on configuration, see Configuring Sysdig
Agent.
use_promscrape: true
On agent v10.0 and above, use_promscrape is enabled by default, which implies that promscrape is used for scraping Prometheus metrics.
Filtering configuration is optional. The absence of prometheus.yaml
will not change the existing behavior of the agent.
Edit Prometheus Configuration File
About the Prometheus Configuration File
The prometheus.yaml
file contains
mostly the filtering/relabeling
configuration in a list of key-value pairs, representing target process
attributes.
You replace keys and values with the desired tags corresponding to your
environment.
In this file, you will configure the following:
The prometheus.yaml
file is installed alongside dragent.yaml
. For
the most part, the syntax of prometheus.yaml
complies with the
standard Prometheus
configuration.
Default Configuration
A configuration with empty key-value pairs is considered a default
configuration. The default configuration will be applied to all the
processes to be scraped that don’t have a matching filtering
configuration. In Sample Prometheus Configuration
File,
the job_name: 'default'
section represents the default configuration.
Kubernetes Environments
If the agent runs in Kubernetes environments (Open
Source/OpenShift/GKE), include the following Kubernetes objects as
key-value pairs. See Agent Install:
Kubernetes for details on
agent installation.
For example:
sysdig_sd_configs:
- tags:
namespace: backend
deployment: my-api
In addition to the aforementioned tags, any of these object types can be
matched against:
daemonset: my_daemon
deployment: my_deployment
hpa: my_hpa
namespace: my_namespace
node: my_node
pod: my_pod
replicaset: my_replica
replicationcontroller: my_controller
resourcequota: my_quota
service: my_service
stateful: my_statefulset
For Kubernetes/OpenShift/GKE deployments, prometheus.yaml
shares the
same ConfigMap with dragent.yaml
.
Docker Environments
In Docker environments, include attributes such as container, host,
port, and more. For example:
sysdig_sd_configs:
- tags:
host: my-host
port: 8080
For Docker-based deployments, prometheus.yaml
can be mounted from the
host.
Sample Prometheus Configuration File
global:
scrape_interval: 20s
scrape_configs:
- job_name: 'default'
sysdig_sd_configs: # default config
relabel_configs:
- job_name: 'my-app-job'
sample_limit: 2000
sysdig_sd_configs: # apply this filtering config only to my-app
- tags:
namespace: backend
deployment: my-app
metric_relabel_configs:
# Drop all metrics starting with http_
- source_labels: [__name__]
regex: "http_(.+)"
action: drop
# Drop all metrics for which the city label equals atlantis
- source_labels: [city]
regex: "atlantis"
action: drop
8.6.1.5 -
(Legacy) Example Configuration
This topic introduces you to default and specific Prometheus
configurations.
Default Configuration
As an example that pulls together many of the configuration elements
shown above, consider the default Agent configuration that’s inherited
from the dragent.default.yaml
.
prometheus:
enabled: true
interval: 10
log_errors: true
max_metrics: 1000
max_metrics_per_process: 100
max_tags_per_metric: 20
# Filtering processes to scan. Processes not matching a rule will not
# be scanned
# If an include rule doesn't contain a port or port_filter in the conf
# section, we will scan all the ports that a matching process is listening to.
process_filter:
- exclude:
process.name: docker-proxy
- exclude:
container.image: sysdig/agent
# special rule to exclude processes matching configured prometheus appcheck
- exclude:
appcheck.match: prometheus
- include:
container.label.io.prometheus.scrape: "true"
conf:
# Custom path definition
# If the Label doesn't exist we'll still use "/metrics"
path: "{container.label.io.prometheus.path}"
# Port definition
# - If the Label exists, only scan the given port.
# - If it doesn't, use port_filter instead.
# - If there is no port_filter defined, skip this process
port: "{container.label.io.prometheus.port}"
port_filter:
- exclude: [9092,9200,9300]
- include: 9090-9500
- include: [9913,9984,24231,42004]
- exclude:
container.label.io.prometheus.scrape: "false"
- include:
kubernetes.pod.annotation.prometheus.io/scrape: true
conf:
path: "{kubernetes.pod.annotation.prometheus.io/path}"
port: "{kubernetes.pod.annotation.prometheus.io/port}"
- exclude:
kubernetes.pod.annotation.prometheus.io/scrape: false
Consider the following about this default configuration:
All Prometheus scraping is disabled by default. To enable the entire
configuration shown here, you would only need to add the following
to your dragent.yaml
:
prometheus:
enabled: true
Once this option is enabled, any pods (in the case of Kubernetes) that have the right annotations set, or containers (otherwise) that have the labels set, will automatically be scraped.
Once enabled, this default configuration is ideal for the use case
described in the Quick Start For Kubernetes
Environments.
A Process Filter rule excludes processes that are likely to exist in
most environments but are known to never export Prometheus metrics,
such as the Docker Proxy and the Agent itself.
Another Process Filter rule ensures that any processes configured to
be scraped by the legacy Prometheus application check will not be
scraped.
Another Process Filter rule is tailored to use container Labels.
Processes marked with the container Label io.prometheus.scrape
will become eligible for scraping, and if further marked with
container Labels io.prometheus.port
and/or
io.prometheus.path
, scraping will be attempted only on this
port and/or endpoint. If the container is not marked with the
specified path Label, scraping the /metrics
endpoint will be
attempted. If the container is not marked with the specified port
Label, any listening ports in the port_filter
will be
attempted for scraping (this port_filter
in the default is set
for the range of ports for common Prometheus
exporters,
with exclusions for ports in the range that are known to be used by
other applications that are not exporters).
The final Process Filter Include rule is tailored to the use case
described in the Quick Start For Kubernetes
Environments.
Scrape a Single Custom Process
If you need to scrape a single custom process, for instance, a java
process listening on port 9000 with path /prometheus
, add the
following to the dragent.yaml
:
prometheus:
enabled: true
process_filter:
- include:
process.name: java
port: 9000
conf:
# ensure we only scrape port 9000 as opposed to all ports this process may be listening to
port: 9000
path: "/prometheus"
This configuration overrides the default process_filter
section shown
in Default Configuration.
You can add relevant rules from the default configuration to this to
further filter down the metrics.
port
has different purposes depending on where it’s placed in the
configuration. When placed under the include
section, it is a
condition for matching the include rule.
Placing a port
under conf
indicates that only that particular port
is scraped when the rule is matched as opposed to all the ports that the
process could be listening on.
In this example, the first rule will be matched for the Java process listening on port 9000, and only port 9000 will be scraped.
Scrape a Single Custom Process Based on Container Labels
If you still want to scrape based on container labels, you could just
append the relevant rules from the defaults to the process_filter
. For
example:
prometheus:
enabled: true
process_filter:
- include:
process.name: java
port: 9000
conf:
# ensure we only scrape port 9000 as opposed to all ports this process may be listening to
port: 9000
path: "/prometheus"
- exclude:
process.name: docker-proxy
- include:
container.label.io.prometheus.scrape: "true"
conf:
path: "{container.label.io.prometheus.path}"
port: "{container.label.io.prometheus.port}"
port
has a different meaning depending on where it’s placed in the
configuration. When placed under the include
section, it’s a condition
for matching the include rule.
Placing port
under conf
indicates that only that port is scraped
when the rule is matched as opposed to all the ports that the process
could be listening on.
In this example, the first rule will be matched for the process listening on port 9000, and only port 9000 of the Java process will be scraped.
Container Environment
With this default configuration enabled, a containerized install of our
example exporter shown below would be automatically scraped via the
Agent.
# docker run -d -p 8080:8080 \
--label io.prometheus.scrape="true" \
--label io.prometheus.port="8080" \
--label io.prometheus.path="/prometheus" \
luca3m/prometheus-java-app
Kubernetes Environment
In a Kubernetes-based environment, a Deployment with the Annotations as
shown in this example
YAML
would be scraped by enabling the default configuration.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: prometheus-java-app
spec:
replicas: 1
template:
metadata:
labels:
app: prometheus-java-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/prometheus"
prometheus.io/port: "8080"
spec:
containers:
- name: prometheus-java-app
image: luca3m/prometheus-java-app
imagePullPolicy: Always
Non-Containerized Environment
This is an example of a non-containerized environment or a containerized
environment that doesn’t use Labels or Annotations. The following
dragent.yaml
would override the default and do per-second scrapes
of our sample exporter and also a second exporter on port 5005, each at
their respective non-standard endpoints. This can be thought of as a
conservative “whitelist” type of configuration since it restricts
scraping to only exporters that are known to exist in the environment
and the ports on which they’re known to export Prometheus metrics.
prometheus:
enabled: true
interval: 1
process_filter:
- include:
process.cmdline: "*app.jar*"
conf:
port: 8080
path: "/prometheus"
- include:
port: 5005
conf:
port: 5005
path: "/wacko"
port
has a different meaning depending on where it’s placed in the
configuration. When placed under the include
section, it’s a condition
for matching the include rule. Placing port
under conf
indicates
that only that port is scraped when the rule is matched as opposed to
all the ports that the process could be listening on.
In this example, the first rule will be matched for the *app.jar* process, and only port 8080 will be scraped, as opposed to all the ports that *app.jar* could be listening on. The second rule will be matched for port 5005, and only port 5005 of the matching process will be scraped.
8.6.1.6 -
(Legacy) Logging and Troubleshooting
Logging
After the Agent begins scraping Prometheus metrics, there may be a delay
of up to a few minutes before the metrics become visible in Sysdig
Monitor. To help quickly confirm your configuration is correct, starting
with Agent version 0.80.0, the following log line will appear in the
Agent log the first time since starting that it has found and is
successfully scraping at least one Prometheus exporter:
2018-05-04 21:42:10.048, 8820, Information, 05-04 21:42:10.048324 Starting export of Prometheus metrics
As this is an INFO level log message, it will appear in Agents using the
default logging settings. To reveal even more detail, increase the Agent log level to DEBUG, which
produces a message like the following that reveals the name of a
specific metric first detected. You can then look for this metric to be
visible in Sysdig Monitor shortly after.
2018-05-04 21:50:46.068, 11212, Debug, 05-04 21:50:46.068141 First prometheus metrics since agent start: pid 9583: 5 metrics including: randomSummary.95percentile
Troubleshooting
See the previous section for information on expected log messages during
successful scraping. If you have enabled Prometheus and are not seeing
the Starting export
message shown there, revisit your
configuration.
It is also suggested to leave the configuration option in its default
setting of log_errors: true
, which will reveal any issues
scraping eligible processes in the Agent log.
For example, here is an error message for a failed scrape of a TCP port
that was listening but not accepting HTTP requests:
2017-10-13 22:00:12.076, 4984, Error, sdchecks[4987] Exception on running check prometheus.5000: Exception('Timeout when hitting http://localhost:5000/metrics',)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, Traceback (most recent call last):
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/sdchecks.py", line 246, in run
2017-10-13 22:00:12.076, 4984, Error, sdchecks, self.check_instance.check(self.instance_conf)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 44, in check
2017-10-13 22:00:12.076, 4984, Error, sdchecks, metrics = self.get_prometheus_metrics(query_url, timeout, "prometheus")
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 105, in get_prometheus_metrics
2017-10-13 22:00:12.077, 4984, Error, sdchecks, raise Exception("Timeout when hitting %s" % url)
2017-10-13 22:00:12.077, 4984, Error, sdchecks, Exception: Timeout when hitting http://localhost:5000/metrics
Here is an example error message for a failed scrape of a port that was
responding to HTTP requests on the /metrics
endpoint but not
responding with valid Prometheus-format data. The invalid endpoint is
responding as follows:
# curl http://localhost:5002/metrics
This ain't no Prometheus metrics!
And the corresponding error message in the Agent log, indicating no
further scraping will be attempted after the initial failure:
2017-10-13 22:03:05.081, 5216, Information, sdchecks[5219] Skip retries for Prometheus error: could not convert string to float: ain't
2017-10-13 22:03:05.082, 5216, Error, sdchecks[5219] Exception on running check prometheus.5002: could not convert string to float: ain't
8.6.1.7 -
(Legacy) Collecting Prometheus Metrics from Remote Hosts
This feature is not supported with Promscrape V2. For information on different versions of Promscrape and migrating to the latest version, see Migrating from Promscrape V1 to V2.
Sysdig Monitor can collect Prometheus metrics from remote endpoints with
minimum configuration. Remote endpoints (remote hosts) refer to hosts
where the Sysdig Agent cannot be deployed. For example, a Kubernetes master node on managed Kubernetes services such as GKE and EKS, where user workloads, and therefore Agents, cannot be deployed.
Enabling remote scraping on such hosts is as simple as identifying an
Agent to perform the scraping and declaring the endpoint configurations
with a remote services section in the Agent configuration file.
The collected Prometheus metrics are reported under and associated with
the Agent that performed the scraping as opposed to associating them
with a process.
Preparing the Configuration File
Multiple Agents can share the same configuration. Therefore, determine which of those Agents scrapes the remote endpoints with the dragent.yaml file. This is applicable to both Kubernetes and container environments.
Create a separate configuration section for remote services in the
Agent configuration file under the prometheus
configuration.
Include a configuration section for each remote endpoint, and add
either a URL or host/port (and an optional path) parameter to each
section to identify the endpoint to scrape. The optional path
identifies the resource at the endpoint. An empty path parameter
defaults to the "/metrics"
endpoint for scraping.
Optionally, add custom tags for each endpoint configuration for
remote services. In the absence of tags, metric reporting might not
work as expected when multiple endpoints are involved. Agents cannot
distinguish similar metrics scraped from multiple endpoints unless
those metrics are uniquely identified by tags.
To help you get started, an example configuration for Kubernetes is
given below:
prometheus:
remote_services:
- prom_1:
kubernetes.node.annotation.sysdig.com/region: europe
kubernetes.node.annotation.sysdig.com/scraper: true
conf:
url: "https://xx.xxx.xxx.xy:5005/metrics"
tags:
host: xx.xxx.xxx.xy
service: prom_1
scraping_node: "{kubernetes.node.name}"
- prom_2:
kubernetes.node.annotation.sysdig.com/region: india
kubernetes.node.annotation.sysdig.com/scraper: true
conf:
host: xx.xxx.xxx.yx
port: 5005
use_https: true
tags:
host: xx.xxx.xxx.yx
service: prom_2
scraping_node: "{kubernetes.node.name}"
- prom_3:
kubernetes.pod.annotation.sysdig.com/prom_3_scraper: true
conf:
url: "{kubernetes.pod.annotation.sysdig.com/prom_3_url}"
tags:
service: prom_3
scraping_node: "{kubernetes.node.name}"
- haproxy:
kubernetes.node.annotation.yourhost.com/haproxy_scraper: true
conf:
host: "mymasternode"
port: 1936
path: "/metrics"
username: "{kubernetes.node.annotation.yourhost.com/haproxy_username}"
password: "{kubernetes.node.annotation.yourhost.com/haproxy_password}"
tags:
service: router
In the above example, scraping is triggered by node and pod annotations.
You can add annotations to nodes and pods by using the
kubectl annotate
command as follows:
kubectl annotate node mynode --overwrite sysdig.com/region=india sysdig.com/scraper=true haproxy_scraper=true yourhost.com/haproxy_username=admin yourhost.com/haproxy_password=admin
In this example, you set annotations on a node to trigger scraping of the prom_2 and haproxy services as defined in the above configuration.
Preparing Container Environments
An example configuration for Docker environment is given below:
prometheus:
remote_services:
- prom_container:
container.label.com.sysdig.scrape_xyz: true
conf:
url: "https://xyz:5005/metrics"
tags:
host: xyz
service: xyz
In order for remote scraping to work in a Docker-based container
environment, set the com.sysdig.scrape_xyz=true label on the Agent
container. For example:
docker run -d --name sysdig-agent --restart always --privileged --net host --pid host -l com.sysdig.scrape_xyz=true -e ACCESS_KEY=<KEY> -e COLLECTOR=<COLLECTOR> -e SECURE=true -e TAGS=example_tag:example_value -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro --shm-size=512m sysdig/agent
Substitute <KEY>, <COLLECTOR>, and TAGS with your account key, collector, and tags, respectively.
Syntax of the Rules
The syntax of the rules for the remote_services
is almost identical to
those of process_filter
with the exception of the include/exclude rules.
The remote_services
section does not use include/exclude rules.
The process_filter
uses include and exclude rules of which only the
first match against a process is applied, whereas, in
the remote_services
section, each rule has a corresponding service name
and all the matching rules are applied.
Rule Conditions
The rule conditions work the same way as those for the process_filter
.
The only caveat is that the rules will be matched against the Agent
process and container because the remote process/context is unknown.
Therefore, matches for container labels and annotations work as before
but they must be applicable to the Agent container as well. For
instance, node annotations will apply because the Agent container runs
on a node.
For annotations, multiple patterns can be specified in a single rule, in
which case all patterns must match for the rule to be a match (AND
operator). In the following example, the endpoint will not be considered
unless both the annotations match:
kubernetes.node.annotation.sysdig.com/region_scraper: europe
kubernetes.node.annotation.sysdig.com/scraper: true
That is, Kubernetes nodes belonging to only the Europe region are
considered for scraping.
Authenticating Sysdig Agent
The Sysdig agent requires the necessary permissions on the remote host to scrape for metrics. The authentication methods for local scraping also work for authenticating the agent against remote hosts, but the authorization parameters work only in the agent context.
- Authentication based on a certificate-key pair requires the pair to be stored as a Kubernetes secret and mounted to the agent.
- In token-based authentication, make sure the agent token has access rights on the remote endpoint to do the scraping.
- Use annotations to retrieve the username/password instead of passing them in plaintext. Any annotation enclosed in curly braces will be replaced by the value of that annotation. If the annotation doesn’t exist, the value will be an empty string. Token substitution is supported for all the authorization parameters. Because authorization works only in the agent context, credentials cannot be automatically retrieved from the target pod. Therefore, use an annotation on the agent pod to pass them. To do so, set the password in an annotation on the selected Kubernetes object.
In the following example, an HAProxy account is authenticated with the
password supplied in the yourhost.com/haproxy_password
annotation on
the agent node.
- haproxy:
kubernetes.node.annotation.yourhost.com/haproxy_scraper: true
conf:
host: "mymasternode"
port: 1936
path: "/metrics"
username: "{kubernetes.node.annotation.yourhost.com/haproxy_username}"
password: "{kubernetes.node.annotation.yourhost.com/haproxy_password}"
tags:
service: router
8.6.2 -
(Legacy) Integrate Applications (Default App Checks)
The Sysdig agent supports additional application monitoring capabilities with application check scripts, or ‘app checks’. These are a set of plugins that poll for custom metrics from specific applications that export them via status or management pages: e.g. NGINX, Redis, MongoDB, Memcached, and more.
Many app checks are enabled by default in the agent; when a supported application is found, the correct app check script is called and metrics are polled automatically.
However, if default connection parameters are changed in your
application, you will need to modify the app check connection parameters
in the Sysdig Agent configuration file (dragent.yaml)
to match your
application.
In some cases, you may also need to enable the metrics reporting
functionality in the application before the agent can poll them.
This page details how to make configuration changes in the agent’s
configuration file, and provides an application integration example.
Click the Supported Applications links for application-specific details.
Python Version for App Checks:
As of agent version 9.9.0, the default version of Python used for app
checks is Python 3.
Python 2 can still be used by setting the following option in your
dragent.yaml
:
python_binary: <path to python 2.7 binary>
For containerized agents, this path will be: /usr/bin/python2.7
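For example, a containerized agent could be switched back to Python 2 with the following dragent.yaml entry (shown only as an illustration of the option above):
python_binary: /usr/bin/python2.7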
Edit dragent.yaml to Integrate or Modify Application Checks
Out of the box, the Sysdig agent will gather and report on a wide
variety of pre-defined metrics. It can also accommodate any number of
custom parameters for additional metrics collection.
The agent relies on a pair of configuration files to define metrics
collection parameters:
| File | Description |
|---|---|
| dragent.default.yaml | The core configuration file. You can look at it to understand more about the default configurations provided. Location: /opt/draios/etc/dragent.default.yaml. CAUTION: This file should never be edited. |
| dragent.yaml | The configuration file where parameters can be added, either directly in YAML as name/value pairs, or using environment variables such as ADDITIONAL_CONF. Location: /opt/draios/etc/dragent.yaml. |
The “dragent.yaml
” file can be accessed and edited in several ways,
depending on how the agent was installed.
Review Understanding the Agent Config
Files for details.
The examples in this section presume you are entering YAML code directly into dragent.yaml, under the app_checks section.
Find the default settings
To find the default app-checks for already supported applications, check
the dragent.default.yaml
file.
(Location: /opt/draios/etc/dragent.default.yaml
.)
app_checks:
- name: APP_NAME
check_module: APP_CHECK_SCRIPT
pattern:
comm: PROCESS_NAME
conf:
host: IP_ADDR
port: PORT
| Parameter | Sub-parameter | Description | Example |
|---|---|---|---|
| app_checks | | The main section of dragent.default.yaml that contains the list of pre-configured checks. | n/a |
| name | | Every check should have a unique name:, which will be displayed in Sysdig Monitor as the process name of the integrated application. | e.g. MongoDB |
| check_module | | The name of the Python plugin that polls the data from the designated application. All the app check scripts can be found inside the /opt/draios/lib/python/checks.d directory. | e.g. elastic |
| pattern | | This section is used by the Sysdig agent to match a process with a check. Four kinds of keys can be specified, along with any arguments, to help distinguish them. | n/a |
| | comm | Matches the process name as seen in /proc/PID/status. | |
| | port | Matches based on the port used (e.g., MySQL identified by port: 3306). | |
| | arg | Matches any process arguments. | |
| | exe | Matches the process exe as seen in the /proc/PID/exe link. | |
| conf | | This section is specific to each plugin. You can specify any key/values that the plugin supports. | |
| | host | Application-specific. A URL or IP address. | |
| | port | | |
{...}
tokens can be used as values, which will be substituted with
values from process info.
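For example (the check name below is a placeholder), the {port} token resolves to the port that the matched process is listening on:
app_checks:
  - name: my_app
    pattern:
      comm: my_app
    conf:
      host: localhost
      port: "{port}"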
Change the default settings
To override the defaults:
Copy relevant code blocks from dragent.default.yaml
into
dragent.yaml
. (Or copy the code from the appropriate app
check integration page in this documentation section.)
Any entries copied into the dragent.yaml file will override similar entries in dragent.default.yaml.
Never modify dragent.default.yaml
, as it will be overwritten
whenever the agent is updated.
Modify the parameters as needed.
Be sure to use proper YAML. Pay attention to consistent spacing for
indents (as shown) and list all check entries under an app_checks:
section title.
Save the changes and restart the agent.
Use service dragent restart or docker restart sysdig-agent.
Metrics for the relevant application should appear in the Sysdig Monitor
interface under the appropriate name.
Example 1: Change Name and Add Password
Here is a sample app-check entry for Redis. The app_checks
section was
copied from the dragent.default.yaml
file and modified for a specific
instance.
customerid: 831f3-Your-Access-Key-9401
tags: local:sf,acct:dev,svc:db
app_checks:
- name: redis-6380
check_module: redisdb
pattern:
comm: redis-server
conf:
host: 127.0.0.1
port: PORT
password: PASSWORD
Edits made: the check name was changed to redis-6380, and a password entry was added. Because the token PORT is used, it will be translated to the actual port where Redis is listening.
Example 2: Increase Polling Interval
The default interval for an application check to be run by the agent is
set to every second. You can increase the interval per application check by adding the interval: parameter (under the - name: section) with the number of seconds to wait between runs of the script.
interval: must be added to each app check entry that should run less often; there is no global setting.
Example: Run the NTP check once per minute:
app_checks:
- name: ntp
interval: 60
pattern:
comm: systemd
conf:
host: us.pool.ntp.org
Disabling
Disable a Single Application Check
Sometimes the default configuration shipped with the Sysdig agent does
not work for you or you may not be interested in checks for a single
application. To turn a single check off, add an entry like this to
disable it:
app_checks:
- name: nginx
enabled: false
This entry overrides the default configuration of the nginx
check,
disabling it.
If you are using the ADDITIONAL_CONF
parameter to modify your
container agent’s configuration, you would add an entry like this to
your Docker run command (or Kubernetes manifest):
-e ADDITIONAL_CONF="app_checks:\n - name: nginx\n enabled: false\n"
Disable ALL Application Checks
If you do not need it or otherwise want to disable the application check
functionality, you can add the following entry to the agent’s user
settings configuration file /opt/draios/etc/dragent.yaml
:
app_checks_enabled: false
Restart the agent as shown immediately above for either the native Linux
agent installation or the container agent installation.
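If you configure a containerized agent through the ADDITIONAL_CONF mechanism shown earlier rather than by editing dragent.yaml directly, a sketch of the equivalent setting in the Docker run command would be:
-e ADDITIONAL_CONF="app_checks_enabled: false"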
Sysdig allows custom application check-script configurations to be
created for each individual container in the infrastructure, via the
environment variable SYSDIG_AGENT_CONF
. This avoids the need for
multiple edits and entries to achieve the container-specific
customization, by enabling application teams to configure their own
checks.
The SYSDIG_AGENT_CONF variable stores a YAML-formatted configuration for the app check and is used to match app-check configurations. It can be set directly within the Dockerfile.
The syntax is the same as the dragent.yaml syntax.
The example below defines a per container app-check for Redis in the
Dockerfile, using the SYSDIG_AGENT_CONF
environment variable:
FROM redis
# This config file adds a password for accessing redis instance
ADD redis.conf /
ENV SYSDIG_AGENT_CONF { "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }
ENTRYPOINT ["redis-server"]
CMD [ "/redis.conf" ]
The example below shows how parameters can be added to a container started with docker run, by using the -e/--env flag, or by injecting the parameters through an orchestration system (for example, Kubernetes):
PER_CONTAINER_CONF='{ "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }'
docker run --name redis -v /tmp/redis.conf:/etc/redis.conf -e SYSDIG_AGENT_CONF="${PER_CONTAINER_CONF}" -d redis /etc/redis.conf
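As a sketch of the orchestrated approach mentioned above, the same value could be injected through a Kubernetes pod spec; the fragment below is hypothetical:
containers:
  - name: redis
    image: redis
    env:
      - name: SYSDIG_AGENT_CONF
        value: '{ "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }'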
Metrics Limit
Metric limits are defined by your payment plan. If more metrics are
needed please contact your sales representative with your use case.
Note that a metric with the same name but a different tag will be counted as a unique metric by the agent. For example, a metric 'user.clicks' with the tag 'country=us' and another 'user.clicks' with the tag 'country=it' are considered two metrics, which count towards the limit.
Supported Applications
Below is the list of supported applications that the agent will automatically poll.
Some app-check scripts will need to be configured since no defaults
exist, while some applications may need to be configured to output their
metrics. Click a highlighted link to see application-specific notes.
- Active MQ
- Apache
- Apache CouchDB
- Apache HBase
- Apache Kafka
- Apache Zookeeper
- Consul
- CEPH
- Couchbase
- Elasticsearch
- etcd
- fluentd
- Gearman
- Go
- Gunicorn
- HAProxy
- HDFS
- HTTP
- Jenkins
- JVM
- Lighttpd
- Memcached
- Mesos/Marathon
- MongoDB
- MySQL
- NGINX and NGINX Plus
- NTP
- PGBouncer
- PHP-FPM
- Postfix
- PostgreSQL
- Prometheus
- RabbitMQ
- RedisDB
- Supervisord
- SNMP
- TCP
You can also create custom app checks for applications not supported by default.
8.6.2.1 -
Apache
The Apache web server is open-source software for web server creation, deployment, and management. If Apache is installed in your environment, the Sysdig agent will connect using Apache's mod_status module. You may need to edit the default entries in the agent configuration file to connect. See the Default Configuration, below.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Apache Setup
Install mod_status
on your Apache servers and enable
ExtendedStatus.
The following configuration is required. If it is already present, uncomment the lines; otherwise, add the configuration.
LoadModule status_module modules/mod_status.so
...
<Location /server-status>
SetHandler server-status
Order Deny,Allow
Deny from all
Allow from localhost
</Location>
...
ExtendedStatus On
Sysdig Agent Configuration
Review how to edit dragent.yaml to Integrate or Modify Application
Checks.
Apache has a common default for exposing metrics. The process command
name can be either apache2
or httpd
. By default, the Sysdig agent
will look for the process apache2
. If named differently in your
environment (e.g. httpd
), edit the configuration file to match the
process name as shown in Example 1.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Apache and collect all metrics.
app_checks:
- name: apache
check_module: apache
pattern:
comm: apache2
conf:
apache_status_url: "http://localhost:{port}/server-status?auto"
log_errors: false
Example
If it is necessary to edit dragent.yaml
to change the process name,
use the following example and update the comm
with the value httpd.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
app_checks:
- name: apache
check_module: apache
pattern:
comm: httpd
conf:
apache_status_url: "http://localhost/server-status?auto"
log_errors: false
Metrics Available
The Apache metrics are listed in the metrics dictionary here: Apache
Metrics.
UI Examples

8.6.2.2 -
Apache Kafka
Apache Kafka is a distributed streaming
platform. Kafka is used for building real-time data pipelines and
streaming apps. It is horizontally scalable, fault-tolerant, wicked
fast, and runs in production in thousands of companies. If Kafka is
installed on your environment, the Sysdig agent will automatically
connect. See the Default Configuration, below.
The Sysdig agent automatically collects metrics from Kafka via JMX
polling. You need to provide consumer names and topics in the agent
config file to collect consumer-based Kafka metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Kafka Setup
Kafka will automatically expose all metrics. You do not need to add
anything on the Kafka instance.
Zstandard, one of the compression types available in the Kafka integration, is only included in Kafka versions 2.1.0 and newer. See also the Apache documentation.
Sysdig Agent Configuration
Review how to edit dragent.yaml to Integrate or Modify Application
Checks.
Metrics from Kafka via JMX polling are already configured in the agent’s
default-settings configuration file. Metrics for consumers, however,
need to use app-checks to poll the Kafka and Zookeeper API. You need to
provide consumer names and topics in dragent.yaml
file.
Default Configuration
Since consumer names and topics are environment-specific, a default
configuration is not present in dragent.default.yaml
.
Refer to the following examples for adding Kafka checks to
dragent.yaml.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Basic Configuration
A basic example with sample consumer and topic names:
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_connect_str: "127.0.0.1:9092" # kafka address, usually localhost as we run the check on the same instance
zk_connect_str: "localhost:2181" # zookeeper address, may be different than localhost
zk_prefix: /
consumer_groups:
sample-consumer-1: # sample consumer name
sample-topic-1: [0, ] # sample topic name and partitions
sample-consumer-2: # sample consumer name
sample-topic-2: [0, 1, 2, 3] # sample topic name and partitions
Example 2: Store Consumer Group Info (Kafka 9+)
From Kafka 9 onwards, you can store consumer group config info inside
Kafka itself for better performance.
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_connect_str: "localhost:9092"
zk_connect_str: "localhost:2181"
zk_prefix: /
kafka_consumer_offsets: true
consumer_groups:
sample-consumer-1: # sample consumer name
sample-topic-1: [0, ] # sample topic name and partitions
If the kafka_consumer_offsets entry is set to true, the app check will look for consumer offsets in Kafka. The app check will also look in Kafka if zk_connect_str is not set.
Example 3: Aggregate Partitions at the Topic Level
To enable aggregation of partitions at the topic level, use
kafka_consumer_topics
with aggregate_partitions
= true
.
In this case, the app check will aggregate the lag and offset values across partitions at the topic level, reducing the number of metrics collected.
Set aggregate_partitions = false to disable this aggregation. In that case, the app check will show lag and offset values for each partition.
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_connect_str: "localhost:9092"
zk_connect_str: "localhost:2181"
zk_prefix: /
kafka_consumer_offsets: true
kafka_consumer_topics:
aggregate_partitions: true
consumer_groups:
sample-consumer-1: # sample consumer name
sample-topic-1: [0, ] # sample topic name and partitions
sample-consumer-2: # sample consumer name
sample-topic-2: [0, 1, 2, 3] # sample topic name and partitions
Example 4: Add Tags
Optional tags can be applied to every emitted metric, service check, and/or event.
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_connect_str: "localhost:9092"
zk_connect_str: "localhost:2181"
zk_prefix: /
consumer_groups:
sample-consumer-1: # sample consumer name
sample-topic-1: [0, ] # sample topic name and partitions
tags: ["key_first_tag:value_1", "key_second_tag:value_2", "key_third_tag:value_3"]
Example 5: SSL and Authentication
If SSL and authentication are enabled on Kafka, use the following
configuration.
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_consumer_offsets: true
kafka_connect_str: "127.0.0.1:9093"
zk_connect_str: "localhost:2181"
zk_prefix: /
consumer_groups:
test-group:
test: [0, ]
test-4: [0, 1, 2, 3]
security_protocol: SASL_SSL
sasl_mechanism: PLAIN
sasl_plain_username: <USERNAME>
sasl_plain_password: <PASSWORD>
ssl_check_hostname: true
ssl_cafile: <SSL_CA_FILE_PATH>
#ssl_context: <SSL_CONTEXT>
#ssl_certfile: <CERT_FILE_PATH>
#ssl_keyfile: <KEY_FILE_PATH>
#ssl_password: <PASSWORD>
#ssl_crlfile: <SSL_FILE_PATH>
Configuration Keywords and Descriptions
| Keyword | Description | Default |
|---|---|---|
| security_protocol (str) | Protocol used to communicate with brokers. | PLAINTEXT |
| sasl_mechanism (str) | String picking the SASL mechanism when security_protocol is SASL_PLAINTEXT or SASL_SSL. | Currently only PLAIN is supported |
| sasl_plain_username (str) | Username for SASL PLAIN authentication. | |
| sasl_plain_password (str) | Password for SASL PLAIN authentication. | |
| ssl_context (ssl.SSLContext) | Pre-configured SSLContext for wrapping socket connections. If provided, all other ssl_* configurations will be ignored. | none |
| ssl_check_hostname (bool) | Flag to configure whether the SSL handshake should verify that the certificate matches the broker's hostname. | true |
| ssl_cafile (str) | Optional filename of the CA file to use in certificate verification. | none |
| ssl_certfile (str) | Optional filename of a file in PEM format containing the client certificate, as well as any CA certificates needed to establish the certificate's authenticity. | none |
| ssl_keyfile (str) | Optional filename containing the client private key. | none |
| ssl_password (str) | Optional password to be used when loading the certificate chain. | none |
| ssl_crlfile (str) | Optional filename containing the CRL to check for certificate expiration. By default, no CRL check is done. When providing a file, only the leaf certificate will be checked against this CRL. The CRL can only be checked with Python 2.7.9+. | none |
Example 6: Regex for Consumer Groups and Topics
As of Sysdig agent version 0.94, the Kafka app check has added
optional regex (regular expression) support for Kafka consumer groups
and topics.
Regex Configuration:
- No new metrics are added with this feature.
- The new parameter consumer_groups_regex is added, which includes regex for consumers and topics from Kafka. Consumer offsets stored in Zookeeper are not collected.
- Regex for topics is optional. When not provided, all topics under the consumer will be reported.
- The regex Python syntax is documented here: https://docs.python.org/3.7/library/re.html#regular-expression-syntax
- If both consumer_groups and consumer_groups_regex are provided at the same time, matched consumer groups from both parameters will be merged.
Sample configuration:
app_checks:
- name: kafka
check_module: kafka_consumer
pattern:
comm: java
arg: kafka.Kafka
conf:
kafka_connect_str: "localhost:9092"
zk_connect_str: "localhost:2181"
zk_prefix: /
kafka_consumer_offsets: true
# Regex can be provided in following format
# consumer_groups_regex:
# 'REGEX_1_FOR_CONSUMER_GROUPS':
# - 'REGEX_1_FOR_TOPIC'
# - 'REGEX_2_FOR_TOPIC'
consumer_groups_regex:
'consumer*':
- 'topic'
- '^topic.*'
- '.*topic$'
- '^topic.*'
- 'topic\d+'
- '^topic_\w+'
| Example regex | Description | Matches | Does not match |
|---|---|---|---|
| topic_\d+ | All strings having the keyword topic followed by _ and one or more digit characters (equal to [0-9]) | my-topic_1, topic_23, topic_5-dev | topic_x, my-topic-1, topic-123 |
| topic | All strings having the topic keyword | topic_x, x_topic123 | xyz |
| consumer* | All strings having the consumer keyword | consumer-1, sample-consumer, sample-consumer-2 | xyz |
| ^topic_\w+ | All strings starting with topic followed by _ and one or more word characters (equal to [a-zA-Z0-9_]) | topic_12, topic_x, topic_xyz_123 | topic-12, x_topic, topic__xyz |
| ^topic.* | All strings starting with topic | topic-x, topic123 | x-topic, x_topic123 |
| .*topic$ | All strings ending with topic | x_topic, sampletopic | topic-1, x_topic123 |
Metrics Available
Kafka Consumer Metrics (App Checks)
See Apache Kafka Consumer
Metrics.
JMX Metrics
See Apache Kafka JMX
Metrics.
Result in the Monitor UI

8.6.2.3 -
Consul
Consul is a distributed service mesh to
connect, secure, and configure services across any runtime platform and
public or private cloud. If Consul is installed on your environment, the
Sysdig agent will automatically connect and collect basic metrics. If
the Consul Access Control List (ACL) is configured, you may need to edit
the default entries to connect. Also, additional latency metrics can be
collected by modifying default entries. See the Default Configuration,
below.
It’s easy! Sysdig automatically detects metrics from this app based on
standard default configurations.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Consul Configuration
Consul is ready to expose metrics without any special configuration.
Sysdig Agent Configuration
Review how to edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml uses the following code to connect with Consul and collect basic metrics.
app_checks:
- name: consul
pattern:
comm: consul
conf:
url: "http://localhost:8500"
catalog_checks: yes
With the dragent.default.yaml file, the following metrics are available in the Sysdig Monitor UI:
Metrics name |
---|
consul.catalog.nodes_critical |
consul.catalog.nodes_passing |
consul.catalog.nodes_up |
consul.catalog.nodes_warning |
consul.catalog.total_nodes |
consul.catalog.services_critical |
consul.catalog.services_passing |
consul.catalog.services_up |
consul.catalog.services_warning |
consul.peers |
Additional metrics and events can be collected by adding configuration to the dragent.yaml file. The ACL token must be provided if ACLs are enabled. See the following examples.
Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.
Example 1: Enable Leader Change Event
When self_leader_check is enabled, the node will watch for itself to become the leader and will emit an event when that happens. It can be enabled on all nodes.
app_checks:
- name: consul
pattern:
comm: consul
conf:
url: "http://localhost:8500"
catalog_checks: yes
self_leader_check: yes
logs_enabled: true
Example 2: Enable Latency Metrics
If the network_latency_checks
flag is enabled, then the Consul network
coordinates will be retrieved and the latency calculated for each node
and between data centers.
app_checks:
- name: consul
pattern:
comm: consul
conf:
url: "http://localhost:8500"
catalog_checks: yes
network_latency_checks: yes
logs_enabled: true
With the above changes, additional network latency metrics become available.
Example 3: Enable ACL Token
When the ACL System
is enabled in Consul, the ACL Agent Token
must
be added in dragent.yaml
in order to collect metrics.
Follow Consul’s official documentation to Configure
ACL,
Bootstrap
ACL and
Create Agent
Token.
app_checks:
- name: consul
pattern:
comm: consul
conf:
url: "http://localhost:8500"
acl_token: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" #Add agent token
catalog_checks: yes
logs_enabled: true
Example 4: Collect Metrics from Non-Leader Node
Required: Agent 9.6.0+
With agent 9.6.0, you can use the configuration option
single_node_install
(Optional. Default: false
). Set this option to
true
and the app check will be performed on non-leader nodes of
Consul.
app_checks:
- name: consul
pattern:
comm: consul
conf:
url: "http://localhost:8500"
catalog_checks: yes
single_node_install: true
StatsD Metrics
In addition to the metrics from the Sysdig app-check, there are many
other metrics that Consul can send using StatsD. Those metrics will be
automatically collected by the Sysdig agent’s StatsD integration if
Consul is configured to send them.
Add statsd_address under telemetry in the Consul config file. The default config file location is /consul/config/local.json:
{
...
"telemetry": {
"statsd_address": "127.0.0.1:8125"
}
...
}
See Telemetry Metrics
for more details.
Metrics Available
See Consul Metrics.
Result in the Monitor UI

8.6.2.4 -
Couchbase
Couchbase Server is a distributed,
open-source, NoSQL database
engine. The core architecture is designed to simplify building modern
applications with a flexible data model and simpler high availability,
high scalability, high performance, and advanced security. If Couchbase
is installed on your environment, the Sysdig agent will automatically
connect. If authentication is configured, you may need to edit the
default entries to connect. See the Default Configuration, below.
The Sysdig agent automatically collects all bucket and node metrics. You
can also edit the configuration to collect query metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Couchbase Setup
Couchbase will automatically expose all metrics. You do not need to
configure anything on the Couchbase instance.
Sysdig Agent Configuration
Review how to edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Couchbase and collect all bucket and node metrics.
app_checks:
- name: couchbase
pattern:
comm: beam.smp
arg: couchbase
port: 8091
conf:
server: http://localhost:8091
If authentication is enabled, you need to edit dragent.yaml
file to
connect with Couchbase. See Example 1.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Authentication
Replace <username>
and <password>
with appropriate values and update
the dragent.yaml
file.
app_checks:
- name: couchbase
pattern:
comm: beam.smp
arg: couchbase
port: 8091
conf:
server: http://localhost:8091
user: <username>
password: <password>
# The following block is optional and required only if the 'path' and
# 'port' need to be set to non-default values specified here
cbstats:
port: 11210
path: /opt/couchbase/bin/cbstats
Example 2: Query Stats
Additionally, you can configure query_monitoring_url
to get query
monitoring stats. This is available from Couchbase version 4.5. See
Query
Monitoring
for more detail.
app_checks:
- name: couchbase
pattern:
comm: beam.smp
arg: couchbase
port: 8091
conf:
server: http://localhost:8091
query_monitoring_url: http://localhost:8093
Metrics Available
See Couchbase Metrics.
Result in the Monitor UI

8.6.2.5 -
Elasticsearch
Elasticsearch is an open-source, distributed,
document storage and search engine that stores and retrieves data
structures in near real-time. Elasticsearch represents data in the form
of structured JSON documents and makes full-text search accessible via
RESTful API and web clients for languages like PHP, Python, and Ruby.
It’s also elastic in the sense that it’s easy to scale
horizontally—simply add more nodes to distribute the load. If
Elasticsearch is installed in your environment, the Sysdig agent will automatically connect in most cases. See the Default Configuration, below.
The Sysdig Agent automatically collects default metrics. You can also
edit the configuration to collect Primary
Shard
stats.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Elasticsearch Setup
Elasticsearch is ready to expose metrics without any special
configuration.
Sysdig Agent Configuration
Review how to edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Elasticsearch and collect basic metrics.
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: http://localhost:9200
For more metrics, you may need to change the elasticsearch default
setting in dragent.yaml
:
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Agent Authentication to an Elasticsearch Cluster
Password Authentication
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: https://sysdigcloud-elasticsearch:9200
username: readonly
password: some_password
ssl_verify: false
Certificate Authentication
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: https://localhost:9200
ssl_cert: /tmp/certs/ssl.crt
ssl_key: /tmp/certs/ssl.key
ssl_verify: true
ssl_cert
: Path to the certificate chain used for validating the
authenticity of the Elasticsearch server.
ssl_key
: Path to the certificate key used for authenticating to the
Elasticsearch server.
Example 2: Enable Primary shard Statistics
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: http://localhost:9200
pshard_stats : true
pshard-specific Metrics
Enable pshard_stats
to monitor the following additional metrics:
Metric Name |
---|
elasticsearch.primaries.flush.total |
elasticsearch.primaries.flush.total.time |
elasticsearch.primaries.docs.count |
elasticsearch.primaries.docs.deleted |
elasticsearch.primaries.get.current |
elasticsearch.primaries.get.exists.time |
elasticsearch.primaries.get.exists.total |
elasticsearch.primaries.get.missing.time |
elasticsearch.primaries.get.missing.total |
elasticsearch.primaries.get.time |
elasticsearch.primaries.get.total |
elasticsearch.primaries.indexing.delete.current |
elasticsearch.primaries.indexing.delete.time |
elasticsearch.primaries.indexing.delete.total |
elasticsearch.primaries.indexing.index.current |
elasticsearch.primaries.indexing.index.time |
elasticsearch.primaries.indexing.index.total |
elasticsearch.primaries.merges.current |
elasticsearch.primaries.merges.current.docs |
elasticsearch.primaries.merges.current.size |
elasticsearch.primaries.merges.total |
elasticsearch.primaries.merges.total.docs |
elasticsearch.primaries.merges.total.size |
elasticsearch.primaries.merges.total.time |
elasticsearch.primaries.refresh.total |
elasticsearch.primaries.refresh.total.time |
elasticsearch.primaries.search.fetch.current |
elasticsearch.primaries.search.fetch.time |
elasticsearch.primaries.search.fetch.total |
elasticsearch.primaries.search.query.current |
elasticsearch.primaries.search.query.time |
elasticsearch.primaries.search.query.total |
elasticsearch.primaries.store.size |
Example 3: Enable Primary shard Statistics for Master Node only
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: http://localhost:9200
pshard_stats_master_node_only: true
Note that this option takes precedence over the pshard_stats
option
(above). This means that if the following configuration were put into
place, only the pshard_stats_master_node_only
option would be
respected:
app_checks:
- name: elasticsearch
check_module: elastic
pattern:
port: 9200
comm: java
conf:
url: http://localhost:9200
pshard_stats: true
pshard_stats_master_node_only: true
All Available Metrics
With the default settings and the pshard
setting, the total available
metrics are listed here: Elasticsearch
Metrics.
Result in the Monitor UI

8.6.2.6 -
etcd
etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. If etcd is installed in your environment, the Sysdig agent will automatically connect. If you are using etcd older than version 2, you may need to edit the default entries to connect. See the Default Configuration section, below.
The Sysdig Agent automatically collects all metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
etcd Versions
etcd v2
The app check functionality described on this page supports etcd
metrics from APIs that are specific to v2 of etcd.
These APIs are present in etcd v3 as well, but export metrics only
for the v2 datastores. For example, after upgrading from etcd v2 to v3,
if the v2 datastores are not migrated to v3, the v2 APIs will continue
exporting metrics for these datastores. If the v2 datastores are
migrated to v3, the v2 APIs will no longer export metrics for these
datastores.
etcd v3
etcd v3 uses a native Prometheus exporter. The exporter only exports
metrics for v3 datastores. For example, after upgrading from etcd v2 to
v3, if v2 datastores are not migrated to v3, the Prometheus endpoint
will not export metrics for these datastores. The Prometheus endpoint
will only export metrics for datastores migrated to v3 or datastores
created after the upgrade to v3.
If your etcd version is v3 or higher, use the information on this page
to enable an integration: Integrate Prometheus
Metrics.
etcd Setup
etcd will automatically expose all metrics. You do not need to add
anything to the etcd instance.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
The default agent configuration for etcd will look for the application
on localhost, port 2379.
No customization is required.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with etcd and collect all metrics.
app_checks:
- name: etcd
pattern:
comm: etcd
conf:
url: "http://localhost:2379"
etcd (before version 2) does not listen on localhost
, so the Sysdig
agent will not connect to it automatically. In that case, you may need to edit the dragent.yaml file with the hostname and port. See Example 1.
Alternatively, you can add the option -bind-addr 0.0.0.0:4001
to the
etcd command line to allow the agent to connect.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1
You can use {hostname} and {port} as tokens in the conf: section. This is the recommended setting for Kubernetes customers.
app_checks:
- name: etcd
pattern:
comm: etcd
conf:
url: "http://{hostname}:{port}"
Alternatively you can specify the real hostname and port.
app_checks:
- name: etcd
pattern:
comm: etcd
conf:
url: "http://my_hostname:4000" #etcd service listening on port 4000
Example 2: SSL/TLS Certificate
If encryption is used, add the appropriate SSL/TLS entries. Provide
correct path of SSL/TLS key and certificates used in etcd configuration
in fields ssl_keyfile, ssl_certfile, ssl_ca_certs
.
app_checks:
- name: etcd
pattern:
comm: etcd
conf:
url: "https://localhost:PORT"
ssl_keyfile: /etc/etcd/peer.key # Path to key file
ssl_certfile: /etc/etcd/peer.crt # Path to SSL certificate
ssl_ca_certs: /etc/etcd/ca.crt # Path to CA certificate
ssl_cert_validation: True
Metrics Available
See etcd Metrics.
Result in the Monitor UI

8.6.2.7 -
fluentd
Fluentd is an open source data collector,
which allows unifying data collection and consumption to better use and
understand data. Fluentd structures data as JSON as much as possible, to
unify all facets of processing log data: collecting, filtering,
buffering, and outputting logs across multiple sources and destinations.
If Fluentd is installed in your environment, the Sysdig agent will automatically connect. See the Default Configuration section, below.
The Sysdig agent automatically collects default metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Fluentd Setup
Fluentd can be installed as a package (.deb, .rpm, etc) depending on the
OS flavor, or it can be deployed in a Docker container. Fluentd
installation is documented
here. For the
examples on this page, a .deb package
installation is
used.
After installing Fluentd, add following lines in fluentd.conf
:
<source>
@type monitor_agent
bind 0.0.0.0
port 24220
</source>
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml uses the following code to connect with Fluentd and collect default metrics.
(If you use a non-standard port for monitor_agent
, you can
configure it as usual in the agent config file dragent.yaml.)
- name: fluentd
pattern:
comm: fluentd
conf:
monitor_agent_url: http://localhost:24220/api/plugins.json
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
To generate the metric data, it is necessary to generate some logs
through an application. In the following example, HTTP is used. (For
more information, see Life of a Fluentd
event.)
Execute the following command in the Fluentd environment:
$ curl -i -X POST -d 'json={"action":"login","user":2}' http://localhost:8888/test.cycle
Expected output: (Note: Here the status code is 200 OK, as HTTP traffic
is successfully generated; it will vary per application.)
HTTP/1.1 200 OK
Content-type: text/plain
Connection: Keep-Alive
Content-length: 0
Metrics Available
See fluentd Metrics.
Result in the Monitor UI

8.6.2.8 -
Go
Golang expvar is the standard interface designed to instrument and expose custom metrics from a Go program via HTTP. In addition to custom metrics, it also exports some metrics out-of-the-box, such as command line arguments, allocation stats, heap stats, and garbage collection metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Go_expvar Setup
You will need to create a custom entry in the user settings config file
for your Go application, due to the difficulty in determining if an
application is written in Go by looking at process names or arguments.
Be sure your app has expvars
enabled, which means importing the
expvar
module and having an HTTP server started from inside your
app, as follows:
import (
...
"net/http"
"expvar"
...
)
// If your application has no http server running for the DefaultServeMux,
// you'll have to have a http server running for expvar to use, for example
// by adding the following to your init function
func init() {
go http.ListenAndServe(":8080", nil)
}
// You can also expose variables that are specific to your application
// See http://golang.org/pkg/expvar/ for more information
var (
exp_points_processed = expvar.NewInt("points_processed")
)
func processPoints(p RawPoints) {
points_processed, err := parsePoints(p)
exp_points_processed.Add(points_processed)
...
}
See also the following blog entry: How to instrument Go code with
custom expvar
metrics.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
No default configuration for Go is provided in the Sysdig agent
dragent.default.yaml
file. You must edit the agent config file as
described in Example 1.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
Add the following code sample to dragent.yaml
to collect Go metrics.
app_checks:
- name: go-expvar
check_module: go_expvar
pattern:
comm: go-expvar
conf:
expvar_url: "http://localhost:8080/debug/vars" # automatically match url using the listening port
# Add custom metrics if you want
metrics:
- path: system.numberOfSeconds
type: gauge # gauge or rate
alias: go_expvar.system.numberOfSeconds
- path: system.lastLoad
type: gauge
alias: go_expvar.system.lastLoad
- path: system.numberOfLoginsPerUser/.* # You can use / to get inside the map and use .* to match any record inside
type: gauge
- path: system.allLoad/.*
type: gauge
Metrics Available
See Go Metrics.
Result in the Monitor UI

8.6.2.9 -
HAProxy
HAProxy provides a high-availability load
balancer and proxy server for TCP- and HTTP-based applications which
spreads requests across multiple servers.
The Sysdig agent automatically collects haproxy
metrics. You can also
edit the agent configuration file to collect additional metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
HAProxy Setup
The stats
feature must be enabled on your HAProxy instance. This can
be done by adding the following entry to the HAProxy configuration file
/etc/haproxy/haproxy.cfg
listen stats
bind :1936
mode http
stats enable
stats hide-version
stats realm Haproxy\ Statistics
stats uri /haproxy_stats
stats auth stats:stats
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with HAProxy and collect haproxy metrics:
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/
collect_aggregates_only: True
log_errors: false
You can get a few additional status metrics by editing the configuration
in dragent.yaml,
as in the following examples.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
Example: Collect Status Metrics Per Service
Enable the collect_status_metrics flag to collect the metrics haproxy.count_per_status and haproxy.backend_hosts.
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/haproxy_stats
collect_aggregates_only: True
collect_status_metrics: True
log_errors: false
Example: Collect Status Metrics Per Host
Enable the following flags:
- collect_status_metrics_by_host: Instructs the check to collect status metrics per host, instead of per service. This only applies if collect_status_metrics is true.
- tag_service_check_by_host: When this flag is set, the hostname is also passed with the service check haproxy.backend_up. By default, only the backend name and service name are associated with it.
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/haproxy_stats
collect_aggregates_only: True
collect_status_metrics: True
collect_status_metrics_by_host: True
tag_service_check_by_host: True
log_errors: false
Example: Collect HAProxy Stats by UNIX Socket
If you’ve configured HAProxy to report statistics to a UNIX socket, you
can set the url
in dragent.yaml
to the socket’s path (e.g.,
unix:///var/run/haproxy.sock).
Set up HAProxy Config File
Edit your HAProxy configuration file ( /etc/haproxy/haproxy.cfg
)
to add the following lines to the global
section:
global
[snip]
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
[snip]
Edit dragent.yaml url
Add the socket URL from the HAProxy config to the dragent.yaml file:
app_checks:
- name: haproxy
pattern:
comm: haproxy
conf:
url: unix:///run/haproxy/admin.sock
log_errors: True
Metrics Available
See HAProxy Metrics.
Example: Enable Service Check
Required: Agent 9.6.0+
enable_service_check
: Enable/Disable service
check haproxy.backend.up
.
When set to false
, all service checks will be disabled.
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/haproxy_stats
collect_aggregates_only: true
enable_service_check: false
Example: Filter Metrics Per Service
Required: Agent 9.6.0+
- services_exclude (Optional): Name or regex of services to be excluded.
- services_include (Optional): Name or regex of services to be included.
If a service is excluded with services_exclude, it can still be included explicitly by services_include. The following example excludes all services except service_1 and service_2.
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/haproxy_stats
collect_aggregates_only: true
services_exclude:
- ".*"
services_include:
- "service_1"
- "service_2"
Required: Agent 9.6.0+
There are two additional configuration options introduced with agent 9.6.0:
- active_tag (Optional. Default: false): Adds the tag active to backend metrics that belong to the active pool of connections.
- headers (Optional): Extra headers, such as an auth-token, can be passed along with requests.
app_checks:
- name: haproxy
pattern:
comm: haproxy
port: 1936
conf:
username: stats
password: stats
url: http://localhost:1936/haproxy_stats
collect_aggregates_only: true
active_tag: true
headers:
<HEADER_NAME>: <HEADER_VALUE>
<HEADER_NAME>: <HEADER_VALUE>
Result in the Monitor UI

8.6.2.10 -
HTTP
The HTTP check monitors HTTP-based applications for URL availability.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
HTTP Setup
You do not need to configure anything on HTTP-based applications for the
Sysdig agent to connect.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
No default entry is present in the dragent.default.yaml
for the HTTP
check. You need to add an entry in dragent.yaml
as shown in following
examples.
Never edit dragent.default.yaml
directly; always edit only
dragent.yaml
.
Example 1
First you must identify the process pattern (comm:). It must match an actively running process for the HTTP check to work. Sysdig recommends that the process be the one serving the URL being checked.
If the URL is remote from the agent, use a process that is always running, such as systemd.
Confirm the comm value using the following command:
cat /proc/1/comm
Add the following entry to the dragent.yaml file and modify the name:, comm:, and url: parameters as needed:
app_checks:
- name: EXAMPLE_WEBSITE
check_module: http_check
pattern:
comm: systemd
conf:
url: https://www.MYEXAMPLE.com
Example 2
There are multiple configuration options available with the HTTP check.
A full list is provided in the table following Example 2. These keys
should be listed under the conf:
section of the configuration in
Example 1.
app_checks:
- name: EXAMPLE_WEBSITE
check_module: http_check
pattern:
comm: systemd
conf:
url: https://www.MYEXAMPLE.com
# timeout: 1
# method: get
# data:
# <KEY>: <VALUE>
# content_match: '<REGEX>'
# reverse_content_match: false
# username: <USERNAME>
# ntlm_domain: <DOMAIN>
# password: <PASSWORD>
# client_cert: /opt/client.crt
# client_key: /opt/client.key
# http_response_status_code: (1|2|3)\d\d
# include_content: false
# collect_response_time: true
# disable_ssl_validation: true
# ignore_ssl_warning: false
# ca_certs: /etc/ssl/certs/ca-certificates.crt
# check_certificate_expiration: true
# days_warning: <THRESHOLD_DAYS>
# check_hostname: true
# ssl_server_name: <HOSTNAME>
# headers:
# Host: alternative.host.example.com
# X-Auth-Token: <AUTH_TOKEN>
# skip_proxy: false
# allow_redirects: true
# include_default_headers: true
# tags:
# - <KEY_1>:<VALUE_1>
# - <KEY_2>:<VALUE_2>
| Option | Description |
|---|---|
| url | The URL to test. |
| timeout | The time in seconds to allow for a response. |
| method | The HTTP method. This setting defaults to GET, though many other HTTP methods are supported, including POST and PUT. |
| data | The data option is only available when using the POST method. Data should be included as key-value pairs and will be sent in the body of the request. |
| content_match | A string or Python regular expression. The HTTP check will search for this value in the response and will report as DOWN if the string or expression is not found. |
| reverse_content_match | When true, reverses the behavior of the content_match option, i.e. the HTTP check will report as DOWN if the string or expression in content_match IS found. (default is false) |
| username & password | If your service uses basic authentication, you can provide the username and password here. |
| http_response_status_code | A string or Python regular expression for an HTTP status code. This check will report DOWN for any status code that does not match. This defaults to 1xx, 2xx and 3xx HTTP status codes. For example: 401 or 4\d\d. |
| include_content | When set to true, the check will include the first 200 characters of the HTTP response body in notifications. The default value is false. |
| collect_response_time | By default, the check will collect the response time (in seconds) as the metric network.http.response_time. To disable, set this value to false. |
| disable_ssl_validation | This setting will skip SSL certificate validation and is enabled by default. If you require SSL certificate validation, set this to false. This option is only used when gathering the response time/aliveness from the specified endpoint. Note this setting doesn't apply to the check_certificate_expiration option. |
| ignore_ssl_warning | When SSL certificate validation is enabled (see setting above), this setting allows you to disable security warnings. |
| ca_certs | This setting allows you to override the default certificate path as specified in init_config. |
| check_certificate_expiration | When check_certificate_expiration is enabled, the service check will check the expiration date of the SSL certificate. Note that this will cause the SSL certificate to be validated, regardless of the value of the disable_ssl_validation setting. |
| days_warning | When check_certificate_expiration is enabled, this setting will raise a warning alert when the SSL certificate is within the specified number of days from expiration. |
| check_hostname | When check_certificate_expiration is enabled, this setting will raise a warning if the hostname on the SSL certificate does not match the host of the given URL. |
| headers | This parameter allows you to send additional headers with the request, e.g. X-Auth-Token: <AUTH_TOKEN>. |
| skip_proxy | If set, the check will bypass proxy settings and attempt to reach the check URL directly. This defaults to false. |
| allow_redirects | This setting allows the service check to follow HTTP redirects and defaults to true. |
| tags | A list of arbitrary tags that will be associated with the check. |
Metrics Available
HTTP metrics concern response time and SSL certificate expiry
information.
See HTTP Metrics.
Service Checks
http.can_connect:
Returns DOWN when any of the following occur:
- the request to the URL times out
- the response code is 4xx/5xx, or it does not match the pattern provided in http_response_status_code
- the response body does not contain the pattern in content_match
- reverse_content_match is true and the response body does contain the pattern in content_match
- the URI contains https, disable_ssl_validation is false, and the SSL connection cannot be validated
Otherwise, returns UP.
Segmentation of the http.can_connect
can be done by URL.
http.ssl_cert:
The check returns:
To disable this check, set check_certificate_expiration
to false
.
Result in the Monitor UI

8.6.2.11 -
Jenkins
Jenkins is an open-source automation server which
helps to automate part of the software development process, permitting
continuous integration and facilitating the technical aspects of
continuous delivery. It supports version control tools (such as
Subversion, Git, Mercurial, etc), can execute Apache Ant, Apache Maven
and SBT-based projects, and allows shell scripts and Windows batch
commands. If Jenkins is installed on your environment, the Sysdig agent
will automatically connect and collect all Jenkins metrics. See the
Default Configuration section, below.
This page describes the default configuration settings, the metrics
available for integration, and a sample result in the Sysdig Monitor UI.
Jenkins Setup
Requires the standard Jenkins server setup with one or more Jenkins Jobs
running on it.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Jenkins and collect basic metrics.
- name: jenkins
pattern:
comm: java
port: 50000
conf:
name: default
jenkins_home: /var/lib/jenkins #this depends on your environment
Jenkins Folders Plugin
By default, the Sysdig agent does not monitor jobs under job folders created using the Folders plugin.
Set jobs_folder_depth to monitor these jobs. Job folders are scanned recursively for jobs until the designated folder depth is reached. The default value is 1.
app_checks:
- name: jenkins
pattern:
comm: java
port: 50000
conf:
name: default
jenkins_home: /var/lib/jenkins
jobs_folder_depth: 3
Metrics Available
The following metrics will be available only after running one or more
Jenkins jobs. They handle queue size, job duration, and job waiting
time.
See Jenkins Metrics.
Result in the Monitor UI

8.6.2.12 -
Lighttpd
Lighttpd is a secure, fast, compliant, and
very flexible web server that has been optimized for high-performance
environments. It has a very low memory footprint compared to other web
servers and takes care of CPU load. Its advanced feature set (FastCGI,
CGI, Auth, Output Compression, URL Rewriting, and many more) make
Lighttpd the perfect web server software for every server that suffers
load problems. If Lighttpd is installed on your environment, the Sysdig
agent will automatically connect. See the Default Configuration section,
below. The Sysdig agent automatically collects the default metrics.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
At this time, the Sysdig app check for Lighttpd supports Lighttpd
version 1.x.x only.
Lighttpd Setup
For Lighttpd, the status page must be enabled. Add mod_status
in
the /etc/lighttpd/lighttpd.conf
config file:
server.modules = ( ..., "mod_status", ... )
Then configure an endpoint for it. If (for security purposes) you want
to open the status page only to users from the local network, it can be
done by adding the following lines in the
/etc/lighttpd/lighttpd.conf file
:
$HTTP["remoteip"] == "127.0.0.1/8" {
status.status-url = "/server-status"
}
If you want an endpoint to be open for remote users based on
authentication, then the mod_auth module should be enabled in the
/etc/lighttpd/lighttpd.conf
config file:
server.modules = ( ..., "mod_auth", ... )
Then you can add the auth.require parameter in the
/etc/lighttpd/lighttpd.conf
config file:
auth.require = ( "/server-status" => ( "method" => ... , "realm" => ... , "require" => ... ) )
For more information on the auth.require parameter, see the Lighttpd documentation.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Lighttpd and collect basic metrics.
app_checks:
- name: lighttpd
pattern:
comm: lighttpd
conf:
lighttpd_status_url: "http://localhost:{port}/server-status?auto"
log_errors: false
Metrics Available
These metrics are supported for Lighttpd version 1.x.x only. Lighttpd
version 2.x.x is
being built and is NOT ready for use as of this publication.
See Lighttpd Metrics.
Result in the Monitor UI

8.6.2.13 -
Memcached
Memcached is an in-memory key-value store for
small chunks of arbitrary data (strings, objects) from the results of
database calls, API calls, or page rendering. If Memcached is installed
on your environment, the Sysdig agent will automatically connect. See
the Default Configuration section, below. The Sysdig agent automatically
collects basic metrics. You can also edit the configuration to collect
additional metrics related to items and slabs.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Memcached Setup
Memcached will automatically expose all metrics. You do not need to add
anything on the Memcached instance.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Memcached and collect basic metrics:
app_checks:
- name: memcached
check_module: mcache
pattern:
comm: memcached
conf:
url: localhost
port: "{port}"
Additional metrics can be collected by editing Sysdig’s configuration
file dragent.yaml
. If
SASL
is enabled, authentication parameters must be added to dragent.yaml.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Additional Metrics
memcache.items.*
and memcache.slabs.*
can be collected by setting
flags in the options
section, as follows. Either value can be set to
false
if you do not want to collect metrics from them.
app_checks:
- name: memcached
check_module: mcache
pattern:
comm: memcached
conf:
url: localhost
port: "{port}"
options:
items: true # Default is false
slabs: true # Default is false
Example 2: SASL
SASL authentication can be enabled with Memcached (see instructions
here). If
enabled, credentials must be provided in the username
and password
fields, as shown in the example below.
app_checks:
- name: memcached
check_module: mcache
pattern:
comm: memcached
conf:
url: localhost
port: "{port}"
username: <username>
# Some Memcached versions will support <username>@<hostname>.
# If Memcached is installed as a container, the hostname of the Memcached container will be used as the username
password: <password>
Metrics Available
See Memcached Metrics.
Result in the Monitor UI

8.6.2.14 -
Mesos/Marathon
Mesos is built using the same principles as
the Linux kernel, only at a different level of abstraction. The Mesos
kernel runs on every machine and provides applications (e.g., Hadoop,
Spark, Kafka, Elasticsearch) with APIs for resource management and
scheduling across entire datacenter and cloud environments. The Mesos
metrics are divided into master and
agent.
Marathon is a production-grade
container orchestration platform for Apache Mesos.
If Mesos and Marathon are installed in your environment, the Sysdig
agent will automatically connect and start collecting metrics. You may
need to edit the default entries to add a custom configuration if the
default does not work. See the Default Configuration section, below.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Mesos/Marathon Setup
Both Mesos and Marathon will automatically expose all metrics. You do
not need to add anything to the Mesos/Marathon instance.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
The Sysdig agent has different entries for mesos-master, mesos-slave
and marathon
in its configuration file. Default entries are present in
Sysdig’s dragent.default.yaml
file and collect all metrics for Mesos.
For Marathon, it collects basic metrics. You may need to add configuration
to collect additional metrics.
Default Configuration
In the URLs for mesos-master
and mesos-slave, {mesos_url}
will be
replaced with either the hostname of the auto-detected mesos
master/slave (if auto-detection is enabled), or with an explicit value
from mesos_state_uri
otherwise.
In the URLs for marathon, {marathon_url}
will be replaced with the
hostname of the first configured/discovered Marathon framework.
For all Mesos and Marathon apps, {auth_token}
will either be blank or
an auto-generated token obtained via the /acs/api/v1/auth/login
endpoint.
Mesos Master
app_checks:
- name: mesos-master
check_module: mesos_master
interval: 30
pattern:
comm: mesos-master
conf:
url: "http://localhost:5050"
auth_token: "{auth_token}"
mesos_creds: "{mesos_creds}"
Mesos Agent
app_checks:
- name: mesos-slave
check_module: mesos_slave
interval: 30
pattern:
comm: mesos-slave
conf:
url: "http://localhost:5051"
auth_token: "{auth_token}"
mesos_creds: "{mesos_creds}"
Marathon
app_checks:
- name: marathon
check_module: marathon
interval: 30
pattern:
arg: mesosphere.marathon.Main
conf:
url: "{marathon_url}"
auth_token: "{auth_token}"
marathon_creds: "{marathon_creds}"
Remember! Never edit dragent.default.yaml
directly; always edit
dragent.yaml
.
Marathon
Enable the flag full_metrics
to collect all metrics for Marathon, as in the example below.
The following additional metrics are collected with this configuration:
marathon.cpus
marathon.disk
marathon.instances
marathon.mem
app_checks:
- name: marathon
check_module: marathon
interval: 30
pattern:
arg: mesosphere.marathon.Main
conf:
url: "{marathon_url}"
auth_token: "{auth_token}"
marathon_creds: "{marathon_creds}"
full_metrics: true
Metrics Available
See Mesos Master Metrics.
See Mesos Agent Metrics.
See Marathon Metrics.
Result in the Monitor UI
Mesos Master

Mesos Agent

Marathon

8.6.2.15 -
MongoDB
MongoDB is an open-source database
management system (DBMS) that uses a document-oriented database model
that supports various forms of data. If MongoDB is installed in your
environment, the Sysdig agent will automatically connect and collect
basic metrics (if
authentication is not
used). You may need to edit the default entries to connect and collect
additional metrics. See the Default Configuration section, below.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
MongoDB Setup
Create a read-only user for the Sysdig agent.
# Authenticate as the admin user.
use admin
db.auth("admin", "<YOUR_MONGODB_ADMIN_PASSWORD>")
# On MongoDB 2.x, use the addUser command.
db.addUser("sysdig-cloud", "sysdig-cloud-password", true)
# On MongoDB 3.x or higher, use the createUser command.
db.createUser({
"user":"sysdig-cloud",
"pwd": "sysdig-cloud-password",
"roles" : [
{role: 'read', db: 'admin' },
{role: 'clusterMonitor', db: 'admin'},
{role: 'read', db: 'local' }
]
})
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with MongoDB.
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: "mongodb://localhost:{port}/admin"
The default MongoDB entry should work without modification if
authentication is not
configured. If you have enabled password authentication, the entry will
need to be changed.
Some metrics are not available by default. Additional configuration
needs to be provided to collect them as shown in following examples.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: With Authentication
Replace <username> and <password> with actual username and
password.
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: mongodb://<username>:<password>@localhost:{port}/admin
replica_check: true
Example 2: Additional Metrics
Some metrics are not collected by default. These can be collected by
adding additional_metrics
section in the dragent.yaml
file under the
app_checks mongodb
configuration.
Available options are:
collection
- Metrics of the specified collections
metrics.commands
- Use of database commands
tcmalloc
- TCMalloc memory allocator
top
- Usage statistics for each collection
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: mongodb://<username>:<password>@localhost:{port}/admin
replica_check: true
additional_metrics:
- collection
- metrics.commands
- tcmalloc
- top
List of metrics with their respective entries in dragent.yaml:
Metric prefix | Entry under additional_metrics |
---|
mongodb.collection | collection |
mongodb.usage.commands | top |
mongodb.usage.getmore | top |
mongodb.usage.insert | top |
mongodb.usage.queries | top |
mongodb.usage.readLock | top |
mongodb.usage.writeLock | top |
mongodb.usage.remove | top |
mongodb.usage.total | top |
mongodb.usage.update | top |
mongodb.tcmalloc | tcmalloc |
mongodb.metrics.commands | metrics.commands |
Example 3: Collections Metrics
MongoDB stores documents in collections. Collections are analogous to
tables in relational databases. The Sysdig agent by default does not
collect the following collections metrics:
collections
: List of MongoDB collections to be polled by the
agent. Metrics will be collected for the specified set of
collections. This configuration requires the
additional_metrics.collection
section to be present with an entry
for collection
in the dragent.yaml
file. The collection
entry
under additional_metrics
is a flag that enables the collection
metrics.
collections_indexes_stats
: Collect indexes access metrics for
every index in every collection in the collections
list. The
default value is false.
The metric is available starting MongoDB v3.2.
For the agent to poll them, you must configure the dragent.yaml
file
and add an entry corresponding to the metrics to the conf
section as
follows.
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: mongodb://<username>:<password>@localhost:{port}/admin
replica_check: true
additional_metrics:
- collection
- metrics.commands
- tcmalloc
- top
collections:
- <LIST_COLLECTIONS>
collections_indexes_stats: true
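For example, a minimal sketch with two hypothetical collections (orders and users) might look like this:
app_checks:
  - name: mongodb
    check_module: mongo
    pattern:
      comm: mongod
    conf:
      server: mongodb://<username>:<password>@localhost:{port}/admin
      additional_metrics:
        - collection
      collections:
        - orders
        - users
      collections_indexes_stats: true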
You can tighten the security measure of the app check connection with
MongoDB by establishing an SSL connection. To enable secure
communication, you need to set the SSL configuration in dragent.yaml
to true. In an advanced deployment with multiple instances of MongoDB, you
need to include a custom CA certificate or client certificate and other
additional configurations.
Basic SSL Connection
In a basic SSL connection:
A single MongoDB instance is running on the host.
No advanced SSL features, such as a custom CA certificate or client
certificate, are used.
To establish a basic SSL connection between the agent and the MongoDB
instance:
Open the dragent.yaml
file.
Configure the SSL entries as follows:
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: "mongodb://<HOSTNAME>:{port}/admin"
ssl: true
# ssl_cert_reqs: 0 # Disable SSL validation
To disable SSL validation, set ssl_cert_reqs
to 0
. This setting
is equivalent to ssl_cert_reqs=CERT_NONE
.
Advanced SSL Connection
In an advanced SSL connection:
Advanced features, such as custom CA certificate or client
certificate, are configured.
Single or multiple MongoDB instances are running on the host. The agent
is installed either as a container or as a process, as described in the sections below.
Prerequisites
Set up the following:
Custom CA certificate
Client SSL verification
SSL validation
(Optional) SSL Configuration Parameters
ssl_certfile | The certificate file used to identify the local connection with MongoDB. |
ssl_keyfile | The private key file used to identify the local connection with MongoDB. Ignore this option if the key is included with ssl_certfile. |
ssl_cert_reqs | Specifies whether a certificate is required from the MongoDB server, and whether it is validated if provided. Possible values are: 0 for ssl.CERT_NONE (certificates are ignored), 1 for ssl.CERT_OPTIONAL (certificates are not required, but validated if provided), and 2 for ssl.CERT_REQUIRED (certificates are required and validated). |
ssl_ca_certs | The ca_certs file contains a set of concatenated certification authority certificates, which are used to validate certificates used by the MongoDB server. Mostly used when server certificates are self-signed. |
Sysdig Agent as a Container
If Sysdig agent is installed as a container, start it with an extra
volume containing the SSL files mentioned in the agent
configuration. For example:
# extra parameter added: -v /etc/ssl:/etc/ssl
docker run -d --name sysdig-agent --restart always --privileged --net host --pid host -e ACCESS_KEY=xxxxxxxxxxxxx -e SECURE=true -e TAGS=example_tag:example_value -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro -v /etc/ssl:/etc/ssl --shm-size=512m sysdig/agent
Open the dragent.yaml
file and configure the SSL entries:
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: "mongodb://<HOSTNAME>:{port}/admin"
ssl: true
# ssl_ca_certs: </path/to/ca/certificate>
# ssl_cert_reqs: 0 # Disable SSL validation
# ssl_certfile: </path/to/client/certfile>
# ssl_keyfile: </path/to/client/keyfile>
Sysdig Agent as a Process
If Sysdig agent is installed as a process, store the SSL files on
the host and provide the path in the agent configuration.
app_checks:
- name: mongodb
check_module: mongo
pattern:
comm: mongod
conf:
server: "mongodb://<HOSTNAME>:{port}/admin"
ssl: true
# ssl_ca_certs: </path/to/ca/certificate>
# ssl_cert_reqs: 0 # Disable SSL validation
# ssl_certfile: </path/to/client/certfile>
# ssl_keyfile: </path/to/client/keyfile>
See optional SSL configuration
parameters
for information on SSL certificate files.
Multi-MongoDB Setup
In a multi-MongoDB setup, multiple MongoDB instances are running on a
single host. You can configure either a basic or an advanced SSL
connection individually for each MongoDB instance.
Store SSL Files
In an advanced connection, different SSL certificates are used for each
instance of MongoDB on the same host and are stored in separate
directories. For instance, the SSL files corresponding to two different
MongoDB instances can be stored at a mount point as follows:
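For example, an illustrative layout (hypothetical paths, consistent with the ssl_ca_certs entries used in the configuration below) might be:
/etc/ssl/mongo1/ca-cert-1
/etc/ssl/mongo2/ca-cert-2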
Open the dragent.yaml
file.
Configure the SSL entries as follows:
app_checks:
- name: mongodb-ssl-1
check_module: mongo
pattern:
comm: mongod
args: ssl_certificate-1.pem
conf:
server: "mongodb://<HOSTNAME|Certificate_CN>:{port}/admin"
ssl: true
ssl_ca_certs: /etc/ssl/mongo1/ca-cert-1
tags:
- "instance:ssl-1"
- name: mongodb-ssl-2
check_module: mongo
pattern:
comm: mongod
args: ssl_certificate-2.pem
conf:
server: "mongodb://<HOSTNAME|Certificate_CN>:{port}/admin"
ssl: true
ssl_ca_certs: /etc/ssl/mongo2/ca-cert-2
tags:
- "instance:ssl-2"
Replace the names of the instances and certificate files with the
names that you prefer.
Metrics Available
See MongoDB Metrics.
Result in the Monitor UI

8.6.2.16 -
MySQL
MySQL is the world’s most popular open-source
database. With its proven performance, reliability, and ease-of-use,
MySQL has become the leading database choice for web-based applications,
used by high profile web properties including Facebook, Twitter,
YouTube. Additionally, it is an extremely popular choice as an embedded
database, distributed by thousands of ISVs and OEMs.
Supported Distribution
The MySQL AppCheck is supported for the following MySQL versions.
If the Sysdig agent is installed as a Process:
Host with Python 2.7: MySQL versions supported - 5.5 to 8
Host with Python 2.6: MySQL versions supported - 4.1 to 5.7
(tested with v5.x only)
NOTE: This implies that MySQL 5.5, 5.6 and 5.7 are supported on
both the Python 2.6 and 2.7 environments.
If the Sysdig agent is installed as a Docker container:
The Docker container of the Sysdig agent has Python 2.7 installed, so the
MySQL versions listed above for Python 2.7 are supported.
The following environments have been tested and are supported. Test
environments include both the host/process and Docker environments.
Python | MySQL | | | | |
---|
2.7 (Ubuntu 16/ CentOS 7) | No | Yes | Yes | Yes | Yes |
2.6 (CentOS 6) | Yes | Yes | Yes | Yes | No |
MySQL Setup
A user must be created on MySQL so the Sysdig agent can collect metrics.
To configure credentials, run the following commands on your server,
replacing the sysdig-cloud-password
parameter.
MySQL version-specific commands to create a user are provided below.
# MySQL 5.6 and earlier
CREATE USER 'sysdig-cloud'@'127.0.0.1' IDENTIFIED BY 'sysdig-cloud-password';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'sysdig-cloud'@'127.0.0.1' WITH MAX_USER_CONNECTIONS 5;
## OR ##
# MySQL 5.7 and 8
CREATE USER 'sysdig-cloud'@'127.0.0.1' IDENTIFIED BY 'sysdig-cloud-password' WITH MAX_USER_CONNECTIONS 5;
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'sysdig-cloud'@'127.0.0.1';
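Optionally, you can verify that the new user can connect and has the expected grants (a sketch assuming a local server and the credentials created above):
mysql -h 127.0.0.1 -u sysdig-cloud -p'sysdig-cloud-password' -e "SHOW GRANTS FOR CURRENT_USER();"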
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
There is no default configuration for MySQL, as a unique user and
password are required for metrics polling.
Add the entry for MySQL into dragent.yaml
, updating the user
and pass
field credentials.
app_checks:
- name: mysql
pattern:
comm: mysqld
conf:
server: 127.0.0.1
user: sysdig-cloud
pass: sysdig-cloud-password
Metrics Available
See MySQL Metrics.
Result in the Monitor UI
Default Dashboard

Additional Views

8.6.2.17 -
NGINX and NGINX Plus
NGINX is open-source
software for web serving, reverse proxying, caching, load balancing,
media streaming, and more. It started out as a web server designed for
maximum performance and stability. In addition to its HTTP server
capabilities, NGINX can also function as a proxy server for email (IMAP,
POP3, and SMTP) and a reverse proxy and load balancer for HTTP, TCP, and
UDP servers.
NGINX Plus is a software load
balancer, web server, and content cache built on top of open source
NGINX. NGINX Plus has exclusive enterprise‑grade features beyond what’s
available in the open-source offering, including session persistence,
configuration via API, and active health checks.
The Sysdig agent has a default configuration to collect metrics for
open-source NGINX, provided that you have the HTTP stub status module
enabled. NGINX exposes basic metrics about server activity on a simple
status page with this status module. If NGINX Plus is installed, a wide
range of metrics is available with the NGINX Plus API.
This page describes the setup steps for NGINX/NGINX Plus, the default
configuration settings, how to edit the configuration to collect
additional information, the metrics available for integration, and
sample results in the Sysdig Monitor UI.
NGINX/NGINX Plus Setup
This section describes the configuration required on the NGINX server.
The Sysdig agent will not collect metrics until the required endpoint is
added to the NGINX configuration, using either the stub status module
(for open-source NGINX) or the NGINX Plus API (for NGINX Plus).
Configuration examples of each are provided below.
NGINX Stub Status Module Configuration
The ngx_http_stub_status_module
provides access to basic status
information. It is compiled by default on most distributions. If not, it
should be enabled with the --with-http_stub_status_module
configuration parameter.
To check if the module is already compiled, run the following
command:
nginx -V 2>&1 | grep -o with-http_stub_status_module
If with-http_stub_status_module
is listed, the status module is
enabled. (For more information, see
http://nginx.org/en/docs/http/ngx_http_stub_status_module.html.)
Update the NGINX configuration file with /nginx_status
endpoint as
follows. The default NGINX configuration file is present at
/etc/nginx/nginx.conf
or /etc/nginx/conf.d/default.conf.
# HTTP context
server {
...
# Enable NGINX status module
location /nginx_status {
# freely available with open source NGINX
stub_status;
access_log off;
# for open source NGINX < version 1.7.5
# stub_status on;
}
...
}
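Optionally, after reloading NGINX you can confirm that the status endpoint responds (this assumes NGINX listens locally on port 80):
nginx -s reload
curl http://localhost/nginx_status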
NGINX Plus API Configuration
When NGINX Plus is configured, the Plus API can be enabled by adding
/api
endpoint in the NGINX configuration file as follows.
The default NGINX configuration file is present at
/etc/nginx/nginx.conf
or /etc/nginx/conf.d/default.conf.
# HTTP context
server {
...
# Enable NGINX Plus API
location /api {
api write=on;
allow all;
}
...
}
Sysdig Agent Configuration
Configuration Examples:
Example 1 (Default): When only open-source NGINX is configured.
Example 2: When only an NGINX Plus node is configured.
Example 3: When NGINX and NGINX Plus are installed in different
containers on the same host.
The use_plus_api flag
is used to differentiate NGINX and
NGINX Plus metrics.
NGINX Plus metrics are differentiated with the prefix nginx.plus.*
When use_plus_api = true,
nginx_plus_api_url
is used to fetch NGINX Plus metrics from the
NGINX Plus node.
nginx_status_url
is used to fetch NGINX metrics from the NGINX
node (if a single host is running two separate containers for
NGINX and NGINX Plus).
Example 1: Default Configuration
With the default configuration, only NGINX metrics will be available
once the ngx_http_stub_status_module
is configured.
app_checks:
- name: nginx
check_module: nginx
pattern:
exe: "nginx: worker process"
conf:
nginx_status_url: "http://localhost:{port}/nginx_status"
log_errors: true
Example 2: NGINX Plus only
With this example, only NGINX Plus metrics will be available.
app_checks:
- name: nginx
check_module: nginx
pattern:
exe: "nginx: worker process"
conf:
nginx_plus_api_url: "http://localhost:{port}/api"
use_plus_api: true
user: admin
password: admin
log_errors: true
Example 3: NGINX and NGINX Plus
This is a special case where open-source NGINX and NGINX Plus are
installed on the same host but in different containers. With this
configuration, respective metrics will be available for NGINX and NGINX
Plus containers.
app_checks:
- name: nginx
check_module: nginx
pattern:
exe: "nginx: worker process"
conf:
nginx_plus_api_url: "http://localhost:{port}/api"
nginx_status_url: "http://localhost:{port}/nginx_status"
use_plus_api: true
user: admin
password: admin
log_errors: true
List of Metrics
NGINX (Open Source)
See NGINX Metrics.
NGINX Plus
See NGINX Plus Metrics.
Result in the Monitor UI

8.6.2.18 -
NTP
NTP stands for
Network Time Protocol. It is used to synchronize the time on your Linux
system with a centralized NTP server. A local NTP server on the network
can be synchronized with an external timing source to keep all the
servers in your organization in-sync with an accurate time.
If the NTP check is enabled in the Sysdig agent, it reports the time
offset of the local agent from an NTP server.
This page describes how to edit the configuration to collect
information, the metrics available for integration, and a sample result
in the Sysdig Monitor UI.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig's dragent.default.yaml
does not provide any
configuration for NTP.
Add the configuration in the example below to the dragent.yaml
file to enable
NTP
checks.
Never edit dragent.default.yaml
directly; always edit only
dragent.yaml
.
Example
- name: ntp
interval: 60
pattern:
comm: systemd
conf:
host: us.pool.ntp.org
offset_threshold: 60
host
: (mandatory) the host name of the NTP
server.
offset_threshold
: (optional) the difference (in seconds)
between the local clock and the NTP server beyond which the ntp.in_sync
service check becomes CRITICAL
. The default is 60
seconds.
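Optionally, you can confirm from the host that the configured NTP server is reachable; a minimal sketch using the ntpdate utility (assumed to be installed):
ntpdate -q us.pool.ntp.org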
Metrics Available
ntp.offset
, the time difference between the local clock and the NTP
reference clock, is the primary NTP metric.
See also NTP Metrics.
Service Checks
ntp.in_sync:
Returns CRITICAL
if the NTP offset is greater than the threshold
specified in dragent.yaml
, otherwise OK.
Result in the Monitor UI

8.6.2.19 -
PGBouncer
PgBouncer is a lightweight
connection pooler for PostgreSQL. If PgBouncer is installed on your
environment, you may need to edit the Sysdig agent configuration file to
connect. See the Default Configuration section, below.
This page describes the configuration settings, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
PgBouncer Setup
PgBouncer does not ship with a default stats user configuration. To
configure it, you need to add a user allowed to access PgBouncer stats.
Do so by adding the following line in pgbouncer.ini
. The default file
location is /etc/pgbouncer/pgbouncer.ini
stats_users = sysdig_cloud
For the same user you need the following entry in userlist.txt.
The
default file location is /etc/pgbouncer/userlist.txt
"sysdig_cloud" "sysdig_cloud_password"
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
No default configuration is present in Sysdig’s dragent.default.yaml
file for PgBouncer, as it requires a unique username and password. You
must add a custom entry in dragent.yaml
as follows:
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
app_checks:
- name: pgbouncer
pattern:
comm: pgbouncer
conf:
host: localhost # set if the bind ip is different
port: 6432 # set if the port is not the default
username: sysdig_cloud
password: sysdig_cloud_password #replace with appropriate password
Metrics Available
See PGBouncer Metrics.
Result in the Monitor UI

8.6.2.20 -
PHP-FPM
PHP-FPM (FastCGI Process Manager) is an
alternative PHP FastCGI implementation, with some additional features
useful for sites of any size, especially busier sites. If PHP-FPM is
installed on your environment, the Sysdig agent will automatically
connect. You may need to edit the default entries to connect if PHP-FPM
has a custom setting in its config file. See the Default Configuration
section, below.
The Sysdig agent automatically collects all metrics with default
configuration.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
PHP-FPM Setup
This check has a default configuration that should suit most use cases.
If it does not work for you, verify that you have added these lines to
your php-fpm.conf
file. The default location is /etc/
pm.status_path = /status
ping.path = /ping
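Optionally, once the web server in front of PHP-FPM has been reloaded, you can confirm that the endpoints respond (this assumes the status and ping paths are served through a local web server):
curl http://localhost/status
curl http://localhost/ping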
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with PHP-FPM and collect all metrics:
app_checks:
- name: php-fpm
check_module: php_fpm
retry: false
pattern:
exe: "php-fpm: master process"
If your PHP-FPM configuration in
php-fpm.conf
uses values other than those shown above, you can edit the Sysdig agent configuration in
dragent.yaml,
as shown in the example below.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
Replace the values of status_url
and ping_url
below with the values
set against pm.status_path
and ping.path
respectively in your
php-fpm.conf:
app_checks:
- name: php-fpm
check_module: php_fpm
pattern:
exe: "php-fpm: master process"
conf:
status_url: /mystatus
ping_url: /myping
ping_reply: mypingreply
Metrics Available
See PHP-FPM Metrics.
Result in the Monitor UI

8.6.2.21 -
PostgreSQL
PostgreSQL is a powerful, open-source,
object-relational database system that has earned a strong reputation
for reliability, feature robustness, and performance.
If PostgreSQL is installed in your environment, the Sysdig agent will
automatically connect in most cases. In some conditions, you may need to
create a specific user for Sysdig and edit the default entries to
connect.
See the Default Configuration section, below. The Sysdig agent
automatically collects all metrics with the default configuration when
correct credentials are provided.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
PostgreSQL Setup
PostgreSQL will be auto-discovered and the agent will connect through
the Unix socket using the Default Configuration with the
default postgres
user. If this does not work, you can create a
user for Sysdig Monitor and give it enough permissions to read Postgres
stats. To do this, execute the following example statements on your
server:
create user "sysdig-cloud" with password 'password';
grant SELECT ON pg_stat_database to "sysdig-cloud";
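Optionally, verify that the new user can read the statistics view (a sketch assuming a local connection and the credentials created above):
psql -h localhost -U sysdig-cloud -d postgres -c "SELECT datname, numbackends FROM pg_stat_database LIMIT 5;"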
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Postgres.
app_checks:
- name: postgres
pattern:
comm: postgres
port: 5432
conf:
unix_sock: "/var/run/postgresql/"
username: postgres
If a special user for Sysdig is created, then update the dragent.yaml
file
as shown in Example 1, below.
Never edit dragent.default.yaml
directly; always edit only
dragent.yaml
.
Example 1: Special User
Update the username and password created for the Sysdig agent in the
respective fields, as follows:
app_checks:
- name: postgres
pattern:
comm: postgres
port: 5432
conf:
username: sysdig-cloud
password: password
Example 2: Connecting on Unix Socket
If Postgres is listening on Unix socket /tmp/.s.PGSQL.5432
, set value
of unix_sock
to /tmp/
app_checks:
- name: postgres
pattern:
comm: postgres
port: 5432
conf:
unix_sock: "/tmp/"
username: postgres
Example 3: Relations
Lists of relations/tables can be specified to track per-relation
metrics.
A single relation can be specified in two ways: by exact name
(relation_name, optionally restricted to specific schemas) or by a regular
expression (relation_regex), as in the example below.
If schemas
are not provided, all schemas will be included. dbname
must
be provided if relations is specified.
app_checks:
- name: postgres
pattern:
comm: postgres
port: 5432
conf:
username: <username>
password: <password>
dbname: <user_db_name>
relations:
- relation_name: <table_name_1>
schemas:
- <schema_name_1>
- relation_regex: <table_pattern>
Example 4: Other Optional Parameters
app_checks:
- name: postgres
check_module: postgres
pattern:
comm: postgres
port: 5432
conf:
username: postgres
unix_sock: "/var/run/postgresql"
dbname: <user_db_name>
#collect_activity_metrics: true
#collect_default_database: true
#tag_replication_role: true
Optional Parameters
Parameter | Description | Default |
---|
collect_activity_metrics | When set to true, enables metrics from pg_stat_activity. New metrics added: postgresql.active_queries, postgresql.transactions.idle_in_transaction, postgresql.transactions.open, postgresql.waiting_queries | false |
collect_default_database | When set to true, collects statistics from the default database, which is postgres. All metrics from the postgres database will have the tag db:postgres | false |
tag_replication_role | When set to true, metrics and checks will be tagged with replication_role:<master|standby> | false |
Example 5: Custom Metrics Using Custom Queries
Personalized custom metrics can be collected from Postgres using custom
queries.
app_checks:
- name: postgres
pattern:
comm: postgres
port: 5432
conf:
unix_sock: "/var/run/postgresql/"
username: postgres
custom_queries:
- metric_prefix: postgresql.custom
query: <QUERY>
columns:
- name: <COLUMN_1_NAME>
type: <COLUMN_1_TYPE>
- name: <COLUMN_2_NAME>
type: <COLUMN_2_TYPE>
tags:
- <TAG_KEY>:<TAG_VALUE>
Option | Required | Description |
---|
metric_prefix | Yes | Each metric starts with the chosen prefix. |
query | Yes | This is the SQL to execute. It can be a simple statement or a multi-line script. All of the rows of the results are evaluated. Use the pipe if you require a multi-line script |
columns | Yes | This is a list representing each column ordered sequentially from left to right. The number of columns must equal the number of columns returned in the query. There are 2 required pieces of data:- name : This is the suffix to append to the metric_prefix to form the full metric name. If the type is specified as tag , the column is instead applied as a tag to every metric collected by this query.- type : This is the submission method (gauge, count, rate, etc.). This can also be set to ’tag’ to tag each metric in the row with the name and value of the item in this column |
tags | No | A list of tags to apply to each metric (as specified above). |
Metrics Available
See PostgreSQL Metrics.
Result in the Monitor UI
Default Dashboard
The default PostgreSQL dashboard includes combined metrics and
individual metrics in an overview page.

Other Views
You can also view individual metric charts from a drop-down menu in an
Explore view.

8.6.2.22 -
RabbitMQ
RabbitMQ is an open-source message-broker
software (sometimes called message-oriented middleware) that implements
Advanced Message Queuing Protocol (AMQP). The RabbitMQ server is written
in the Erlang language and is built on the Open Telecom Platform
framework for clustering and fail-over. Client libraries to interface
with the broker are available in all major programming languages. If
RabbitMQ is installed on your environment, the Sysdig agent will
automatically connect. See the Default Configuration section, below.
The Sysdig agent automatically collects all metrics with the default
configuration. You may need to edit the dragent.yaml
file if a metrics
limit is reached.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
RabbitMQ Setup
Enable the RabbitMQ management plugin. See RabbitMQ’s
documentation to enable it.
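On most installations the plugin can be enabled with the following command (see the RabbitMQ documentation for your version):
rabbitmq-plugins enable rabbitmq_management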
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with RabbitMQ and collect all metrics.
app_checks:
- name: rabbitmq
pattern:
port: 15672
conf:
rabbitmq_api_url: "http://localhost:15672/api/"
rabbitmq_user: guest
rabbitmq_pass: guest
The RabbitMQ app check tracks various entities, such as exchanges,
queues and nodes. Each of these entities has its maximum limits. If the
limit is reached, metrics can be controlled by editing the
dragent.yaml
file, as in the following examples.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Manage logging_interval
When a maximum limit is exceeded, the app check will log an info
message:
rabbitmq: Too many <entity type> (<number of entities>) to fetch and maximum limit is (<configured limit>). You must choose the <entity type> you are interested in by editing the dragent.yaml configuration file
This message is suppressed by a configuration parameter,
logging_interval
.
Its default value is 300 seconds. This can be altered by specifying a
different value in dragent.yaml
.
app_checks:
- name: rabbitmq
pattern:
port: 15672
conf:
rabbitmq_api_url: "http://localhost:15672/api/"
rabbitmq_user: guest
rabbitmq_pass: guest
logging_interval: 10 # Value in seconds. Default is 300
Example 2: Specify Nodes, Queues, or Exchanges
Each of the tracked RabbitMQ entities has its maximum limits. As of
Agent v10.5.1, the default limits are as follows:
Exchanges: 16 per-exchange metrics
Queues: 20 per-queue metrics
Nodes: 9 per-node metrics
The max_detailed_*
settings for the RabbitMQ app check do not limit
the reported number of queues, exchanges, and nodes, but the number of
metrics generated for those objects. For example, a single queue might
report up to 20 metrics; therefore, set max_detailed_queues
to 20
times the actual number of queues.
The metrics for these entities are tagged. If any of these entities are
present but no transactions have occurred for them, the metrics are
still reported with 0 values, though without tags. Therefore, when
segmenting these metrics, the tags will show as unset
in the Sysdig
Monitor Explore view. However, all such entities are still counted
against the maximum limits. In such a scenario, you can specify the
entity names for which you want to collect metrics in the dragent.yaml
file.
app_checks:
- name: rabbitmq
pattern:
port: 15672
conf:
rabbitmq_api_url: "http://localhost:15672/api/"
rabbitmq_user: guest
rabbitmq_pass: guest
tags: ["queues:<queuename>"]
nodes:
- rabbit@localhost
- rabbit2@domain
nodes_regexes:
- bla.*
queues:
- queue1
- queue2
queues_regexes:
- thisqueue-.*
- another_\d+queue
exchanges:
- exchange1
- exchange2
exchanges_regexes:
- exchange*
Names can be specified by exact name or regular expression.
Example 3: Optional Tags
Optional tags can be applied to every emitted metric, service check,
and/or event, as in the following example.
app_checks:
- name: rabbitmq
pattern:
port: 15672
conf:
rabbitmq_api_url: "http://localhost:15672/api/"
rabbitmq_user: guest
rabbitmq_pass: guest
tags: ["some_tag:some_value"]
Example 4: filter_by_node
Use filter_by_node: true
if you want each node to report information
localized to the node. Without this option, each node reports
cluster-wide info (as presented by RabbitMQ itself). This option makes
it easier to view the metrics in the UI by removing redundant
information reported by individual nodes.
Default: false
.
Prerequisite: Sysdig agent v. 92.3 or higher.
app_checks:
- name: rabbitmq
pattern:
port: 15672
conf:
rabbitmq_api_url: "http://localhost:15672/api/"
rabbitmq_user: guest
rabbitmq_pass: guest
filter_by_node: true
Metrics Available
See RabbitMQ Metrics.
Result in the Monitor UI

8.6.2.23 -
RedisDB
Redis is an open-source (BSD licensed), in-memory
data structure store, used as a database, cache, and message broker. If
Redis is installed in your environment, the Sysdig agent will
automatically connect in most cases. You may need to edit the default
entries to get additional metrics. See the Default Configuration
section, below.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Application Setup
Redis will automatically expose all metrics. You do not need to
configure anything in the Redis instance.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Redis and collect basic metrics:
app_checks:
- name: redis
check_module: redisdb
pattern:
comm: redis-server
conf:
host: 127.0.0.1
port: "{port}"
Some additional metrics can be collected by editing the configuration
file as shown in following examples. The options shown in Example 2 are
relevant if Redis requires authentication or if a Unix socket is used.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Key Lengths
The following example entry results in the metric redis.key.length
in
the Sysdig Monitor UI, displaying the length of specific keys (segmented
by: key
). To enable, provide the key names in dragent.yaml
as
follows.
Note that length is 0 (zero) for keys that have a type other than
list, set, hash,
or sorted set.
Keys can be expressed as patterns;
see https://redis.io/commands/keys.
Sample entry in dragent.yaml
:
app_checks:
- name: redis
check_module: redisdb
pattern:
comm: redis-server
conf:
host: 127.0.0.1
port: "{port}"
keys:
- "list_1"
- "list_9*"
Example 2: Additional Configuration Options
app_checks:
- name: redis
check_module: redisdb
pattern:
comm: redis-server
conf:
host: 127.0.0.1
port: "{port}"
# unix_socket_path: /var/run/redis/redis.sock # can be used in lieu of host/port
# password: mypassword # if your Redis requires auth
Example 3: COMMANDSTATS Metrics
You can also collect the INFO COMMANDSTATS
result as metrics
(redis.command.*
). This works with Redis >= 2.6.
Sample implementation:
app_checks:
- name: redis
check_module: redisdb
pattern:
comm: redis-server
conf:
host: 127.0.0.1
port: "{port}"
command_stats: true
Metrics Available
See RedisDB Metrics.
Result in the Monitor UI

8.6.2.24 -
SNMP
Simple Network Management Protocol
(SNMP)
is an application-layer protocol used to manage and monitor network
devices and their functions. The Sysdig agent can connect to network
devices and collect metrics using SNMP.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
SNMP Overview
Simple Network Management Protocol
(SNMP)
is an Internet Standard protocol for collecting and configuring
information about devices on a network. Network devices include
physical devices such as switches, routers, and servers.
SNMP has three primary versions (SNMPv1, SNMPv2c, and SNMPv3);
SNMPv2c is the most widely used.
SNMP allows device vendors to expose management data in the form of
variables on managed systems organized in a management information base
(MIB), which describe the system status and configuration. The devices
can be queried as well as configured remotely using these variables.
Certain MIBs are generic and supported by the majority of the device
vendors. Additionally, each vendor can have their own private/enterprise
MIBs for vendor-specific information.
SNMP MIB is a collection of objects uniquely identified by an Object
Identifier (OID). OIDs are represented in the form x.0, where x is
the name of the object in the MIB definition.
For example, suppose one wanted to identify an instance of the variable sysDescr.
The object class for sysDescr is:
iso(1) org(3) dod(6) internet(1) mgmt(2) mib(1) system(1) sysDescr(1)
Hence, the object type, x, would be 1.3.6.1.2.1.1.1.
SNMP Agent Configuration
To monitor the servers with the Sysdig agent, the SNMP agent must be
installed on the servers to query the system information.
For Ubuntu-based servers, use the following commands to install the SNMP
Daemon:
$sudo apt-get update
$sudo apt-get install snmpd
Next, configure this SNMP agent to respond to queries from the SNMP
manager by updating the configuration file located at
/etc/snmp/snmpd.conf
Below are the important fields that must be configured:
snmpd.conf
# Listen for connections on all interfaces (both IPv4 *and* IPv6)
agentAddress udp:161,udp6:[::1]:161
## ACCESS CONTROL
## system + hrSystem groups only
view systemonly included .1.3.6.1.2.1.1
view systemonly included .1.3.6.1.2.1.25.1
view systemonly included .1.3.6.1.2.1.31.1
view systemonly included .1.3.6.1.2.1.2.2.1.1
# Default access to basic system info
rocommunity public default -V systemonly
# rocommunity6 is for IPv6
rocommunity6 public default -V systemonly
After making changes to the config file, restart the snmpd
service
using:
$sudo service snmpd restart
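Optionally, verify that the SNMP daemon answers queries for the configured community string (a sketch assuming the SNMP client tools are installed locally):
snmpwalk -v2c -c public localhost 1.3.6.1.2.1.1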
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
No default configuration is present for SNMP check.
You must specify the OID/MIB for every parameter you want to
collect, as in the following example.
The OIDs configured in dragent.yaml
are included in the
snmpd.conf
configuration under the ‘ACCESS CONTROL’ section.
Ensure that the community_string
is the same as configured in the
system configuration (rocommunity
).
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
app_checks:
- name: snmp
pattern:
comm: python
arg: /opt/draios/bin/sdchecks
interval: 30
conf:
mibs_folder: /usr/share/mibs/ietf/
ip_address: 52.53.158.103
port: 161
community_string: public
# Only required for snmp v1, will default to 2
# snmp_version: 2
# Optional tags can be set with each metric
tags:
- vendor:EMC
- array:VNX5300
- location:front
metrics:
- OID: 1.3.6.1.2.1.25.2.3.1.5
name: hrStorageSize
- OID: 1.3.6.1.2.1.1.7
name: sysServices
- MIB: TCP-MIB
symbol: tcpActiveOpens
- MIB: UDP-MIB
symbol: udpInDatagrams
- MIB: IP-MIB
table: ipSystemStatsTable
symbols:
- ipSystemStatsInReceives
metric_tags:
- tag: ipversion
index: 1 # specify which index you want to read the tag value from
- MIB: IF-MIB
table: ifTable
symbols:
- ifInOctets
- ifOutOctets
metric_tags:
- tag: interface
column: ifDescr # specify which column to read the tag value from
The Sysdig agent allows you to monitor the SNMP counters and gauge of
your choice. For each device, specify the metrics that you want to
monitor in the metrics
subsection using one of the following methods:
Specify a MIB and the symbol that you want to export
metrics:
- MIB: UDP-MIB
symbol: udpInDatagrams
Specify an OID and the name you want the metric to appear under in
Sysdig Monitor:
metrics:
- OID: 1.3.6.1.2.1.6.5
name: tcpActiveOpens
#The name here is the one specified in the MIB but you could use any name.
Specify an MIB and a table from which to extract information:
metrics:
- MIB: IF-MIB
table: ifTable
symbols:
- ifInOctets
metric_tags:
- tag: interface
column: ifDescr
Metrics Available
The SNMP check does not have default metrics. All metrics mentioned in
the dragent.yaml
file will be seen with the snmp.*
prefix.
Result in the Monitor UI

8.6.2.25 -
Supervisord
Supervisor daemon is a client/server system
that allows its users to monitor and control a number of processes on
UNIX-like operating systems. The Supervisor check monitors the uptime,
status, and number of processes running under Supervisord.
No default configuration is provided for the Supervisor check; you must
provide the configuration in the dragent.yaml
file for the Sysdig
agent to collect the data provided by Supervisor.
This page describes the setup steps required on Supervisor, how to edit
the Sysdig agent configuration to collect additional information, the
metrics available for integration, and a sample result in the Sysdig
Monitor UI.
Supervisor Setup
Configuration
The Sysdig agent can collect data from Supervisor via HTTP server or
UNIX socket. The agent collects the same data regardless of the
configured collection method.
Un-comment the following or add them if they are not present in
/etc/supervisor/supervisord.conf
[inet_http_server]
port=localhost:9001
username=user # optional
password=pass # optional
...
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock
...
[unix_http_server]
file=/tmp/supervisor.sock
chmod=777 # make sure chmod is set so that non-root users can read the socket.
...
[program:foo]
command=/bin/cat
The programs controlled by Supervisor are given by different [program]
sections in the configuration. Each program you want to manage by
Supervisor must be specified in the Supervisor configuration file, with
its supported options in the [program]
section. See Supervisor’s
sample.conf
file for details.
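Optionally, confirm that Supervisor is reachable over the configured socket or HTTP server (the paths and credentials below match the sample configuration above):
supervisorctl -s unix:///tmp/supervisor.sock status
supervisorctl -s http://localhost:9001 -u user -p pass status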
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
does not have any
configuration to connect the agent with Supervisor. Edit dragent.yaml
following the Examples given to connect with Supervisor and collect
supervisor.*
metrics.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1: Connect by UNIX Socket
- name: supervisord
pattern:
comm: supervisord
conf:
socket: "unix:///tmp/supervisor.sock"
Example 2: Connect by Host Name and Port, Optional Authentication
- name: supervisord
pattern:
comm: supervisord
conf:
host: localhost
port: 9001
# user: user # Optional. Required only if a username is configured.
# pass: pass # Optional. Required only if a password is configured.
Metrics Available
supervisord.process.count (gauge) | The number of supervisord monitored processes shown as process |
supervisord.process.uptime (gauge) | The process uptime shown as second |
See also Supervisord
Metrics.
Service Check
supervisord.can.connect:
Returns CRITICAL
if the Sysdig agent cannot connect to the HTTP server
or UNIX socket configured, otherwise OK.
supervisord.process.status:
SUPERVISORD STATUS | SUPERVISORD.PROCESS.STATUS |
---|
STOPPED | CRITICAL |
STARTING | UNKNOWN |
RUNNING | OK |
BACKOFF | CRITICAL |
STOPPING | CRITICAL |
EXITED | CRITICAL |
FATAL | CRITICAL |
UNKNOWN | UNKNOWN |
Result in the Monitor UI

8.6.2.26 -
TCP
You can monitor the status of your custom application’s port using the
TCP check. This check will routinely connect to the designated port and
send Sysdig Monitor a simple on/off metric and response time.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
TCP Application Setup
Any application listening on a TCP port can be monitored with
tcp_check
.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
No default configuration is provided in the default settings file; you
must add the entries in the example below to the user settings config file
dragent.yaml.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example
- name: tcp_check
check_module: tcp_check
pattern:
comm: httpd
arg: DFOREGROUND
conf:
port: 80
collect_response_time: true
This example shows monitoring a TCP check on an Apache process running
on the host on port 80.
comm:
matches the command name of the Apache server process (httpd).
If you want the response time for your port, meaning the amount of time
the process takes to accept the connection, you can add the
collect_response_time: true
parameter under the conf:
section and the additional metric network.tcp.response_time
will
appear in the Metrics list.
Do not use port:
under the pattern
: section in this case,
because if the process is not listening it will not be matched and the
metric will not be sent to Sysdig Monitor.
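Before enabling the check, you can optionally confirm from the host that the port accepts connections; a minimal sketch using netcat (assumed to be installed):
nc -zv localhost 80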
Metrics Available
network.tcp.response_time (gauge) | The response time of a given host and TCP port, tagged with url, e.g. 'url:192.168.1.100:22'. shown as second |
See TCP Metrics.
Service Checks
tcp.can_connect
:
DOWN if the agent
cannot connect to the configured host and port,
otherwise UP.
Result in the Monitor UI

8.6.2.27 -
Varnish
Varnish HTTP Cache is a web application
accelerator, also known as a “caching HTTP reverse proxy.” You install
it in front of any server that speaks HTTP and configure it to cache the
contents. If Varnish is installed on your environment, the Sysdig agent
will automatically connect. See the Default Configuration section,
below.
The Sysdig Agent automatically collects all metrics. You can also edit
the configuration to emit service checks for the back end.
This page describes the default configuration settings, how to edit the
configuration to collect additional information, the metrics available
for integration, and a sample result in the Sysdig Monitor UI.
Varnish Setup
Varnish will automatically expose all metrics. You do not need to add
anything to the Varnish instance.
Sysdig Agent Configuration
Review how to Edit dragent.yaml to Integrate or Modify Application
Checks.
Default Configuration
By default, Sysdig’s dragent.default.yaml
uses the following code to
connect with Varnish and collect all but the VBE metrics. See Example 2
Enable Varnish VBE
Metrics.
metrics_filter:
- exclude: varnish.VBE.*
app_checks:
- name: varnish
interval: 15
pattern:
comm: varnishd
conf:
varnishstat: /usr/bin/varnishstat
Optionally, if you want to submit service checks for the health of each
back end, you can configure varnishadm
and edit dragent.yaml
as in
Example 1.
Remember! Never edit dragent.default.yaml
directly; always edit
only dragent.yaml
.
Example 1 Service Health Checks with varnishadm
When varnishadm
is configured, the Sysdig agent must be able to
execute the binary with root privileges. Add the following to your
/etc/sudoers
file:
sysdig-agent ALL=(ALL) NOPASSWD:/usr/bin/varnishadm
Then edit dragent.yaml
as follows. Note: If you have configured
varnishadm
and your secret file is NOT /etc/varnish/secret
, you can
comment out secretfile.
app_checks:
- name: varnish
interval: 15
pattern:
comm: varnishd
conf:
varnishstat: /usr/bin/varnishstat
varnishadm: /usr/bin/varnishadm
secretfile: /etc/varnish/secret
This example will enable the following service check.
varnish.backend_healthy
: The agent submits a service check for each
Varnish backend, tagging each with backend:<backend_name>
.
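Optionally, you can confirm that the sudoers entry works as intended; a sketch assuming the agent runs as the sysdig-agent user shown above:
sudo -u sysdig-agent sudo -n /usr/bin/varnishadm backend.list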
Example 2 Enable Varnish VBE Metrics
Varnish VBE metrics are dynamically generated (and therefore are not
listed in the Metrics
Dictionary). Because they
generate unique metric names with timestamps, they can clutter metric
handling and are filtered out by default. If you want to collect these
metrics, use include
in the metrics_filter
in dragent.yaml
:
metrics_filter:
- include: varnish.VBE.*
app_checks:
- name: varnish
interval: 15
pattern:
comm: varnishd
conf:
varnishstat: /usr/bin/varnishstat
Metrics Available
See Varnish Metrics.
Result in the Monitor UI

8.6.3 -
(Legacy) Create a Custom App Check
Application checks are integrations that allow the Sysdig agent to poll
specific metrics exposed by any application, and the built-in app checks
currently supported are listed on the App Checks main
page. Many other Java-based
applications are also supported out-of-the-box.
If your application is not already supported though, you have a few
options:
Utilize Prometheus, StatsD, or JMX to collect custom metrics.
Send a request to support@sysdig.com, and we’ll do our best to add
support for your application.
Create your own check by following the instructions below.
If you do write a custom check, let us know. We love hearing about how
our users extend Sysdig Monitor, and we can also consider embedding your
app check automatically in the Sysdig agent.
See also Understanding the Agent Config
Files for details on
accessing and editing the agent configuration files in general.
Check Anatomy
Essentially, an app check is a Python Class that extends
AgentCheck
:
from checks import AgentCheck
class MyCustomCheck(AgentCheck):
# namespaces of the monitored process to join
# right now we support 'net', 'mnt' and 'uts'
# put there the minimum necessary namespaces to join
# usually 'net' is enough. In this case you can also omit the variable
# NEEDED_NS = ( 'net', )
# def __init__(self, name, init_config, agentConfig):
# '''
# Optional, define it if you need custom initialization
# remember to accept these parameters and pass them to the superclass
# '''
# AgentCheck.__init__(self, name, init_config, agentConfig)
# self.myvar = None
def check(self, instance):
'''
This function gets called to perform the check.
Connect to the application, parse the metrics and add them to aggregation using
superclass methods like `self.gauge(metricname, value, tags)`
'''
server_port = instance['port']
self.gauge("testmetric", 1)
Put this file into /opt/draios/lib/python/checks.custom.d
(create
the directory if not present) and it will be available to the Sysdig
agent. To run your checks, you need to supply configuration information
in the agent’s config file, dragent.yaml
as is done with bundled
checks:
app_checks:
- name: voltdb # check name, must be unique
# name of your .py file, if it's the same of the check name you can omit it
# check_module: voltdb
pattern: # pattern to match the application
comm: java
arg: org.voltdb.VoltDB
conf:
port: 21212 # any key value config you need on `check(self, instance_conf)` function
Check Interface Detail
As you can see, the most important piece of the check interface is the
check function. The function declaration is:
def check(self, instance)
instance
is a dict containing the configuration of the check. It
will contain all the attributes found in the conf:
section in
dragent.yaml
plus the following:
name
: The check unique name.
ports
: An array of all listening ports of the process.
port
: The first listening port of the process.
These attributes are available as defaults and allow you to
automatically configure your check. The conf:
section has higher
priority than these values.
Inside the check function you can call these methods to send metrics:
self.gauge(metric_name, value, tags) # Sample a gauge metric
self.rate(metric_name, value, tags) # Sample a point, with the rate calculated at the end of the check
self.increment(metric_name, value, tags) # Increment a counter metric
self.decrement(metric_name, value, tags) # Decrement a counter metric
self.histogram(metric_name, value, tags) # Sample a histogram metric
self.count(metric_name, value, tags) # Sample a raw count metric
self.monotonic_count(metric_name, value, tags) # Sample an increasing counter metric
Usually the most used are gauge
and rate
. Besides
metric_name
and value
parameters that are quite obvious, you
can also add tags
to your metric using this format:
tags = [ "key:value", "key2:value2", "key_without_value"]
It is an array of strings representing tags, in either single-value or key/value
form. They will be useful in Sysdig Monitor for graph segmentation.
You can also send service checks which are on/off metrics, using this
interface:
self.service_check(name, status, tags)
Where status can be:
AgentCheck.OK
AgentCheck.WARNING
AgentCheck.CRITICAL
AgentCheck.UNKNOWN
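Putting these pieces together, a minimal check could report a gauge with
tags and a service check. The sketch below is illustrative only: the class
name, metric names, and the socket-based connectivity test are examples,
not part of the agent API.
import socket
from checks import AgentCheck

class MyAppCheck(AgentCheck):
    def check(self, instance):
        port = instance['port']
        tags = ["check_port:%d" % port, "custom_check"]
        try:
            # Verify that the application is accepting connections.
            sock = socket.create_connection(("127.0.0.1", port), timeout=5)
            sock.close()
            self.gauge("myapp.up", 1, tags)
            self.service_check("myapp.can_connect", AgentCheck.OK, tags)
        except socket.error:
            self.gauge("myapp.up", 0, tags)
            self.service_check("myapp.can_connect", AgentCheck.CRITICAL, tags)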
Testing
To test your check you can launch Sysdig App Checks from the command
line to avoid running the full agent and iterate faster:
# from /opt/draios directory
./bin/sdchecks runCheck <check_unique_name> <process_pid> [<process_vpid>] [<process_port>]
check_unique_name
: The check name as it appears in the config file.
pid
: The process PID as seen from the host.
vpid
: Optional. The process PID as seen inside the container;
defaults to 1.
port
: Optional. The port where the process is listening; defaults
to None.
Example:
./bin/sdchecks runCheck redis 1254 1 6379
5658:INFO:Starting
5658:INFO:Container support: True
5658:INFO:Run AppCheck for {'ports': [6379], 'pid': 5625, 'check': 'redis', 'vpid': 1}
Conf: {'port': 6379, 'socket_timeout': 5, 'host': '127.0.0.1', 'name': 'redis', 'ports': [6379]}
Metrics: # metrics array
Checks: # service checks array
Exception: None # exceptions
The output is intentionally raw to allow you to better debug what the
check is doing.
8.6.4 -
(Legacy) Create Per-Container Custom App Checks
Sysdig supports adding custom application check-script configurations
for each individual container in the infrastructure. This avoids
multiple edits and entries to achieve container specific customization.
In particular, this enables PaaS environments to work smarter by
delegating check configuration to the application teams.
See also Understanding the Agent Config
Files for details on
accessing and editing the agent configuration files in general.
How It Works
The SYSDIG_AGENT_CONF variable stores a YAML-formatted configuration
for your app check and will be used to match app check configurations.
All original app_checks are
available, and the syntax is the same as for dragent.yaml
. You can add
the environment variable directly to the Dockerfile.
Example with Dockerfile
This example defines a per-container app check for Redis. Normally you
would have a YAML-formatted entry installed into the agent’s
/opt/draios/etc/dragent.yaml
file that would look like this:
app_checks:
  - name: redis
    check_module: redisdb
    pattern:
      comm: redis-server
    conf:
      host: 127.0.0.1
      port: "{port}"
      password: protected
For the per-container method, convert the above entry and add it to the
Dockerfile via the SYSDIG_AGENT_CONF environment variable:
FROM redis
# This config file adds a password for accessing redis instance
ADD redis.conf /
ENV SYSDIG_AGENT_CONF { "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }
ENTRYPOINT ["redis-server"]
CMD [ "/redis.conf" ]
Example with Docker CLI
You can add the variable when starting a container with
docker run, using the -e/--env flag, or inject it using orchestration
systems like Kubernetes:
PER_CONTAINER_CONF='{ "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }'
docker run --name redis -v /tmp/redis.conf:/etc/redis.conf -e SYSDIG_AGENT_CONF="${PER_CONTAINER_CONF}" -d redis /etc/redis.conf
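The same configuration could also be injected through a Kubernetes pod
spec using a standard environment variable entry. The snippet below is an
illustrative sketch; the pod name and image are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: redis
spec:
  containers:
    - name: redis
      image: redis
      env:
        - name: SYSDIG_AGENT_CONF
          value: '{ "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }'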
9 -
Captures
Sysdig capture files contain system calls and other OS events that can
be analyzed with either the open-source sysdig
or csysdig
(curses-based) utilities, and are displayed in the Captures module.
The Captures module contains a table listing the capture file name, the
host it was retrieved from, the time frame, and the size of the capture.
When the capture file status is uploaded, the file has been successfully
transmitted from the Sysdig agent to the storage bucket, and is
available for download and analysis.
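For example, once a capture file has been downloaded, it can be opened
locally with either open-source tool (the file name below is illustrative):
sysdig -r mycapture.scap    # replay the capture with the sysdig CLI
csysdig -r mycapture.scap   # browse the same capture with the curses UI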
Store Capture Files
Sysdig capture files are stored in Sysdig’s AWS S3 storage (for SaaS
environments), or in the Cassandra DB (for on-premises environments) by
default.
Learn more about creating, configuring, and analyzing capture files:
9.1 -
Create a Capture File From an Alert
While configuring your alert in the Act
section, toggle on Activate Sysdig Capture and define the following parameters:
Parameter | Description |
---|
Storage | The storage location for the capture files. The default storage location is the Sysdig Cloud Amazon S3 bucket. To configure a custom S3 storage bucket, refer to Configure AWS Capture File Storage. |
File Name | The name of the capture file. The default name includes the date and time stamp the capture was created. |
Time frame | The period of time captured. The default time is 15 seconds; the maximum capture time available is 24 hours. The capture file size limit is 100MB. The capture time starts from the time the alert threshold was breached (it does not capture syscalls from before the alert was triggered) Note: Sysdig recommends using the default time to ensure captures are small and manageable. |
Filter | Restricts the amount of trace information collected. For more information, including examples of available filters, refer to the Sysdig Github page. |
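For example, a filter can limit the capture to a single container or
process using standard sysdig filter syntax (the container name below is a
placeholder):
container.name=my-redis and proc.name=redis-server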
Create a Capture File Manually
To create a capture file:
From the Explore module, select a host or container.
Click the Key Page Action drop-down menu, and select
Sysdig Capture
.

The Sysdig Capture pop-up window will open.
Define the following parameters, and click the Start Capture
button:
Parameter | Description |
---|
Storage | The storage location for the capture files. The default storage location is the Sysdig Cloud Amazon S3 bucket. To configure a custom S3 storage bucket, refer to Configure AWS Capture File Storage. |
Capture path and name | The name of the capture file. The default name includes the date and time stamp the capture was created. |
Time frame | The period of time captured. The default time is 15 seconds; the maximum capture time available is 24 hours. The capture file size limit is 100MB. Note: Sysdig recommends using the default time to ensure captures are small and manageable. |
Filter | Restricts the amount of trace information collected. For more information, including examples of available filters, refer to the Sysdig Github page. |
The Sysdig agent will be signaled to start a capture, and send back the
resulting trace file. The file will then be displayed in the Captures
module.
Download a Capture File
To download a capture file:
From the Captures
module, navigate to the target capture file.
Select the target capture file.
Click the Download button. A capture file will be automatically
downloaded to your local machine.
Delete Capture Files
To delete a single capture file:
From the Captures
module, select the capture file to be deleted.
Click the Delete
button at the bottom of the Captures
module:

On the Keep File prompt, click the Delete
button to confirm,
or the Keep File
button to cancel.
To delete all capture files:
From the Captures
module, click the Delete All
button:

Click the Yes, Delete Captures
button to confirm, or the
Cancel button.
9.2 -
Review a Capture File
Explore a Capture File
From the Captures
module, navigate to the target capture file.
Select the target capture file. You will see some action buttons at
the bottom of the interface.
Click the Explore button. You will be directed to the Explore tab
view of the capture.
Inspect a Capture File
From the Captures
module, navigate to the target capture file.
Select the target capture file. You will see some action buttons at
the bottom of the interface.
Click the Inspect button. You will be directed to the Sysdig
Inspect page of the capture.
10 -
Metrics Dictionary
The Sysdig metrics dictionary lists all the metrics, both in Sysdig legacy and Prometheus-compatible notation,
supported by the Sysdig product suite, as well as kube state and cloud
provider metrics. The Metrics Dictionary is a living document and is
updated as new metrics are added to the product.
10.1 -
Metrics and Label Mapping
This topic outlines the mapping between the metrics and label naming conventions in the Sysdig
legacy datastore and the new Sysdig datastore.
10.1.1 -
Mapping Classic Metrics with Context-Specific PromQL Metrics
Sysdig classic metrics such as cpu.used.percent
previously returned values from a process, container, or host depending on the query segmentation or scope. You can now use context-explicit metrics, which align with the flat model and resource-specific semantics of the Prometheus naming schema.
Your existing dashboards and alerts will be automatically migrated to the new naming convention.
Sysdig Classic Metrics | Context-Specific Metrics in Prometheus Notation |
---|
cpu.cores.used | sysdig_container_cpu_cores_used sysdig_host_cpu_cores_used sysdig_program_cpu_cores_used |
cpu.cores.used.percent | sysdig_container_cpu_cores_used_percent sysdig_host_cpu_cores_used_percent sysdig_program_cpu_cores_used_percent |
cpu.used.percent | sysdig_container_cpu_used_percent sysdig_host_cpu_used_percent sysdig_program_cpu_used_percent |
fd.used.percent | sysdig_container_fd_used_percent sysdig_host_fd_used_percent sysdig_program_fd_used_percent |
file.bytes.in | sysdig_container_file_in_bytes sysdig_host_file_in_bytes sysdig_program_file_in_bytes |
file.bytes.out | sysdig_container_file_out_bytes sysdig_host_file_out_bytes sysdig_program_file_out_bytes |
file.bytes.total | sysdig_container_file_total_bytes sysdig_host_file_total_bytes sysdig_program_file_total_bytes |
file.error.open.count | sysdig_container_file_error_open_count sysdig_host_file_error_open_count sysdig_program_file_error_open_count |
file.error.total.count | sysdig_container_file_error_total_count sysdig_host_file_error_total_count sysdig_program_file_error_total_count |
file.iops.in | sysdig_container_file_in_iops sysdig_host_file_in_iops sysdig_program_file_in_iops |
file.iops.out | sysdig_container_file_out_iops sysdig_host_file_out_iops sysdig_program_file_out_iops |
file.iops.total | sysdig_container_file_total_iops sysdig_host_file_total_iops sysdig_program_file_total_iops |
file.open.count | sysdig_container_file_open_count sysdig_host_file_open_count sysdig_program_file_open_count |
file.time.in | sysdig_container_file_in_time sysdig_host_file_in_time sysdig_program_file_in_time |
file.time.out | sysdig_container_file_out_time sysdig_host_file_out_time sysdig_program_file_out_time |
file.time.total | sysdig_container_file_total_time sysdig_host_file_total_time sysdig_program_file_total_time |
fs.bytes.free | sysdig_container_fs_free_bytes sysdig_fs_free_bytes sysdig_host_fs_free_bytes |
fs.bytes.total | sysdig_container_fs_total_bytes sysdig_fs_total_bytes sysdig_host_fs_total_bytes |
fs.bytes.used | sysdig_container_fs_used_bytes sysdig_fs_used_bytes sysdig_host_fs_used_bytes |
fs.free.percent | sysdig_container_fs_free_percent sysdig_fs_free_percent sysdig_host_fs_free_percent |
fs.inodes.total.count | sysdig_container_fs_inodes_total_count sysdig_fs_inodes_total_count sysdig_host_fs_inodes_total_count |
fs.inodes.used.count | sysdig_container_fs_inodes_used_count sysdig_fs_inodes_used_count sysdig_host_fs_inodes_used_count |
fs.inodes.used.percent | sysdig_container_fs_inodes_used_percent sysdig_fs_inodes_used_percent sysdig_host_fs_inodes_used_percent |
fs.largest.used.percent | sysdig_container_fs_largest_used_percent sysdig_host_fs_largest_used_percent |
fs.root.used.percent | sysdig_container_fs_root_used_percent sysdig_host_fs_root_used_percent |
fs.used.percent | sysdig_container_fs_used_percent sysdig_fs_used_percent sysdig_host_fs_used_percent |
host.error.count | sysdig_container_syscall_error_count sysdig_host_syscall_error_count |
info | sysdig_agent_info sysdig_container_info sysdig_host_info |
memory.bytes.total | sysdig_host_memory_total_bytes |
memory.bytes.used | sysdig_container_memory_used_bytes sysdig_host_memory_used_bytes sysdig_program_memory_used_bytes |
memory.bytes.virtual | sysdig_container_memory_virtual_bytes sysdig_host_memory_virtual_bytes |
memory.swap.bytes.used | sysdig_container_memory_swap_used_bytes sysdig_host_memory_swap_used_bytes |
memory.used.percent | sysdig_container_memory_used_percent sysdig_host_memory_used_percent |
net.bytes.in | sysdig_connection_net_in_bytes sysdig_container_net_in_bytes sysdig_host_net_in_bytes sysdig_program_net_in_bytes |
net.bytes.out | sysdig_connection_net_out_bytes sysdig_container_net_out_bytes sysdig_host_net_out_bytes sysdig_program_net_out_bytes |
net.bytes.total | sysdig_connection_net_total_bytes sysdig_container_net_total_bytes sysdig_host_net_total_bytes sysdig_program_net_total_bytes |
net.connection.count.in | sysdig_connection_net_connection_in_count sysdig_container_net_connection_in_count sysdig_host_net_connection_in_count sysdig_program_net_connection_in_count |
net.connection.count.out | sysdig_connection_net_connection_out_count sysdig_container_net_connection_out_count sysdig_host_net_connection_out_count sysdig_program_net_connection_out_count |
net.connection.count.total | sysdig_connection_net_connection_total_count sysdig_container_net_connection_total_count sysdig_host_net_connection_total_count sysdig_program_net_connection_total_count |
net.request.count | sysdig_connection_net_request_count sysdig_container_net_request_count sysdig_host_net_request_count sysdig_program_net_request_count |
net.error.count | sysdig_container_net_error_count sysdig_host_net_error_count sysdig_program_net_error_count |
net.request.count.in | sysdig_connection_net_request_in_count sysdig_container_net_request_in_count sysdig_host_net_request_in_count sysdig_program_net_request_in_count |
net.request.count.out | sysdig_connection_net_request_out_count sysdig_container_net_request_out_count sysdig_host_net_request_out_count sysdig_program_net_request_out_count |
net.request.time | sysdig_connection_net_request_time sysdig_container_net_request_time sysdig_host_net_request_time sysdig_program_net_request_time |
net.request.time.in | sysdig_connection_net_request_in_time sysdig_container_net_request_in_time sysdig_host_net_request_in_time sysdig_program_net_request_in_time |
net.request.time.out | sysdig_connection_net_request_out_time sysdig_container_net_request_out_time sysdig_host_net_request_out_time sysdig_program_net_request_out_time |
net.server.bytes.in | sysdig_container_net_server_in_bytes sysdig_host_net_server_in_bytes |
net.server.bytes.out | sysdig_container_net_server_out_bytes sysdig_host_net_server_out_bytes |
net.server.bytes.total | sysdig_container_net_server_total_bytes sysdig_host_net_server_total_bytes |
net.sql.error.count | sysdig_container_net_sql_error_count sysdig_host_net_sql_error_count |
net.sql.request.count | sysdig_container_net_sql_request_count sysdig_host_net_sql_request_count |
net.tcp.queue.len | sysdig_container_net_tcp_queue_len sysdig_host_net_tcp_queue_len sysdig_program_net_tcp_queue_len |
proc.count | sysdig_container_proc_count sysdig_host_proc_count sysdig_program_proc_count |
thread.count | sysdig_container_thread_count sysdig_host_thread_count sysdig_program_thread_count |
uptime | sysdig_container_up sysdig_host_up sysdig_program_up |
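For example, a panel that previously charted cpu.used.percent segmented by
container name can be expressed directly against the container-scoped
metric. This is an illustrative query, assuming the container name is
exposed as the container label shown in the label-mapping table below:
avg by (container) (sysdig_container_cpu_used_percent)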
10.1.2 -
Mapping Between Classic Metrics and PromQL Metrics
Starting with SaaS v3.2.6, Sysdig classic metrics and labels have been
renamed to align with the Prometheus naming convention. Sysdig classic
metrics have a dot-separated hierarchy, whereas Prometheus uses
label-based metric organization. The table below helps you identify
the Prometheus metrics and labels and the corresponding ones in the
Sysdig classic system.
Scope | Metric Type | Prometheus Metric | Legacy Metric | Prometheus Labels | Legacy Labels |
---|
host | info | sysdig_host_info | Not exposed | host_mac host instance_id agent_tag_{*}
| host.mac host.hostName host.instanceId agent.tag.{*}
|
| | sysdig_cloud_provider_info | | host_mac provider_id account_id region availability_zone instance_type tag_{*} security_groups host_ip_public host_ip_private host_name name
| host.mac cloudProvider.id cloudProvider.account.id cloudProvider.region cloudProvider.availabilityZone cloudProvider.instance.type cloudProvider.tag.{*} cloudProvider.securityGroups cloudProvider.host.ip.public cloudProvider.host.ip.private cloudProvider.host.name cloudProvider.name
| | |
| data | sysdig_host_cpu_used_percent | cpu.used.percent | | | |
| | sysdig_host_cpu_cores_used | cpu.cores.used | | |
| | sysdig_host_cpu_user_percent | cpu.user.percent | | |
| | sysdig_host_cpu_idle_percent | cpu.idle.percent | | |
| | sysdig_host_cpu_iowait_percent | cpu.iowait.percent | | |
| | sysdig_host_cpu_nice_percent | cpu.nice.percent | | |
| | sysdig_host_cpu_stolen_percent | cpu.stolen.percent | | |
| | sysdig_host_cpu_system_percent | cpu.system.percent | | |
| | sysdig_host_fd_used_percent | fd.used.percent | | |
| | sysdig_host_file_error_open_count | file.error.open.count | | |
| | sysdig_host_file_error_total_count | file.error.total.count | | |
| | sysdig_host_file_in_bytes | file.bytes.in | | |
| | sysdig_host_file_in_iops | file.iops.in | | |
| | sysdig_host_file_in_time | file.time.in | | |
| | sysdig_host_file_open_count | file.open.count | | |
| | sysdig_host_file_out_bytes | file.bytes.out | | |
| | sysdig_host_file_out_iops | file.iops.out | | |
| | sysdig_host_file_out_time | file.time.out | | |
| | sysdig_host_load_average_15m | load.average.15m | | |
| | sysdig_host_load_average_1m | load.average.1m | | |
| | sysdig_host_load_average_5m | load.average.5m | | |
| | sysdig_host_memory_available_bytes | memory.bytes.available | | |
| | sysdig_host_memory_total_bytes | memory.bytes.total | | |
| | sysdig_host_memory_used_bytes | memory.bytes.used | | |
| | sysdig_host_memory_swap_available_bytes | memory.swap.bytes.available | | |
| | sysdig_host_memory_swap_total_bytes | memory.swap.bytes.total | | |
| | sysdig_host_memory_swap_used_bytes | memory.swap.bytes.used | | |
| | sysdig_host_memory_virtual_bytes | memory.bytes.virtual | | |
| | sysdig_host_net_connection_in_count | net.connection.count.in | | |
| | sysdig_host_net_connection_out_count | net.connection.count.out | | |
| | sysdig_host_net_error_count | net.error.count | | |
| | sysdig_host_net_in_bytes | net.bytes.in | | |
| | sysdig_host_net_out_bytes | net.bytes.out | | |
| | sysdig_host_net_tcp_queue_len | net.tcp.queue.len | | |
| | sysdig_host_proc_count | proc.count | | |
| | sysdig_host_system_uptime | system.uptime | | |
| | sysdig_host_thread_count | thread.count | | |
container | info | sysdig_container_info | Not exposed | container_id | container_id |
| | | | container_full_id | none |
| | | | host_mac | host.mac |
| | | | container | container.name |
| | | | container_type | container.type |
| | | | image | container.image |
| | | | image_id | container.image.id |
| | | | mesos_task_id | container.mesosTaskId Only available in Mesos orchestrator. |
| | | | cluster | kubernetes.cluster.name Present only if the container is part of Kubernetes. |
| | | | pod | kubernetes.pod.name Present only if the container is part of Kubernetes |
| | | | namespace | kubernetes.namespace.name Present only if the container is part of Kubernetes. |
| data | sysdig_container_cpu_used_percent | cpu.used.percent | host_mac container_id container_type container
| host.mac container.id container.type container.name
|
| | sysdig_container_cpu_cores_used | cpu.cores.used | | |
| | sysdig_container_cpu_cores_used_percent | cpu.cores.used.percent | | |
| | sysdig_container_cpu_quota_used_percent | cpu.quota.used.percent | | |
| | sysdig_container_cpu_shares | cpu.shares.count | | |
| | sysdig_container_cpu_shares_used_percent | cpu.shares.used.percent | | |
| | sysdig_container_fd_used_percent | fd.used.percent | | |
| | sysdig_container_file_error_open_count | file.error.open.count | | |
| | sysdig_container_file_error_total_count | file.error.total.count | | |
| | sysdig_container_file_in_bytes | file.bytes.in | | |
| | sysdig_container_file_in_iops | file.iops.in | | |
| | sysdig_container_file_in_time | file.time.in | | |
| | sysdig_container_file_open_count | file.open.count | | |
| | sysdig_container_file_out_bytes | file.bytes.out | | |
| | sysdig_container_file_out_iops | file.iops.out | | |
| | sysdig_container_file_out_time | file.time.out | | |
| | sysdig_container_memory_limit_bytes | memory.limit.bytes | | |
| | sysdig_container_memory_limit_used_percent | memory.limit.used.percent | | |
| | sysdig_container_memory_swap_available_bytes | memory.swap.bytes.available | | |
| | sysdig_container_memory_swap_total_bytes | memory.swap.bytes.total | | |
| | sysdig_container_memory_swap_used_bytes | memory.swap.bytes.used | | |
| | sysdig_container_memory_used_bytes | memory.bytes.used | | |
| | sysdig_container_memory_virtual_bytes | memory.bytes.virtual | | |
| | sysdig_container_net_connection_in_count | net.connection.count.in | | |
| | sysdig_container_net_connection_out_count | net.connection.count.out | | |
| | sysdig_container_net_error_count | net.error.count | | |
| | sysdig_container_net_in_bytes | net.bytes.in | | |
| | sysdig_container_net_out_bytes | net.bytes.out | | |
| | sysdig_container_net_tcp_queue_len | net.tcp.queue.len | | |
| | sysdig_container_proc_count | proc.count | | |
| | sysdig_container_swap_limit_bytes | swap.limit.bytes | | |
| | sysdig_container_thread_count | thread.count | | |
Process/ Program | Info | sysdig_program_info | not exposed | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | | | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| data | sysdig_program_cpu_used_percent | cpu.used.percent | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | sysdig_program_memory_used_bytes | memory.bytes.used | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | sysdig_program_net_in_bytes | net.bytes.in | container_id | container.id |
| | | | host_mac | host.mac |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | sysdig_program_net_out_bytes | net.bytes.out | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | sysdig_program_proc_count | proc.count | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
| | sysdig_program_thread_count | thread.count | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | program | proc.name |
| | | | cmd_line | proc.commandLine |
fs | info | sysdig_fs_info | not exposed | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
| | | | mount_dir | fs.mountDir |
| | | | type | fs.type |
| data | sysdig_fs_free_bytes | fs.bytes.free | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
| | sysdig_fs_inodes_total_count | fs.inodes.total.count | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
| | sysdig_fs_inodes_used_count | fs.inodes.used.count | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
| | sysdig_fs_total_bytes | fs.bytes.total | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
| | sysdig_fs_used_bytes | fs.bytes.used | host_mac | host.mac |
| | | | container_id | container.id |
| | | | container_type | container.type |
| | | | device | fs.device |
10.1.3 -
Mapping Legacy Sysdig Kubernetes Metrics with Prometheus Metrics
In Kubernetes terms, these Prometheus metrics are Kube State
Metrics. They are available in Sysdig PromQL and can be mapped
to the existing Sysdig Kubernetes metrics.
For descriptions on Kubernetes State Metrics, see Kubernetes State
Metrics.
Pod | kubernetes.pod.containers.waiting | kube_pod_container_status_waiting | | |
| kubernetes.pod.resourceLimits.cpuCores kubernetes.pod.resourceLimits.memBytes | kube_pod_container_resource_limits kube_pod_sysdig_resource_limits_memory_bytes kube_pod_sysdig_resource_limits_cpu_cores | | {namespace="default",pod="pod0",container="pod1_con1",resource="cpu",unit="core"} {namespace="default",pod="pod0",container="pod1_con1",resource="memory",unit="byte"} |
| kubernetes.pod.resourceRequests.cpuCores kubernetes.pod.resourceRequests.memBytes | kube_pod_container_resource_requests kube_pod_sysdig_resource_requests_cpu_cores kube_pod_sysdig_resource_requests_memory_bytes | | {namespace="default",pod="pod0",container="pod1_con1",resource="cpu",unit="core"} {namespace="default",pod="pod0",container="pod1_con1",resource="memory",unit="byte"} |
| kubernetes.pod.status.ready | kube_pod_status_ready | | |
| | kube_pod_info | | {namespace="default",pod="pod0",host_ip="1.1.1.1",pod_ip="1.2.3.4",uid="abc-0",node="node1",created_by_kind="<none>",created_by_name="<none>",priority_class=""} |
| | kube_pod_owner | | {namespace="default",pod="pod0",owner_kind="<none>",owner_name="<none>;",owner_is_controller="<none>"} |
| | kube_pod_labels | | {namespace="default",pod="pod0", label_app="myApp"} |
| | kube_pod_container_info | | {namespace="default",pod="pod0",container="container2",image="k8s.gcr.io/hyperkube2",image_id="docker://sha256:bbb",container_id="docker://cd456"} |
node | kubernetes.node.allocatable.cpuCores | kube_node_status_allocatable_cpu_cores | node=<node-address> resource=<resource-name> unit=<resource-unit> node=<node-address>
| resource/unit have one of the values: (cpu, core); (memory, byte); (pods, integer). Sysdig currently supports only CPU, pods, and memory resources for kube_node_status_capacity metrics. # HELP kube_node_status_capacity The capacity for different resources of a node.
kube_node_status_capacity{node="k8s-master",resource="hugepages_1Gi",unit="byte"} 0
kube_node_status_capacity{node="k8s-master",resource="hugepages_2Mi",unit="byte"} 0
kube_node_status_capacity{node="k8s-master",resource="memory",unit="byte"} 4.16342016e+09
kube_node_status_capacity{node="k8s-master",resource="pods",unit="integer"} 110
kube_node_status_capacity{node="k8s-node1",resource="pods",unit="integer"} 110
kube_node_status_capacity{node="k8s-node1",resource="cpu",unit="core"} 2
kube_node_status_capacity{node="k8s-node1",resource="hugepages_1Gi",unit="byte"} 0
kube_node_status_capacity{node="k8s-node1",resource="hugepages_2Mi",unit="byte"} 0
kube_node_status_capacity{node="k8s-node1",resource="memory",unit="byte"} 6.274154496e+09
kube_node_status_capacity{node="k8s-node2",resource="hugepages_1Gi",unit="byte"} 0
kube_node_status_capacity{node="k8s-node2",resource="hugepages_2Mi",unit="byte"} 0
kube_node_status_capacity{node="k8s-node2",resource="memory",unit="byte"} 6.274154496e+09
kube_node_status_capacity{node="k8s-node2",resource="pods",unit="integer"} 110
kube_node_status_capacity{node="k8s-node2",resource="cpu",unit="core"} 2
|
| kubernetes.node.allocatable.memBytes | kube_node_status_allocatable_memory_bytes | | |
| kubernetes.node.allocatable.pods | kube_node_status_allocatable_pods | | |
| kubernetes.node.capacity.cpuCores | kube_node_status_capacity_cpu_cores | node=<node-address> resource=<resource-name> unit=<resource-unit> node=<node-address>
| |
| kubernetes.node.capacity.memBytes | kube_node_status_capacity_memory_bytes | | |
| kubernetes.node.capacity.pod | kube_node_status_capacity_pods | | |
| kubernetes.node.diskPressure | kube_node_status_condition | | |
| kubernetes.node.memoryPressure | | | |
| kubernetes.node.networkUnavailable | | | |
| kubernetes.node.outOfDisk | | | |
| kubernetes.node.ready | | | |
| kubernetes.node.unschedulable | kube_node_spec_unschedulable | | |
| | kube_node_info | | |
| | kube_node_labels | | |
Deployment | kubernetes.deployment.replicas.available | kube_deployment_status_replicas_available | | |
| kubernetes.deployment.replicas.desired | kube_deployment_spec_replicas | | |
| kubernetes.deployment.replicas.paused | kube_deployment_spec_paused | | |
| kubernetes.deployment.replicas.running | kube_deployment_status_replicas | | |
| kubernetes.deployment.replicas.unavailable | kube_deployment_status_replicas_unavailable | | |
| kubernetes.deployment.replicas.updated | kube_deployment_status_replicas_updated | | |
| | kube_deployment_labels | | |
job | kubernetes.job.completions | kube_job_spec_completions | | |
| kubernetes.job.numFailed | kube_job_failed | | |
| kubernetes.job.numSucceeded | kube_job_complete | | |
| kubernetes.job.parallelism | kube_job_spec_parallelism | | |
| | kube_job_status_active | | |
| | kube_job_info | | |
| | kube_job_owner | | |
| | kube_job_labels | | |
daemonSet | kubernetes.daemonSet.pods.desired | kube_daemonset_status_desired_number_scheduled | | |
| kubernetes.daemonSet.pods.misscheduled | kube_daemonset_status_number_misscheduled | | |
| kubernetes.daemonSet.pods.ready | kube_daemonset_status_number_ready | | |
| kubernetes.daemonSet.pods.scheduled | kube_daemonset_status_current_number_scheduled | | |
| | kube_daemonset_labels | daemonset=<daemonset-name> namespace=<daemonset-namespace> label_daemonset_label=<daemonset_label>
| |
replicaSet | kubernetes.replicaSet.replicas.fullyLabeled | kube_replicaset_status_fully_labeled_replicas | | |
| kubernetes.replicaSet.replicas.ready | kube_replicaset_status_ready_replicas | | |
| kubernetes.replicaSet.replicas.running | kube_replicaset_status_replicas | | |
| kubernetes.replicaSet.replicas.desired | kube_replicaset_spec_replicas | | |
| | kube_replicaset_owner | | | |
| | kube_replicaset_labels | label_replicaset_label=<replicaset_label> replicaset=<replicaset-name> namespace=<replicaset-namespace>
|
statefulset | kubernetes.statefulset.replicas | kube_statefulset_replicas | | |
| kubernetes.statefulset.status.replicas | kube_statefulset_status_replicas | | |
| kubernetes.statefulset.status.replicas.current | kube_statefulset_status_replicas_current | | |
| kubernetes.statefulset.status.replicas.ready | kube_statefulset_status_replicas_ready | | |
| kubernetes.statefulset.status.replicas.updated | kube_statefulset_status_replicas_updated | | |
| | kube_statefulset_labels | | |
hpa | kubernetes.hpa.replicas.min | kube_horizontalpodautoscaler_spec_min_replicas | | |
| kubernetes.hpa.replicas.max | kube_horizontalpodautoscaler_spec_max_replicas | | |
| kubernetes.hpa.replicas.current | kube_horizontalpodautoscaler_status_current_replicas | | |
| kubernetes.hpa.replicas.desired | kube_horizontalpodautoscaler_status_desired_replicas | | |
| | kube_horizontalpodautoscaler_labels | | |
resourcequota | kubernetes.resourcequota.configmaps.hard kubernetes.resourcequota.configmaps.used kubernetes.resourcequota.limits.cpu.hard kubernetes.resourcequota.limits.cpu.used kubernetes.resourcequota.limits.memory.hard kubernetes.resourcequota.limits.memory.used kubernetes.resourcequota.persistentvolumeclaims.hard kubernetes.resourcequota.persistentvolumeclaims.used kubernetes.resourcequota.cpu.hard kubernetes.resourcequota.memory.hard kubernetes.resourcequota.pods.hard kubernetes.resourcequota.pods.used kubernetes.resourcequota.replicationcontrollers.hard kubernetes.resourcequota.replicationcontrollers.used kubernetes.resourcequota.requests.cpu.hard kubernetes.resourcequota.requests.cpu.used kubernetes.resourcequota.requests.memory.hard kubernetes.resourcequota.requests.memory.used kubernetes.resourcequota.requests.storage.hard kubernetes.resourcequota.requests.storage.used kubernetes.resourcequota.resourcequotas.hard kubernetes.resourcequota.resourcequotas.used kubernetes.resourcequota.secrets.hard kubernetes.resourcequota.secrets.used kubernetes.resourcequota.services.hard kubernetes.resourcequota.services.used kubernetes.resourcequota.services.loadbalancers.hard kubernetes.resourcequota.services.loadbalancers.used kubernetes.resourcequota.services.nodeports.hard kubernetes.resourcequota.services.nodeports.used | kube_resourcequota | | |
namespace | | kube_namespace_labels | | |
replicationcontroller | kubernetes.replicationcontroller.replicas.desired | kube_replicationcontroller_spec_replicas | | |
| kubernetes.replicationcontroller.replicas.running | kube_replicationcontroller_status_replicas | | |
| | kube_replicationcontroller_status_fully_labeled_replicas kube_replicationcontroller_status_ready_replicas kube_replicationcontroller_status_available_replicas kube_replicationcontroller_status_observed_generation kube_replicationcontroller_metadata_generation kube_replicationcontroller_created | | |
| | kube_replicationcontroller_owner | | |
service | | kube_service_info | service=<service-name> namespace=<service-namespace> cluster_ip=<service cluster ip> external_name=<service external name> load_balancer_ip=<service load balancer ip>
| |
| | kube_service_labels | | |
persistentvolume | kubernetes.persistentvolume.storage | kube_persistentvolume_capacity_bytes | | |
| | kube_persistentvolume_info | | |
| | kube_persistentvolume_labels | | |
persistentvolumeclaim | kubernetes.persistentvolumeclaim.requests.storage | kube_persistentvolumeclaim_resource_requests_storage_bytes | | |
| | kube_persistentvolumeclaim_info | | |
| | kube_persistentvolumeclaim_labels | persistentvolumeclaim=<persistentvolumeclaim-name> namespace=<persistentvolumeclaim-namespace> label_persistentvolumeclaim_label=<persistentvolumeclaim_label>
| |
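Any of the Kube State Metrics above can be queried directly in Sysdig
PromQL. For example, an illustrative query that counts waiting containers
per namespace looks like this:
sum by (namespace) (kube_pod_container_status_waiting)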
10.1.4 -
Run PromQL Queries Faster with Extended Label Set
Sysdig allows you to run PromQL queries more smoothly and quickly by using
the extended label set. The extended label set is created by augmenting the
incoming data with the rich metadata associated with your infrastructure
and making it available in PromQL.
With this, you can troubleshoot a problem or build dashboards and
alerts without writing complex queries. Sysdig automatically
enriches your metrics with Kubernetes and application context without
the need to instrument additional labels in your environment. This
reduces operational complexity and cost; the enrichment takes place in
the Sysdig metric ingestion pipeline after the time series have been sent
to the backend.
Calculate Memory Usage by Deployment in a Cluster
Using the vector matching operation, you could run the following query
and calculate the memory usage by deployment in a cluster:
sum by(cluster,namespace,owner_name) ((sysdig_container_memory_used_bytes * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info) * on(pod,namespace,cluster) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~".+",cluster=~".+",namespace=~".+"})
To get the result, you need to write a query to perform a join (vector
match) of various metrics, usually in the following order:
Grab a metric you need that is defined at the container level. For
example, a Prometheus metric or one of the Sysdig-provided metrics,
such as sysdig_container_memory_used_bytes
.
Perform a vector match on container ID with the metric
kube_pod_container_info
to get the pod metadata.
Perform a vector match on the pod, namespace, and cluster with the
kube_pod_owner
metric.
In the case of Sysdig’s extended label set for PromQL, all the metrics
inherit the metadata, so that necessary container, host, and Kubernetes
metadata are set on all the metrics. This simplifies the query so you
can build and run it quickly.
Likewise, the above query can be simplified as follows:
sum by (kube_cluster_name,kube_namespace_name,kube_deployment_name) (sysdig_container_memory_used_bytes{kube_cluster_name!="",kube_namespace_name!="",kube_deployment_name!=""})

The advantages of using a simplified query are:
Complex vector matching operations (the group_left and group_right
operators) are no longer required. All the labels are already
available on each of the metrics, and therefore, any filtering can
be performed directly on the metric itself.
The metrics now carry a large number of labels. You can use the
PromQL Explorer to
work with this rich metadata.
The metadata is distinguishable from user-defined labels. For
example, Kubernetes metadata labels start with kube_
. For
instance, cluster
is replaced with kube_cluster_name
.
Create a dashboard panel or an alert from the PromQL query you run
in the PromQL Explorer.
Filter data by applying the comparison operators on the label values
given in the table.
Examples for Simplifying Queries
Given below are some of the examples of using the extended label set to
simplify complex query operations.
Memory Usage in a Kubernetes Cluster
Query with core label set:
avg by (agent_tag_cluster) ((sysdig_host_memory_used_bytes/sysdig_host_memory_total_bytes) * on(host,agent_tag_cluster) sysdig_host_info{agent_tag_cluster=~".+"}) * 100
Query with the extended label set:
avg by (agent_tag_cluster) (sysdig_host_memory_used_bytes/sysdig_host_memory_total_bytes) * 100

CPU Usage in Containers
Query with the core label set:
sum by (cluster,namespace)(sysdig_container_cpu_cores_used * on (container_id) group_left(cluster,pod,namespace) kube_pod_container_info{cluster=~".+"})
Simplified query with the extended label set:
sum by (kube_cluster_name,kube_namespace_name)(sysdig_container_cpu_cores_used{kube_cluster_name=~".+"})

Memory Usage in Daemonset
Query with the core label set:
sum by(cluster,namespace,owner_name) (sum by(pod) (label_replace(sysdig_container_memory_used_bytes * on(container_id,host_mac) group_left(label_io_kubernetes_pod_namespace,label_io_kubernetes_pod_name,label_io_kubernetes_container_name) sysdig_container_info{label_io_kubernetes_pod_namespace=~".*",cluster=~".*"},"pod","$1","label_io_kubernetes_pod_name","(.*)")) * on(pod) group_right sum by(cluster,namespace,owner_name,pod) (kube_pod_owner{owner_kind=~"DaemonSet",owner_name=~".*",cluster=~".*",namespace=~".*"}))
Simplified query with the extended label set:
sum by(kube_cluster_name,kube_namespace_name,kube_daemonset_name) (sysdig_container_memory_used_bytes{kube_daemonset_name=~".*",kube_cluster_name=~".*",kube_namespace_name=~".*"})

Pod Restarts in a Kubernetes Cluster
Query with the core label set:
sum by(cluster,namespace,owner_name)(changes(kube_pod_status_ready{condition="true",cluster=~$cluster,namespace=~$namespace}[$__interval]) * on(cluster,namespace,pod) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~".+",cluster=~$cluster,namespace=~$namespace})
Simplified query with the extended label set:
sum by (kube_cluster_name,kube_namespace_name,kube_deployment_name)(changes(kube_pod_status_ready{condition="true",kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_deployment_name=~".+"}[$__interval]))

Containers per Image
Query with the core label set:
count by (owner_name,image,cluster,namespace)((sysdig_container_info{cluster=~$cluster,namespace=~$namespace}) * on(pod,namespace,cluster) group_left(owner_name) max by (pod,namespace,cluster,owner_name)(kube_pod_owner{owner_kind="Deployment",owner_name=~".+"}))
Simplified query with the extended label set:
count by (kube_deployment_name,image,kube_cluster_name,kube_namespace_name)(sysdig_container_info{kube_deployment_name=~".+",kube_cluster_name=~$cluster,kube_namespace_name=~$namespace})

Average TCP Queue per Node
Query with the core label set:
avg by (agent_tag_cluster,host)( sysdig_host_net_tcp_queue_len * on (host_mac) group_left(agent_tag_cluster,host) sysdig_host_info{agent_tag_cluster=~$cluster,host=~".+"})
Simplified query with the extended label set:
avg by (agent_tag_cluster,host_hostname) (sysdig_host_net_tcp_queue_len{agent_tag_cluster =~ $cluster})

10.2 -
Agent
sysdig_agent_info
|Prometheus ID |sysdig_agent_info |
|Legacy ID |info |
|Metric Type |gauge |
|Unit |number |
|Description |This metric always has a value of 1.|
|Additional Notes| |
sysdig_agent_timeseries_count_appcheck
|Prometheus ID |sysdig_agent_timeseries_count_appcheck |
|Legacy ID |metricCount.appCheck |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of time series received from appcheck integrations.|
|Additional Notes| |
sysdig_agent_timeseries_count_jmx
|Prometheus ID |sysdig_agent_timeseries_count_jmx |
|Legacy ID |metricCount.jmx |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of time series received from JMX integrations.|
|Additional Notes| |
sysdig_agent_timeseries_count_prometheus
|Prometheus ID |sysdig_agent_timeseries_count_prometheus |
|Legacy ID |metricCount.prometheus |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of time series received from Prometheus integrations.|
|Additional Notes| |
sysdig_agent_timeseries_count_statsd
|Prometheus ID |sysdig_agent_timeseries_count_statsd |
|Legacy ID |metricCount.statsd |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of time series received from StatsD integrations.|
|Additional Notes| |
10.3 -
Containers
sysdig_container_count
|Prometheus ID |sysdig_container_count |
|Legacy ID |container.count |
|Metric Type |gauge |
|Unit |number |
|Description |The count of the number of containers. |
|Additional Notes|This metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) containers of a certain type in a certain group or node - try segmenting by container.image, .id or .name. See also: host.count.|
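For example, a count of running containers per image can be obtained in
PromQL from the container info metric (an illustrative query, using the
image label shown in the label-mapping tables):
count by (image) (sysdig_container_info)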
sysdig_container_cpu_cgroup_used_percent
|Prometheus ID |sysdig_container_cpu_cgroup_used_percent |
|Legacy ID |cpu.cgroup.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of a container’s cgroup limit that is actually used. This is the minimum usage for the underlying cgroup limits: cpuset.limit and quota.limit.|
|Additional Notes| |
sysdig_container_cpu_cores_cgroup_limit
|Prometheus ID |sysdig_container_cpu_cores_cgroup_limit |
|Legacy ID |cpu.cores.cgroup.limit |
|Metric Type |gauge |
|Unit |number |
|Description |The number of CPU cores assigned to a container. This is the minimum of the cgroup limits: cpuset.limit and quota.limit.|
|Additional Notes| |
sysdig_container_cpu_cores_quota_limit
|Prometheus ID |sysdig_container_cpu_cores_quota_limit |
|Legacy ID |cpu.cores.quota.limit |
|Metric Type |gauge |
|Unit |number |
|Description |The number of CPU cores assigned to a container. Technically, the container’s cgroup quota and period. This is a way of creating a CPU limit for a container.|
|Additional Notes| |
sysdig_container_cpu_cores_used
|Prometheus ID |sysdig_container_cpu_cores_used |
|Legacy ID |cpu.cores.used |
|Metric Type |gauge |
|Unit |number |
|Description |The CPU core usage of each container is obtained from cgroups, and is equal to the number of cores used by the container. For example, if a container uses two of an available four cores, the value of sysdig_container_cpu_cores_used
will be two.|
|Additional Notes| |
sysdig_container_cpu_cores_used_percent
|Prometheus ID |sysdig_container_cpu_cores_used_percent |
|Legacy ID |cpu.cores.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The CPU core usage percent for each container is obtained from cgroups, and is equal to the number of cores multiplied by 100. For example, if a container uses three cores, the value of sysdig_container_cpu_cores_used_percent
would be 300%.|
|Additional Notes| |
sysdig_container_cpu_quota_used_percent
|Prometheus ID |sysdig_container_cpu_quota_used_percent |
|Legacy ID |cpu.quota.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of a container’s CPU Quota that is actually used. CPU Quotas are a common way of creating a CPU limit for a container. CPU Quotas are based on a percentage of time - a container can only spend its quota of time on CPU cycles across a given time period (default period is 100ms). Note that, unlike CPU Shares, CPU Quota is a hard limit to the amount of CPU the container can use - so this metric, CPU Quota %, should not exceed 100%.|
|Additional Notes| |
sysdig_container_cpu_shares_count
|Prometheus ID |sysdig_container_cpu_shares_count |
|Legacy ID |cpu.shares.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of CPU shares assigned to a container (technically, the container’s cgroup) - this is a common way of creating a CPU limit for a container. CPU Shares represent a relative weight used by the kernel to distribute CPU cycles across different containers. The default value for a container is 1024. Each container receives its own allocation of CPU cycles, according to the ratio of its share count to the total number of shares claimed by all containers. For example, if you have three containers, each with 1024 shares, then each will receive 1/3 of the CPU cycles. Note that this is not a hard limit: a container can consume more than its allocation, if the CPU has cycles that aren’t being consumed by the container they were originally allocated to.|
|Additional Notes| |
sysdig_container_cpu_shares_used_percent
|Prometheus ID |sysdig_container_cpu_shares_used_percent |
|Legacy ID |cpu.shares.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of a container’s allocated CPU shares that are actually used. CPU Shares are a common way of creating a CPU limit for a container. CPU Shares represent a relative weight used by the kernel to distribute CPU cycles across different containers. The default value for a container is 1024. Each container receives its own allocation of CPU cycles, according to the ratio of its share count to the total number of shares claimed by all containers. For example, if you have three containers, each with 1024 shares, then each will receive 1/3 of the CPU cycles. Note that this is not a hard limit: a container can consume more than its allocation, if the CPU has cycles that aren’t being consumed by the container they were originally allocated to - so this metric, CPU Shares %, can actually exceed 100%.|
|Additional Notes| |
sysdig_container_cpu_used_percent
|Prometheus ID |sysdig_container_cpu_used_percent |
|Legacy ID |cpu.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The CPU usage for each container is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the container or processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and processes.|
|Additional Notes| |
sysdig_container_fd_used_percent
|Prometheus ID |sysdig_container_fd_used_percent |
|Legacy ID |fd.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of used file descriptors out of the maximum available. |
|Additional Notes|Usually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better use for alerts.|
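For instance, an illustrative PromQL alert condition on this metric could
warn before containers approach their descriptor limit (assuming the
container-name label from the label-mapping tables):
max by (container) (sysdig_container_fd_used_percent) > 90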
sysdig_container_file_error_open_count
|Prometheus ID |sysdig_container_file_error_open_count |
|Legacy ID |file.error.open.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of errors in opening files. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_error_total_count
|Prometheus ID |sysdig_container_file_error_total_count |
|Legacy ID |file.error.total.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of errors caused by file access. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_in_bytes
|Prometheus ID |sysdig_container_file_in_bytes |
|Legacy ID |file.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes read from file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_in_iops
|Prometheus ID |sysdig_container_file_in_iops |
|Legacy ID |file.iops.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of file read operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_container_file_in_time
|Prometheus ID |sysdig_container_file_in_time |
|Legacy ID |file.time.in |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file reading. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_open_count
|Prometheus ID |sysdig_container_file_open_count |
|Legacy ID |file.open.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the file has been opened.|
|Additional Notes| |
sysdig_container_file_out_bytes
|Prometheus ID |sysdig_container_file_out_bytes |
|Legacy ID |file.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_out_iops
|Prometheus ID |sysdig_container_file_out_iops |
|Legacy ID |file.iops.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of file write operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_container_file_out_time
|Prometheus ID |sysdig_container_file_out_time |
|Legacy ID |file.time.out |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file writing. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_total_bytes
|Prometheus ID |sysdig_container_file_total_bytes |
|Legacy ID |file.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes read from and written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_file_total_iops
|Prometheus ID |sysdig_container_file_total_iops |
|Legacy ID |file.iops.total |
|Metric Type |counter |
|Unit |number |
|Description |The number of read and write file operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_container_file_total_time
|Prometheus ID |sysdig_container_file_total_time |
|Legacy ID |file.time.total |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file I/O. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_fs_free_bytes
|Prometheus ID |sysdig_container_fs_free_bytes |
|Legacy ID |fs.bytes.free |
|Metric Type |gauge |
|Unit |data |
|Description |The available space in the filesystem.|
|Additional Notes| |
sysdig_container_fs_free_percent
|Prometheus ID |sysdig_container_fs_free_percent |
|Legacy ID |fs.free.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of free space in the filesystem.|
|Additional Notes| |
sysdig_container_fs_inodes_total_count
|Prometheus ID |sysdig_container_fs_inodes_total_count |
|Legacy ID |fs.inodes.total.count |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of inodes in the filesystem.|
|Additional Notes| |
sysdig_container_fs_inodes_used_count
|Prometheus ID |sysdig_container_fs_inodes_used_count |
|Legacy ID |fs.inodes.used.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of inodes used in the filesystem.|
|Additional Notes| |
sysdig_container_fs_inodes_used_percent
|Prometheus ID |sysdig_container_fs_inodes_used_percent |
|Legacy ID |fs.inodes.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of inodes usage in the filesystem.|
|Additional Notes| |
sysdig_container_fs_largest_used_percent
|Prometheus ID |sysdig_container_fs_largest_used_percent |
|Legacy ID |fs.largest.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of the largest filesystem in use.|
|Additional Notes| |
sysdig_container_fs_root_used_percent
|Prometheus ID |sysdig_container_fs_root_used_percent |
|Legacy ID |fs.root.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of the root filesystem in use in the container.|
|Additional Notes| |
sysdig_container_fs_total_bytes
|Prometheus ID |sysdig_container_fs_total_bytes |
|Legacy ID |fs.bytes.total |
|Metric Type |gauge |
|Unit |data |
|Description |The size of the container filesystem.|
|Additional Notes| |
sysdig_container_fs_used_bytes
|Prometheus ID |sysdig_container_fs_used_bytes |
|Legacy ID |fs.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |The used space in the container filesystem.|
|Additional Notes| |
sysdig_container_fs_used_percent
|Prometheus ID |sysdig_container_fs_used_percent |
|Legacy ID |fs.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of the sum of all filesystems in use in the container.|
|Additional Notes| |
sysdig_container_info
|Prometheus ID |sysdig_container_info |
|Legacy ID |info |
|Metric Type |gauge |
|Unit |number |
|Description |The info metric always has a value of 1.|
|Additional Notes| |
sysdig_container_memory_limit_bytes
|Prometheus ID |sysdig_container_memory_limit_bytes |
|Legacy ID |memory.limit.bytes |
|Metric Type |gauge |
|Unit |data |
|Description |The memory limit in bytes assigned to a container.|
|Additional Notes| |
sysdig_container_memory_limit_used_percent
|Prometheus ID |sysdig_container_memory_limit_used_percent |
|Legacy ID |memory.limit.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of memory limit used by a container.|
|Additional Notes| |
sysdig_container_memory_used_bytes
|Prometheus ID |sysdig_container_memory_used_bytes |
|Legacy ID |memory.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |The amount of physical memory currently in use. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_container_memory_used_percent
|Prometheus ID |sysdig_container_memory_used_percent |
|Legacy ID |memory.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of physical memory in use. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_memory_virtual_bytes
|Prometheus ID |sysdig_container_memory_virtual_bytes |
|Legacy ID |memory.bytes.virtual |
|Metric Type |gauge |
|Unit |data |
|Description |The virtual memory size of the process, in bytes. This value is obtained from Sysdig events.|
|Addional Notes| |
sysdig_container_net_connection_in_count
|Prometheus ID |sysdig_container_net_connection_in_count |
|Legacy ID |net.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established client (inbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_container_net_connection_out_count
|Prometheus ID |sysdig_container_net_connection_out_count |
|Legacy ID |net.connection.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established server (outbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_container_net_connection_total_count
|Prometheus ID |sysdig_container_net_connection_total_count |
|Legacy ID |net.connection.count.total |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.|
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
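As the notes above suggest, the connection-count metrics are most informative when broken down. As a rough sketch (not taken from the Sysdig documentation), a PromQL query along the following lines would show established connections per container, assuming the time series carry a container_name label:

  sum by (container_name) (sysdig_container_net_connection_total_count)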
sysdig_container_net_error_count
|Prometheus ID |sysdig_container_net_error_count |
|Legacy ID |net.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of network errors. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_net_http_error_count
|Prometheus ID |sysdig_container_net_http_error_count |
|Legacy ID |net.http.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of failed HTTP requests as counted from 4xx/5xx status codes.|
|Addional Notes| |
sysdig_container_net_http_request_count
|Prometheus ID |sysdig_container_net_http_request_count|
|Legacy ID |net.http.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The count of HTTP requests. |
|Addional Notes| |
sysdig_container_net_http_request_time
|Prometheus ID |sysdig_container_net_http_request_time |
|Legacy ID |net.http.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time taken for HTTP requests.|
|Addional Notes| |
sysdig_container_net_http_statuscode_error_count
|Prometheus ID |sysdig_container_net_http_statuscode_error_count|
|Legacy ID |net.http.statuscode.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of HTTP error codes returned. |
|Addional Notes| |
sysdig_container_net_http_statuscode_request_count
|Prometheus ID |sysdig_container_net_http_statuscode_request_count|
|Legacy ID |net.http.statuscode.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of HTTP status code requests. |
|Addional Notes| |
sysdig_container_net_http_url_error_count
|Prometheus ID |sysdig_container_net_http_url_error_count|
|Legacy ID |net.http.url.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_http_url_request_count
|Prometheus ID |sysdig_container_net_http_url_request_count|
|Legacy ID |net.http.url.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of HTTP URL requests. |
|Addional Notes| |
sysdig_container_net_http_url_request_time
|Prometheus ID |sysdig_container_net_http_url_request_time|
|Legacy ID |net.http.url.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The time taken for requesting HTTP URLs. |
|Addional Notes| |
sysdig_container_net_in_bytes
|Prometheus ID |sysdig_container_net_in_bytes |
|Legacy ID |net.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |The number of inbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_net_mongodb_error_count
|Prometheus ID |sysdig_container_net_mongodb_error_count|
|Legacy ID |net.mongodb.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of failed MongoDB requests. |
|Addional Notes| |
sysdig_container_net_mongodb_request_count
|Prometheus ID |sysdig_container_net_mongodb_request_count|
|Legacy ID |net.mongodb.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of MongoDB requests. |
|Addional Notes| |
sysdig_container_net_out_bytes
|Prometheus ID |sysdig_container_net_out_bytes |
|Legacy ID |net.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |The number of outbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_net_request_count
|Prometheus ID |sysdig_container_net_request_count |
|Legacy ID |net.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.|
|Addional Notes| |
sysdig_container_net_request_in_count
|Prometheus ID |sysdig_container_net_request_in_count |
|Legacy ID |net.request.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of inbound network requests.|
|Addional Notes| |
sysdig_container_net_request_in_time
|Prometheus ID |sysdig_container_net_request_in_time |
|Legacy ID |net.request.time.in |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve an inbound request.|
|Addional Notes| |
sysdig_container_net_request_out_count
|Prometheus ID |sysdig_container_net_request_out_count |
|Legacy ID |net.request.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of outbound network requests.|
|Addional Notes| |
sysdig_container_net_request_out_time
|Prometheus ID |sysdig_container_net_request_out_time |
|Legacy ID |net.request.time.out |
|Metric Type |counter |
|Unit |time |
|Description |The average time spent waiting for an outbound request.|
|Addional Notes| |
sysdig_container_net_request_time
|Prometheus ID |sysdig_container_net_request_time |
|Legacy ID |net.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve a network request.|
|Addional Notes| |
sysdig_container_net_server_connection_in_count
|Prometheus ID |sysdig_container_net_server_connection_in_count|
|Legacy ID |net.server.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_server_in_bytes
|Prometheus ID |sysdig_container_net_server_in_bytes|
|Legacy ID |net.server.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_container_net_server_out_bytes
|Prometheus ID |sysdig_container_net_server_out_bytes|
|Legacy ID |net.server.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_container_net_server_total_bytes
|Prometheus ID |sysdig_container_net_server_total_bytes|
|Legacy ID |net.server.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_error_count
|Prometheus ID |sysdig_container_net_sql_error_count|
|Legacy ID |net.sql.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of failed SQL requests. |
|Addional Notes| |
sysdig_container_net_sql_query_error_count
|Prometheus ID |sysdig_container_net_sql_query_error_count|
|Legacy ID |net.sql.query.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_query_request_count
|Prometheus ID |sysdig_container_net_sql_query_request_count|
|Legacy ID |net.sql.query.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_query_request_time
|Prometheus ID |sysdig_container_net_sql_query_request_time|
|Legacy ID |net.sql.query.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_querytype_error_count
|Prometheus ID |sysdig_container_net_sql_querytype_error_count|
|Legacy ID |net.sql.querytype.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_querytype_request_count
|Prometheus ID |sysdig_container_net_sql_querytype_request_count|
|Legacy ID |net.sql.querytype.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_querytype_request_time
|Prometheus ID |sysdig_container_net_sql_querytype_request_time|
|Legacy ID |net.sql.querytype.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_container_net_sql_request_count
|Prometheus ID |sysdig_container_net_sql_request_count|
|Legacy ID |net.sql.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of SQL requests. |
|Addional Notes| |
sysdig_container_net_sql_request_time
|Prometheus ID |sysdig_container_net_sql_request_time |
|Legacy ID |net.sql.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time to complete an SQL request.|
|Addional Notes| |
sysdig_container_net_sql_table_error_count
|Prometheus ID |sysdig_container_net_sql_table_error_count|
|Legacy ID |net.sql.table.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of SQL errors returned. |
|Addional Notes| |
sysdig_container_net_sql_table_request_count
|Prometheus ID |sysdig_container_net_sql_table_request_count|
|Legacy ID |net.sql.table.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of SQL table requests. |
|Addional Notes| |
sysdig_container_net_sql_table_request_time
|Prometheus ID |sysdig_container_net_sql_table_request_time |
|Legacy ID |net.sql.table.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve an SQL table request.|
|Addional Notes| |
sysdig_container_net_tcp_queue_len
|Prometheus ID |sysdig_container_net_tcp_queue_len |
|Legacy ID |net.tcp.queue.len |
|Metric Type |counter |
|Unit |number |
|Description |The length of the TCP request queue.|
|Addional Notes| |
sysdig_container_net_total_bytes
|Prometheus ID |sysdig_container_net_total_bytes |
|Legacy ID |net.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |The total number of network bytes, including inbound and outbound connections. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
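Because this metric is a cumulative byte counter, throughput is usually derived with a rate function. A minimal PromQL sketch (the container_name label is an assumption, not taken from the documentation) for per-container network throughput in bytes per second:

  sum by (container_name) (rate(sysdig_container_net_total_bytes[5m]))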
sysdig_container_proc_count
|Prometheus ID |sysdig_container_proc_count |
|Legacy ID |proc.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of processes on the host or container.|
|Addional Notes| |
sysdig_container_swap_limit_bytes
|Prometheus ID |sysdig_container_swap_limit_bytes |
|Legacy ID |swap.limit.bytes |
|Metric Type |gauge |
|Unit |data |
|Description |The swap limit in bytes assigned to a container.|
|Addional Notes| |
sysdig_container_swap_limit_used_percent
|Prometheus ID |sysdig_container_swap_limit_used_percent |
|Legacy ID |swap.limit.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of swap limit used by the container.|
|Addional Notes| |
sysdig_container_syscall_count
|Prometheus ID |sysdig_container_syscall_count |
|Legacy ID |syscall.count |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of syscalls seen. |
|Additional Notes|Syscalls are resource intensive. This metric tracks how many have been made by a given process or container.|
sysdig_container_syscall_error_count
|Prometheus ID |sysdig_container_syscall_error_count |
|Legacy ID |host.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of system call errors. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_container_thread_count
|Prometheus ID |sysdig_container_thread_count |
|Legacy ID |thread.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of threads running in a container.|
|Addional Notes| |
sysdig_container_timeseries_count_appcheck
|Prometheus ID |sysdig_container_timeseries_count_appcheck|
|Legacy ID |metricCount.appCheck |
|Metric Type |gauge |
|Unit |number |
|Description |The number of appcheck custom metrics. |
|Addional Notes| |
sysdig_container_timeseries_count_jmx
|Prometheus ID |sysdig_container_timeseries_count_jmx|
|Legacy ID |metricCount.jmx |
|Metric Type |gauge |
|Unit |number |
|Description |The number of JMX custom metrics. |
|Addional Notes| |
sysdig_container_timeseries_count_prometheus
|Prometheus ID |sysdig_container_timeseries_count_prometheus|
|Legacy ID |metricCount.prometheus |
|Metric Type |gauge |
|Unit |number |
|Description |The number of Prometheus custom metrics. |
|Addional Notes| |
sysdig_container_timeseries_count_statsd
|Prometheus ID |sysdig_container_timeseries_count_statsd|
|Legacy ID |metricCount.statsd |
|Metric Type |gauge |
|Unit |number |
|Description |The number of StatsD custom metrics. |
|Addional Notes| |
sysdig_container_up
|Prometheus ID |sysdig_container_up |
|Legacy ID |uptime |
|Metric Type |gauge |
|Unit |number |
|Description |The percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.|
|Addional Notes| |
10.4 - File
sysdig_filestats_host_file_error_total_count
|Prometheus ID |sysdig_filestats_host_file_error_total_count |
|Legacy ID |file.error.total.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of errors caused by file access. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_filestats_host_file_in_bytes
|Prometheus ID |sysdig_filestats_host_file_in_bytes |
|Legacy ID |file.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes read from file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_filestats_host_file_open_count
|Prometheus ID |sysdig_filestats_host_file_open_count |
|Legacy ID |file.open.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of times the file has been opened.|
|Addional Notes| |
sysdig_filestats_host_file_out_bytes
|Prometheus ID |sysdig_filestats_host_file_out_bytes |
|Legacy ID |file.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_filestats_host_file_total_bytes
|Prometheus ID |sysdig_filestats_host_file_total_bytes |
|Legacy ID |file.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes read from and written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_filestats_host_file_total_time
|Prometheus ID |sysdig_filestats_host_file_total_time |
|Legacy ID |file.time.total |
|Metric Type |counter |
|Unit |time |
|Description |Time spent in file I/O. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_fs_free_bytes
|Prometheus ID |sysdig_fs_free_bytes |
|Legacy ID |fs.bytes.free |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem available space.|
|Addional Notes| |
sysdig_fs_free_percent
|Prometheus ID |sysdig_fs_free_percent |
|Legacy ID |fs.free.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of filesystem free space.|
|Addional Notes| |
sysdig_fs_inodes_total_count
|Prometheus ID |sysdig_fs_inodes_total_count|
|Legacy ID |fs.inodes.total.count |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of inodes in the filesystem.|
|Addional Notes| |
sysdig_fs_inodes_used_count
|Prometheus ID |sysdig_fs_inodes_used_count|
|Legacy ID |fs.inodes.used.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of inodes used in the filesystem.|
|Addional Notes| |
sysdig_fs_inodes_used_percent
|Prometheus ID |sysdig_fs_inodes_used_percent|
|Legacy ID |fs.inodes.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of inodes used in the filesystem.|
|Addional Notes| |
sysdig_fs_total_bytes
|Prometheus ID |sysdig_fs_total_bytes|
|Legacy ID |fs.bytes.total |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem size. |
|Addional Notes| |
sysdig_fs_used_bytes
|Prometheus ID |sysdig_fs_used_bytes |
|Legacy ID |fs.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem used space.|
|Addional Notes| |
sysdig_fs_used_percent
|Prometheus ID |sysdig_fs_used_percent |
|Legacy ID |fs.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of the sum of all filesystems in use.|
|Addional Notes| |
10.5 - Host
sysdig_host_container_count
|Prometheus ID |sysdig_host_container_count |
|Legacy ID |container.count |
|Metric Type |gauge |
|Unit |number |
|Description |Count of the number of containers. |
|Additional Notes|This metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) containers of a certain type in a certain group or node - try segmenting by container.image, .id or .name. See also: host.count.|
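As a hypothetical illustration of the alerting idea in the note above (the container_image label name is an assumption, not taken from the documentation), a PromQL alert condition could fire when fewer than three containers of a given image are running:

  sum by (container_image) (sysdig_host_container_count) < 3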
sysdig_host_container_start_count
|Prometheus ID |sysdig_host_container_start_count|
|Legacy ID |host.container.start.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_count
|Prometheus ID |sysdig_host_count |
|Legacy ID |host.count |
|Metric Type |gauge |
|Unit |number |
|Description |Count of the number of hosts. |
|Additional Notes|This metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) machines of a certain type in a certain group - try segmenting by tag or hostname. See also: container.count.|
sysdig_host_cpu_cores_used
|Prometheus ID |sysdig_host_cpu_cores_used|
|Legacy ID |cpu.cores.used |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_cpu_cores_used_percent
|Prometheus ID |sysdig_host_cpu_cores_used_percent|
|Legacy ID |cpu.cores.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpu_idle_percent
|Prometheus ID |sysdig_host_cpu_idle_percent |
|Legacy ID |cpu.idle.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_cpu_iowait_percent
|Prometheus ID |sysdig_host_cpu_iowait_percent |
|Legacy ID |cpu.iowait.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_cpu_nice_percent
|Prometheus ID |sysdig_host_cpu_nice_percent |
|Legacy ID |cpu.nice.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of CPU utilization that occurred while executing at the user level with nice priority. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_cpu_stolen_percent
|Prometheus ID |sysdig_host_cpu_stolen_percent |
|Legacy ID |cpu.stolen.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |CPU steal time is the percentage of time that a virtual machine’s CPU is in a state of involuntary wait because the physical CPU is shared among virtual machines. In calculating steal time, the operating system kernel detects when it has work available but cannot access the physical CPU to perform that work. |
|Additional Notes|If the percentage of steal time is consistently high, you may want to stop and restart the instance (since it will most likely start on different physical hardware) or upgrade to a virtual machine with more CPU power. Also see the metric ‘capacity total percent’ to see how steal time directly impacts the number of server requests that could not be handled. On AWS EC2, steal time does not depend on the activity of other virtual machine neighbours. EC2 is simply making sure your instance is not using more CPU cycles than paid for.|
sysdig_host_cpu_system_percent
|Prometheus ID |sysdig_host_cpu_system_percent |
|Legacy ID |cpu.system.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of CPU utilization that occurred while executing at the system level (kernel). |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_cpu_used_percent
|Prometheus ID |sysdig_host_cpu_used_percent |
|Legacy ID |cpu.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The CPU usage for each container is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the container or processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and processes.|
|Addional Notes| |
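If an alert on overall host CPU is wanted, one possible PromQL expression (a sketch, not a prescribed rule) averages the gauge over a window before comparing it to a threshold:

  avg_over_time(sysdig_host_cpu_used_percent[10m]) > 80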
sysdig_host_cpu_user_percent
|Prometheus ID |sysdig_host_cpu_user_percent |
|Legacy ID |cpu.user.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of CPU utilization that occurred while executing at the user level (application). |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_cpucore_idle_percent
|Prometheus ID |sysdig_host_cpucore_idle_percent|
|Legacy ID |cpucore.idle.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_iowait_percent
|Prometheus ID |sysdig_host_cpucore_iowait_percent|
|Legacy ID |cpucore.iowait.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_nice_percent
|Prometheus ID |sysdig_host_cpucore_nice_percent|
|Legacy ID |cpucore.nice.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_stolen_percent
|Prometheus ID |sysdig_host_cpucore_stolen_percent|
|Legacy ID |cpucore.stolen.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_system_percent
|Prometheus ID |sysdig_host_cpucore_system_percent|
|Legacy ID |cpucore.system.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_used_percent
|Prometheus ID |sysdig_host_cpucore_used_percent|
|Legacy ID |cpucore.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_cpucore_user_percent
|Prometheus ID |sysdig_host_cpucore_user_percent|
|Legacy ID |cpucore.user.percent |
|Metric Type |gauge |
|Unit |percent |
|Description | |
|Addional Notes| |
sysdig_host_fd_used_percent
|Prometheus ID |sysdig_host_fd_used_percent |
|Legacy ID |fd.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of used file descriptors out of the maximum available. |
|Additional Notes|Usually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better, use for alerts.|
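Following the note above, a simple threshold alert is the typical use of this metric; for example, a hedged PromQL sketch that fires when any host exceeds 90% of its file descriptor limit:

  sysdig_host_fd_used_percent > 90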
sysdig_host_file_error_open_count
|Prometheus ID |sysdig_host_file_error_open_count |
|Legacy ID |file.error.open.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of errors in opening files. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_error_total_count
|Prometheus ID |sysdig_host_file_error_total_count |
|Legacy ID |file.error.total.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of errors caused by file access. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_in_bytes
|Prometheus ID |sysdig_host_file_in_bytes |
|Legacy ID |file.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes read from file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_in_iops
|Prometheus ID |sysdig_host_file_in_iops |
|Legacy ID |file.iops.in |
|Metric Type |counter |
|Unit |number |
|Description |Number of file read operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_host_file_in_time
|Prometheus ID |sysdig_host_file_in_time |
|Legacy ID |file.time.in |
|Metric Type |counter |
|Unit |time |
|Description |Time spent in file reading. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_open_count
|Prometheus ID |sysdig_host_file_open_count |
|Legacy ID |file.open.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of times the file has been opened.|
|Addional Notes| |
sysdig_host_file_out_bytes
|Prometheus ID |sysdig_host_file_out_bytes |
|Legacy ID |file.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_out_iops
|Prometheus ID |sysdig_host_file_out_iops |
|Legacy ID |file.iops.out |
|Metric Type |counter |
|Unit |number |
|Description |Number of file write operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_host_file_out_time
|Prometheus ID |sysdig_host_file_out_time |
|Legacy ID |file.time.out |
|Metric Type |counter |
|Unit |time |
|Description |Time spent in file writing. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_total_bytes
|Prometheus ID |sysdig_host_file_total_bytes |
|Legacy ID |file.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |Amount of bytes read from and written to file. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_file_total_iops
|Prometheus ID |sysdig_host_file_total_iops |
|Legacy ID |file.iops.total |
|Metric Type |counter |
|Unit |number |
|Description |Number of read and write file operations per second. |
|Additional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_host_file_total_time
|Prometheus ID |sysdig_host_file_total_time |
|Legacy ID |file.time.total |
|Metric Type |counter |
|Unit |time |
|Description |Time spent in file I/O. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_fs_free_bytes
|Prometheus ID |sysdig_host_fs_free_bytes |
|Legacy ID |fs.bytes.free |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem available space.|
|Addional Notes| |
sysdig_host_fs_free_percent
|Prometheus ID |sysdig_host_fs_free_percent |
|Legacy ID |fs.free.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of filesystem free space.|
|Addional Notes| |
sysdig_host_fs_inodes_total_count
|Prometheus ID |sysdig_host_fs_inodes_total_count|
|Legacy ID |fs.inodes.total.count |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of inodes in the filesystem.|
|Addional Notes| |
sysdig_host_fs_inodes_used_count
|Prometheus ID |sysdig_host_fs_inodes_used_count|
|Legacy ID |fs.inodes.used.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of inodes used in the filesystem.|
|Addional Notes| |
sysdig_host_fs_inodes_used_percent
|Prometheus ID |sysdig_host_fs_inodes_used_percent|
|Legacy ID |fs.inodes.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of inodes used in the filesystem.|
|Addional Notes| |
sysdig_host_fs_largest_used_percent
|Prometheus ID |sysdig_host_fs_largest_used_percent |
|Legacy ID |fs.largest.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of the largest filesystem in use.|
|Addional Notes| |
sysdig_host_fs_root_used_percent
|Prometheus ID |sysdig_host_fs_root_used_percent |
|Legacy ID |fs.root.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of the root filesystem in use.|
|Addional Notes| |
sysdig_host_fs_total_bytes
|Prometheus ID |sysdig_host_fs_total_bytes|
|Legacy ID |fs.bytes.total |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem size. |
|Addional Notes| |
sysdig_host_fs_used_bytes
|Prometheus ID |sysdig_host_fs_used_bytes|
|Legacy ID |fs.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |Filesystem used space. |
|Addional Notes| |
sysdig_host_fs_used_percent
|Prometheus ID |sysdig_host_fs_used_percent |
|Legacy ID |fs.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Percentage of the sum of all filesystems in use.|
|Addional Notes| |
sysdig_host_info
|Prometheus ID |sysdig_host_info|
|Legacy ID |info |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_load_average_15m
|Prometheus ID |sysdig_host_load_average_15m |
|Legacy ID |load.average.15m |
|Metric Type |gauge |
|Unit |number |
|Description |The 15 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 15 minutes for all cores. The value should correspond to the third (and last) load average value displayed by ‘uptime’ command.|
|Addional Notes| |
sysdig_host_load_average_1m
|Prometheus ID |sysdig_host_load_average_1m |
|Legacy ID |load.average.1m |
|Metric Type |gauge |
|Unit |number |
|Description |The 1 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 1 minute for all cores. The value should correspond to the first (of three) load average values displayed by ‘uptime’ command.|
|Addional Notes| |
sysdig_host_load_average_5m
|Prometheus ID |sysdig_host_load_average_5m |
|Legacy ID |load.average.5m |
|Metric Type |gauge |
|Unit |number |
|Description |The 5 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 5 minutes for all cores. The value should correspond to the second (of three) load average values displayed by ‘uptime’ command.|
|Addional Notes| |
sysdig_host_load_average_percpu_15m
|Prometheus ID |sysdig_host_load_average_percpu_15m |
|Legacy ID |load.average.percpu.15m |
|Metric Type |gauge |
|Unit |number |
|Description |The 15 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 15 minutes, divided by number of system CPUs.|
|Addional Notes| |
sysdig_host_load_average_percpu_1m
|Prometheus ID |sysdig_host_load_average_percpu_1m |
|Legacy ID |load.average.percpu.1m |
|Metric Type |gauge |
|Unit |number |
|Description |The 1 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 1 minute, divided by number of system CPUs.|
|Addional Notes| |
sysdig_host_load_average_percpu_5m
|Prometheus ID |sysdig_host_load_average_percpu_5m |
|Legacy ID |load.average.percpu.5m |
|Metric Type |gauge |
|Unit |number |
|Description |The 5 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O averaged over 5 minutes, divided by number of system CPUs.|
|Addional Notes| |
sysdig_host_memory_available_bytes
|Prometheus ID |sysdig_host_memory_available_bytes |
|Legacy ID |memory.bytes.available |
|Metric Type |gauge |
|Unit |data |
|Description |The available memory for a host is obtained from /proc/meminfo. For environments using Linux kernel version 3.12 and later, the available memory is obtained using the mem.available field in /proc/meminfo. For environments using earlier kernel versions, the formula is MemFree + Cached + Buffers.|
|Addional Notes| |
sysdig_host_memory_swap_available_bytes
|Prometheus ID |sysdig_host_memory_swap_available_bytes |
|Legacy ID |memory.swap.bytes.available |
|Metric Type |gauge |
|Unit |data |
|Description |Available amount of swap memory. |
|Additional Notes|Sum of free and cached swap memory. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_memory_swap_total_bytes
|Prometheus ID |sysdig_host_memory_swap_total_bytes |
|Legacy ID |memory.swap.bytes.total |
|Metric Type |gauge |
|Unit |data |
|Description |Total amount of swap memory. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_memory_swap_used_bytes
|Prometheus ID |sysdig_host_memory_swap_used_bytes |
|Legacy ID |memory.swap.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |Used amount of swap memory. |
|Additional Notes|The amount of used swap memory is calculated by subtracting available from total swap memory. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_memory_swap_used_percent
|Prometheus ID |sysdig_host_memory_swap_used_percent |
|Legacy ID |memory.swap.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |Used percent of swap memory. |
|Additional Notes|The percentage of used swap memory is calculated as the ratio of used to total swap memory. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_memory_total_bytes
|Prometheus ID |sysdig_host_memory_total_bytes |
|Legacy ID |memory.bytes.total |
|Metric Type |gauge |
|Unit |data |
|Description |The total memory of a host, in bytes. This value is obtained from /proc.|
|Addional Notes| |
sysdig_host_memory_used_bytes
|Prometheus ID |sysdig_host_memory_used_bytes |
|Legacy ID |memory.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |The amount of physical memory currently in use. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_host_memory_used_percent
|Prometheus ID |sysdig_host_memory_used_percent |
|Legacy ID |memory.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of physical memory in use. |
|Additional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
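The note above describes the difference between a scope-wide average and a segmented view; in PromQL terms (the host label name is an assumption, not taken from the documentation), the two would look roughly like:

  avg(sysdig_host_memory_used_percent)
  avg by (host) (sysdig_host_memory_used_percent)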
sysdig_host_memory_virtual_bytes
|Prometheus ID |sysdig_host_memory_virtual_bytes |
|Legacy ID |memory.bytes.virtual |
|Metric Type |gauge |
|Unit |data |
|Description |The virtual memory size of the process, in bytes. This value is obtained from Sysdig events.|
|Addional Notes| |
sysdig_host_net_connection_in_count
|Prometheus ID |sysdig_host_net_connection_in_count |
|Legacy ID |net.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description |Number of currently established client (inbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_host_net_connection_out_count
|Prometheus ID |sysdig_host_net_connection_out_count |
|Legacy ID |net.connection.count.out |
|Metric Type |counter |
|Unit |number |
|Description |Number of currently established server (outbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_host_net_connection_total_count
|Prometheus ID |sysdig_host_net_connection_total_count |
|Legacy ID |net.connection.count.total |
|Metric Type |counter |
|Unit |number |
|Description |Number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.|
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_host_net_error_count
|Prometheus ID |sysdig_host_net_error_count |
|Legacy ID |net.error.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of network errors. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_net_http_error_count
|Prometheus ID |sysdig_host_net_http_error_count |
|Legacy ID |net.http.error.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of failed HTTP requests as counted from 4xx/5xx status codes.|
|Addional Notes| |
sysdig_host_net_http_request_count
|Prometheus ID |sysdig_host_net_http_request_count|
|Legacy ID |net.http.request.count |
|Metric Type |counter |
|Unit |number |
|Description |Count of HTTP requests. |
|Addional Notes| |
sysdig_host_net_http_request_time
|Prometheus ID |sysdig_host_net_http_request_time|
|Legacy ID |net.http.request.time |
|Metric Type |counter |
|Unit |time |
|Description |Average time for HTTP requests. |
|Addional Notes| |
sysdig_host_net_http_statuscode_error_count
|Prometheus ID |sysdig_host_net_http_statuscode_error_count|
|Legacy ID |net.http.statuscode.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_http_statuscode_request_count
|Prometheus ID |sysdig_host_net_http_statuscode_request_count|
|Legacy ID |net.http.statuscode.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_http_url_error_count
|Prometheus ID |sysdig_host_net_http_url_error_count|
|Legacy ID |net.http.url.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_http_url_request_count
|Prometheus ID |sysdig_host_net_http_url_request_count|
|Legacy ID |net.http.url.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_http_url_request_time
|Prometheus ID |sysdig_host_net_http_url_request_time|
|Legacy ID |net.http.url.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_collection_error_count
|Prometheus ID |sysdig_host_net_mongodb_collection_error_count|
|Legacy ID |net.mongodb.collection.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_collection_request_count
|Prometheus ID |sysdig_host_net_mongodb_collection_request_count|
|Legacy ID |net.mongodb.collection.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_collection_request_time
|Prometheus ID |sysdig_host_net_mongodb_collection_request_time|
|Legacy ID |net.mongodb.collection.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_error_count
|Prometheus ID |sysdig_host_net_mongodb_error_count|
|Legacy ID |net.mongodb.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_operation_error_count
|Prometheus ID |sysdig_host_net_mongodb_operation_error_count|
|Legacy ID |net.mongodb.operation.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_operation_request_count
|Prometheus ID |sysdig_host_net_mongodb_operation_request_count|
|Legacy ID |net.mongodb.operation.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_operation_request_time
|Prometheus ID |sysdig_host_net_mongodb_operation_request_time|
|Legacy ID |net.mongodb.operation.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_request_count
|Prometheus ID |sysdig_host_net_mongodb_request_count|
|Legacy ID |net.mongodb.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_mongodb_request_time
|Prometheus ID |sysdig_host_net_mongodb_request_time|
|Legacy ID |net.mongodb.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_in_bytes
|Prometheus ID |sysdig_host_net_in_bytes |
|Legacy ID |net.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |Inbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_net_out_bytes
|Prometheus ID |sysdig_host_net_out_bytes |
|Legacy ID |net.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |Outbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_net_request_count
|Prometheus ID |sysdig_host_net_request_count |
|Legacy ID |net.request.count |
|Metric Type |counter |
|Unit |number |
|Description |Total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.|
|Addional Notes| |
sysdig_host_net_request_in_count
|Prometheus ID |sysdig_host_net_request_in_count |
|Legacy ID |net.request.count.in |
|Metric Type |counter |
|Unit |number |
|Description |Number of inbound network requests.|
|Addional Notes| |
sysdig_host_net_request_in_time
|Prometheus ID |sysdig_host_net_request_in_time |
|Legacy ID |net.request.time.in |
|Metric Type |counter |
|Unit |time |
|Description |Average time to serve an inbound request.|
|Addional Notes| |
sysdig_host_net_request_out_count
|Prometheus ID |sysdig_host_net_request_out_count |
|Legacy ID |net.request.count.out |
|Metric Type |counter |
|Unit |number |
|Description |Number of outbound network requests.|
|Addional Notes| |
sysdig_host_net_request_out_time
|Prometheus ID |sysdig_host_net_request_out_time |
|Legacy ID |net.request.time.out |
|Metric Type |counter |
|Unit |time |
|Description |Average time spent waiting for an outbound request.|
|Addional Notes| |
sysdig_host_net_request_time
|Prometheus ID |sysdig_host_net_request_time |
|Legacy ID |net.request.time |
|Metric Type |counter |
|Unit |time |
|Description |Average time to serve a network request.|
|Addional Notes| |
sysdig_host_net_server_connection_in_count
|Prometheus ID |sysdig_host_net_server_connection_in_count|
|Legacy ID |net.server.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_server_in_bytes
|Prometheus ID |sysdig_host_net_server_in_bytes|
|Legacy ID |net.server.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_host_net_server_out_bytes
|Prometheus ID |sysdig_host_net_server_out_bytes|
|Legacy ID |net.server.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_host_net_server_total_bytes
|Prometheus ID |sysdig_host_net_server_total_bytes|
|Legacy ID |net.server.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_error_count
|Prometheus ID |sysdig_host_net_sql_error_count|
|Legacy ID |net.sql.error.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of failed SQL requests. |
|Addional Notes| |
sysdig_host_net_sql_query_error_count
|Prometheus ID |sysdig_host_net_sql_query_error_count|
|Legacy ID |net.sql.query.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_query_request_count
|Prometheus ID |sysdig_host_net_sql_query_request_count|
|Legacy ID |net.sql.query.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_query_request_time
|Prometheus ID |sysdig_host_net_sql_query_request_time|
|Legacy ID |net.sql.query.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_querytype_error_count
|Prometheus ID |sysdig_host_net_sql_querytype_error_count|
|Legacy ID |net.sql.querytype.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_querytype_request_count
|Prometheus ID |sysdig_host_net_sql_querytype_request_count|
|Legacy ID |net.sql.querytype.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_querytype_request_time
|Prometheus ID |sysdig_host_net_sql_querytype_request_time|
|Legacy ID |net.sql.querytype.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_request_count
|Prometheus ID |sysdig_host_net_sql_request_count|
|Legacy ID |net.sql.request.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of SQL requests. |
|Addional Notes| |
sysdig_host_net_sql_request_time
|Prometheus ID |sysdig_host_net_sql_request_time |
|Legacy ID |net.sql.request.time |
|Metric Type |counter |
|Unit |time |
|Description |Average time to complete a SQL request.|
|Addional Notes| |
sysdig_host_net_sql_table_error_count
|Prometheus ID |sysdig_host_net_sql_table_error_count|
|Legacy ID |net.sql.table.error.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_table_request_count
|Prometheus ID |sysdig_host_net_sql_table_request_count|
|Legacy ID |net.sql.table.request.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_net_sql_table_request_time
|Prometheus ID |sysdig_host_net_sql_table_request_time|
|Legacy ID |net.sql.table.request.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
sysdig_host_net_tcp_queue_len
|Prometheus ID |sysdig_host_net_tcp_queue_len |
|Legacy ID |net.tcp.queue.len |
|Metric Type |counter |
|Unit |number |
|Description |Length of the TCP request queue.|
|Addional Notes| |
sysdig_host_net_total_bytes
|Prometheus ID |sysdig_host_net_total_bytes |
|Legacy ID |net.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |Total network bytes, inbound and outbound. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_proc_count
|Prometheus ID |sysdig_host_proc_count |
|Legacy ID |proc.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of processes on the host or container.|
|Addional Notes| |
sysdig_host_syscall_count
|Prometheus ID |sysdig_host_syscall_count |
|Legacy ID |syscall.count |
|Metric Type |gauge |
|Unit |number |
|Description |Total number of syscalls seen. |
|Additional Notes|Syscalls are resource intensive. This metric tracks how many have been made by a given process or container.|
sysdig_host_syscall_error_count
|Prometheus ID |sysdig_host_syscall_error_count |
|Legacy ID |host.error.count |
|Metric Type |counter |
|Unit |number |
|Description |Number of system call errors. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_host_system_uptime
|Prometheus ID |sysdig_host_system_uptime |
|Legacy ID |system.uptime |
|Metric Type |gauge |
|Unit |time |
|Description |This metric is sent by the agent and represents the number of seconds since host boot time. It is not available with container granularity.|
|Addional Notes| |
sysdig_host_thread_count
|Prometheus ID |sysdig_host_thread_count|
|Legacy ID |thread.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_timeseries_count_appcheck
|Prometheus ID |sysdig_host_timeseries_count_appcheck|
|Legacy ID |metricCount.appCheck |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_timeseries_count_jmx
|Prometheus ID |sysdig_host_timeseries_count_jmx|
|Legacy ID |metricCount.jmx |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_timeseries_count_prometheus
|Prometheus ID |sysdig_host_timeseries_count_prometheus|
|Legacy ID |metricCount.prometheus |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_timeseries_count_statsd
|Prometheus ID |sysdig_host_timeseries_count_statsd|
|Legacy ID |metricCount.statsd |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_host_up
|Prometheus ID |sysdig_host_up |
|Legacy ID |uptime |
|Metric Type |gauge |
|Unit |number |
|Description |The percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.|
|Addional Notes| |
10.6 - JMX/JVM
jmx_jvm_class_loaded
|Prometheus ID |jmx_jvm_class_loaded |
|Legacy ID |jvm.class.loaded |
|Metric Type |gauge |
|Unit |number |
|Description |The number of classes that are currently loaded in the JVM. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_class_unloaded
|Prometheus ID |jmx_jvm_class_unloaded|
|Legacy ID |jvm.class.unloaded |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
jmx_jvm_gc_ConcurrentMarkSweep_count
|Prometheus ID |jmx_jvm_gc_ConcurrentMarkSweep_count |
|Legacy ID |jvm.gc.ConcurrentMarkSweep.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the Concurrent Mark-Sweep garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_ConcurrentMarkSweep_time
|Prometheus ID |jmx_jvm_gc_ConcurrentMarkSweep_time |
|Legacy ID |jvm.gc.ConcurrentMarkSweep.time |
|Metric Type |counter |
|Unit |time |
|Description |The amount of time the Concurrent Mark-Sweep garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_Copy_count
|Prometheus ID |jmx_jvm_gc_Copy_count|
|Legacy ID |jvm.gc.Copy.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
jmx_jvm_gc_Copy_time
|Prometheus ID |jmx_jvm_gc_Copy_time|
|Legacy ID |jvm.gc.Copy.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
jmx_jvm_gc_G1_Old_Generation_count
|Prometheus ID |jmx_jvm_gc_G1_Old_Generation_count|
|Legacy ID |jvm.gc.G1_Old_Generation.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
jmx_jvm_gc_G1_Old_Generation_time
|Prometheus ID |jmx_jvm_gc_G1_Old_Generation_time|
|Legacy ID |jvm.gc.G1_Old_Generation.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
jmx_jvm_gc_G1_Young_Generation_count
|Prometheus ID |jmx_jvm_gc_G1_Young_Generation_count|
|Legacy ID |jvm.gc.G1_Young_Generation.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
jmx_jvm_gc_G1_Young_Generation_time
|Prometheus ID |jmx_jvm_gc_G1_Young_Generation_time|
|Legacy ID |jvm.gc.G1_Young_Generation.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
jmx_jvm_gc_MarkSweepCompact_count
|Prometheus ID |jmx_jvm_gc_MarkSweepCompact_count|
|Legacy ID |jvm.gc.MarkSweepCompact.count |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
jmx_jvm_gc_MarkSweepCompact_time
|Prometheus ID |jmx_jvm_gc_MarkSweepCompact_time|
|Legacy ID |jvm.gc.MarkSweepCompact.time |
|Metric Type |counter |
|Unit |time |
|Description | |
|Addional Notes| |
jmx_jvm_gc_PS_MarkSweep_count
|Prometheus ID |jmx_jvm_gc_PS_MarkSweep_count |
|Legacy ID |jvm.gc.PS_MarkSweep.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the parallel scavenge Mark-Sweep old generation garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_PS_MarkSweep_time
|Prometheus ID |jmx_jvm_gc_PS_MarkSweep_time |
|Legacy ID |jvm.gc.PS_MarkSweep.time |
|Metric Type |counter |
|Unit |time |
|Description |The amount of time the parallel scavenge Mark-Sweep old generation garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_PS_Scavenge_count
|Prometheus ID |jmx_jvm_gc_PS_Scavenge_count |
|Legacy ID |jvm.gc.PS_Scavenge.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the parallel eden/survivor space garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_PS_Scavenge_time
|Prometheus ID |jmx_jvm_gc_PS_Scavenge_time |
|Legacy ID |jvm.gc.PS_Scavenge.time |
|Metric Type |counter |
|Unit |time |
|Description |The amount of time the parallel eden/survivor space garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_ParNew_count
|Prometheus ID |jmx_jvm_gc_ParNew_count |
|Legacy ID |jvm.gc.ParNew.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the parallel garbage collector has run.|
|Addional Notes| |
jmx_jvm_gc_ParNew_time
|Prometheus ID |jmx_jvm_gc_ParNew_time |
|Legacy ID |jvm.gc.ParNew.time |
|Metric Type |counter |
|Unit |time |
|Description |The amount of time the parallel garbage collector has run.|
|Addional Notes| |
jmx_jvm_heap_committed
|Prometheus ID |jmx_jvm_heap_committed |
|Legacy ID |jvm.heap.committed |
|Metric Type |counter |
|Unit |number |
|Description |The amount of memory that is currently allocated to the JVM for heap memory. Heap memory is the storage area for Java objects. The JVM may release memory to the system and Heap Committed could decrease below Heap Init; but Heap Committed can never increase above Heap Max. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_heap_init
|Prometheus ID |jmx_jvm_heap_init |
|Legacy ID |jvm.heap.init |
|Metric Type |counter |
|Unit |number |
|Description |The initial amount of memory that the JVM requests from the operating system for heap memory during startup (defined by the -Xms option). The JVM may request additional memory from the operating system and may also release memory to the system over time. The value of Heap Init may be undefined. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_heap_max
|Prometheus ID |jmx_jvm_heap_max |
|Legacy ID |jvm.heap.max |
|Metric Type |counter |
|Unit |number |
|Description |The maximum size allocation of heap memory for the JVM (defined by the -Xmx option). Any memory allocation attempt that would exceed this limit will cause an OutOfMemoryError exception to be thrown. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_heap_used
|Prometheus ID |jmx_jvm_heap_used |
|Legacy ID |jvm.heap.used |
|Metric Type |counter |
|Unit |number |
|Description |The amount of allocated heap memory (ie Heap Committed) currently in use. Heap memory is the storage area for Java objects. An object in the heap that is referenced by another object is ’live’, and will remain in the heap as long as it continues to be referenced. Objects that are no longer referenced are garbage and will be cleared out of the heap to reclaim space.|
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI. |
jmx_jvm_heap_used_percent
|Prometheus ID |jmx_jvm_heap_used_percent |
|Legacy ID |jvm.heap.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The ratio between Heap Used and Heap Committed. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
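As a quick illustration of the ratio described above, the percentage can be recomputed from the two underlying heap metrics. This is a minimal sketch, not part of the agent, and the sample values are made up:

```python
# Minimal sketch (not part of the agent): recompute the Heap Used / Heap Committed ratio.
def heap_used_percent(heap_used_bytes: float, heap_committed_bytes: float) -> float:
    """Return Heap Used as a percentage of Heap Committed."""
    if heap_committed_bytes == 0:
        return 0.0
    return 100.0 * heap_used_bytes / heap_committed_bytes

# Example with made-up values: 512 MiB used out of 2 GiB committed -> 25.0
print(heap_used_percent(512 * 1024**2, 2 * 1024**3))
```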
jmx_jvm_nonHeap_committed
|Prometheus ID |jmx_jvm_nonHeap_committed |
|Legacy ID |jvm.nonHeap.committed |
|Metric Type |counter |
|Unit |number |
|Description |The amount of memory that is currently allocated to the JVM for non-heap memory. Non-heap memory is used by Java to store loaded classes and other meta-data. The JVM may release memory to the system and Non-Heap Committed could decrease below Non-Heap Init; but Non-Heap Committed can never increase above Non-Heap Max.|
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI. |
jmx_jvm_nonHeap_init
|Prometheus ID |jmx_jvm_nonHeap_init |
|Legacy ID |jvm.nonHeap.init |
|Metric Type |counter |
|Unit |number |
|Description |The initial amount of memory that the JVM requests from the operating system for non-heap memory during startup. The JVM may request additional memory from the operating system and may also release memory to the system over time. The value of Non-Heap Init may be undefined. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_nonHeap_max
|Prometheus ID |jmx_jvm_nonHeap_max |
|Legacy ID |jvm.nonHeap.max |
|Metric Type |counter |
|Unit |number |
|Description |The maximum size allocation of non-heap memory for the JVM. This memory is used by Java to store loaded classes and other meta-data. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_nonHeap_used
|Prometheus ID |jmx_jvm_nonHeap_used |
|Legacy ID |jvm.nonHeap.used |
|Metric Type |counter |
|Unit |number |
|Description |The amount of allocated non-heap memory (ie Non-Heap Committed) currently in use. Non-heap memory is used by Java to store loaded classes and other meta-data. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_nonHeap_used_percent
|Prometheus ID |jmx_jvm_nonHeap_used_percent |
|Legacy ID |jvm.nonHeap.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The ratio between Non-Heap Used and Non-Heap Committed. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_thread_count
|Prometheus ID |jmx_jvm_thread_count |
|Legacy ID |jvm.thread.count |
|Metric Type |gauge |
|Unit |number |
|Description |The current number of live daemon and non-daemon threads. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
jmx_jvm_thread_daemon
|Prometheus ID |jmx_jvm_thread_daemon |
|Legacy ID |jvm.thread.daemon |
|Metric Type |gauge |
|Unit |number |
|Description |The current number of live daemon threads. Daemon threads are used for background supporting tasks and are only needed while normal threads are executing. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
10.7 - Kubernetes
kube_daemonset_labels
|Prometheus ID |kube_daemonset_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_daemonset_status_current_number_scheduled
|Prometheus ID |kube_daemonset_status_current_number_scheduled |
|Legacy ID |kubernetes.daemonSet.pods.scheduled |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes that are running at least one daemon Pod and are supposed to run it.|
|Addional Notes| |
kube_daemonset_status_desired_number_scheduled
|Prometheus ID |kube_daemonset_status_desired_number_scheduled |
|Legacy ID |kubernetes.daemonSet.pods.desired |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes that should be running the daemon Pod.|
|Addional Notes| |
kube_daemonset_status_number_misscheduled
|Prometheus ID |kube_daemonset_status_number_misscheduled |
|Legacy ID |kubernetes.daemonSet.pods.misscheduled |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes running a daemon Pod that are not supposed to run it.|
|Addional Notes| |
kube_daemonset_status_number_ready
|Prometheus ID |kube_daemonset_status_number_ready |
|Legacy ID |kubernetes.daemonSet.pods.ready |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes that should be running the daemon Pod and have one or more of the daemon Pod running and ready.|
|Addional Notes| |
kube_deployment_labels
|Prometheus ID |kube_deployment_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_deployment_spec_paused
|Prometheus ID |kube_deployment_spec_paused |
|Legacy ID |kubernetes.deployment.replicas.paused |
|Metric Type |gauge |
|Unit |number |
|Description |The number of paused Pods per deployment. These Pods will not be processed by the deployment controller.|
|Addional Notes| |
kube_deployment_spec_replicas
|Prometheus ID |kube_deployment_spec_replicas |
|Legacy ID |kubernetes.deployment.replicas.desired |
|Metric Type |gauge |
|Unit |number |
|Description |The number of desired Pods per deployment.|
|Addional Notes| |
kube_deployment_status_replicas
|Prometheus ID |kube_deployment_status_replicas |
|Legacy ID |kubernetes.deployment.replicas.running |
|Metric Type |gauge |
|Unit |number |
|Description |The number of running Pods per deployment.|
|Addional Notes| |
kube_deployment_status_replicas_available
|Prometheus ID |kube_deployment_status_replicas_available |
|Legacy ID |kubernetes.deployment.replicas.available |
|Metric Type |gauge |
|Unit |number |
|Description |The number of available Pods per deployment.|
|Addional Notes| |
kube_deployment_status_replicas_unavailable
|Prometheus ID |kube_deployment_status_replicas_unavailable |
|Legacy ID |kubernetes.deployment.replicas.unavailable |
|Metric Type |gauge |
|Unit |number |
|Description |The number of unavailable Pods per deployment.|
|Addional Notes| |
kube_deployment_status_replicas_updated
|Prometheus ID |kube_deployment_status_replicas_updated |
|Legacy ID |kubernetes.deployment.replicas.updated |
|Metric Type |gauge |
|Unit |number |
|Description |The number of updated Pods per deployment.|
|Addional Notes| |
kube_hpa_labels
|Prometheus ID |kube_hpa_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_hpa_spec_max_replicas
|Prometheus ID |kube_hpa_spec_max_replicas |
|Legacy ID |kubernetes.hpa.replicas.max |
|Metric Type |gauge |
|Unit |number |
|Description |Upper limit for the number of Pods that can be set by the autoscaler.|
|Addional Notes| |
kube_hpa_spec_min_replicas
|Prometheus ID |kube_hpa_spec_min_replicas |
|Legacy ID |kubernetes.hpa.replicas.min |
|Metric Type |gauge |
|Unit |number |
|Description |Lower limit for the number of Pods that can be set by the autoscaler.|
|Addional Notes| |
kube_hpa_status_current_replicas
|Prometheus ID |kube_hpa_status_current_replicas |
|Legacy ID |kubernetes.hpa.replicas.current |
|Metric Type |gauge |
|Unit |number |
|Description |Current number of replicas of Pods managed by this autoscaler.|
|Addional Notes| |
kube_hpa_status_desired_replicas
|Prometheus ID |kube_hpa_status_desired_replicas |
|Legacy ID |kubernetes.hpa.replicas.desired |
|Metric Type |gauge |
|Unit |number |
|Description |Desired number of replicas of Pods managed by this autoscaler.|
|Addional Notes| |
kube_job_complete
|Prometheus ID |kube_job_complete |
|Legacy ID |kubernetes.job.numSucceeded |
|Metric Type |gauge |
|Unit |number |
|Description |The number of Pods which reached Phase Succeeded.|
|Addional Notes| |
kube_job_failed
|Prometheus ID |kube_job_failed |
|Legacy ID |kubernetes.job.numFailed |
|Metric Type |gauge |
|Unit |number |
|Description |The number of Pods which reached Phase Failed.|
|Addional Notes| |
kube_job_info
|Prometheus ID |kube_job_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_job_labels
|Prometheus ID |kube_job_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_job_owner
|Prometheus ID |kube_job_owner|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_job_spec_completions
|Prometheus ID |kube_job_spec_completions |
|Legacy ID |kubernetes.job.completions |
|Metric Type |gauge |
|Unit |number |
|Description |The desired number of successfully finished Pods that the job should be run with.|
|Addional Notes| |
kube_job_spec_parallelism
|Prometheus ID |kube_job_spec_parallelism |
|Legacy ID |kubernetes.job.parallelism |
|Metric Type |gauge |
|Unit |number |
|Description |The maximum desired number of Pods that the job should run at any given time.|
|Addional Notes| |
kube_job_status_active
|Prometheus ID |kube_job_status_active |
|Legacy ID |kubernetes.job.status.active |
|Metric Type |gauge |
|Unit |number |
|Description |The number of actively running Pods.|
|Addional Notes| |
kube_namespace_labels
|Prometheus ID |kube_namespace_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_namespace_sysdig_count
|Prometheus ID |kube_namespace_sysdig_count|
|Legacy ID |kubernetes.namespace.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of namespaces. |
|Addional Notes| |
kube_namespace_sysdig_deployment_count
|Prometheus ID |kube_namespace_sysdig_deployment_count |
|Legacy ID |kubernetes.namespace.deployment.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of deployments per namespace.|
|Addional Notes| |
kube_namespace_sysdig_hpa_count
|Prometheus ID |kube_namespace_sysdig_hpa_count |
|Legacy ID |kubernetes.namespace.hpa.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of horizontal pod autoscalers (HPAs) per namespace.|
|Addional Notes| |
kube_namespace_sysdig_job_count
|Prometheus ID |kube_namespace_sysdig_job_count |
|Legacy ID |kubernetes.namespace.job.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of jobs per namespace.|
|Addional Notes| |
kube_namespace_sysdig_persistentvolumeclaim_count
|Prometheus ID |kube_namespace_sysdig_persistentvolumeclaim_count |
|Legacy ID |kubernetes.namespace.persistentvolumeclaim.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of persistent volume claims per namespace.|
|Addional Notes| |
kube_namespace_sysdig_pod_available_count
|Prometheus ID |kube_namespace_sysdig_pod_available_count |
|Legacy ID |kubernetes.namespace.pod.available.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of available Pods per namespace.|
|Addional Notes| |
kube_namespace_sysdig_pod_desired_count
|Prometheus ID |kube_namespace_sysdig_pod_desired_count |
|Legacy ID |kubernetes.namespace.pod.desired.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of desired Pods per namespace.|
|Addional Notes| |
kube_namespace_sysdig_pod_running_count
|Prometheus ID |kube_namespace_sysdig_pod_running_count|
|Legacy ID |kubernetes.namespace.pod.running.count |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_namespace_sysdig_replicaset_count
|Prometheus ID |kube_namespace_sysdig_replicaset_count |
|Legacy ID |kubernetes.namespace.replicaSet.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of replicaSets per namespace.|
|Addional Notes| |
kube_namespace_sysdig_resourcequota_count
|Prometheus ID |kube_namespace_sysdig_resourcequota_count |
|Legacy ID |kubernetes.namespace.resourcequota.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of resource quotas per namespace.|
|Addional Notes| |
kube_namespace_sysdig_service_count
|Prometheus ID |kube_namespace_sysdig_service_count |
|Legacy ID |kubernetes.namespace.service.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of services per namespace.|
|Addional Notes| |
kube_namespace_sysdig_statefulset_count
|Prometheus ID |kube_namespace_sysdig_statefulset_count |
|Legacy ID |kubernetes.namespace.statefulSet.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of statefulSets per namespace.|
|Addional Notes| |
kube_node_info
|Prometheus ID |kube_node_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_labels
|Prometheus ID |kube_node_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_spec_unschedulable
|Prometheus ID |kube_node_spec_unschedulable |
|Legacy ID |kubernetes.node.unschedulable |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes unavailable to schedule new Pods.|
|Addional Notes| |
kube_node_status_allocatable
|Prometheus ID |kube_node_status_allocatable|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_status_allocatable_cpu_cores
|Prometheus ID |kube_node_status_allocatable_cpu_cores |
|Legacy ID |kubernetes.node.allocatable.cpuCores |
|Metric Type |gauge |
|Unit |number |
|Description |The CPU resources of a node that are available for scheduling.|
|Addional Notes| |
kube_node_status_allocatable_memory_bytes
|Prometheus ID |kube_node_status_allocatable_memory_bytes |
|Legacy ID |kubernetes.node.allocatable.memBytes |
|Metric Type |gauge |
|Unit |data |
|Description |The memory resources of a node that are available for scheduling.|
|Addional Notes| |
kube_node_status_allocatable_pods
|Prometheus ID |kube_node_status_allocatable_pods |
|Legacy ID |kubernetes.node.allocatable.pods |
|Metric Type |gauge |
|Unit |number |
|Description |The Pod resources of a node that are available for scheduling.|
|Addional Notes| |
kube_node_status_capacity
|Prometheus ID |kube_node_status_capacity|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_status_capacity_cpu_cores
|Prometheus ID |kube_node_status_capacity_cpu_cores |
|Legacy ID |kubernetes.node.capacity.cpuCores |
|Metric Type |gauge |
|Unit |number |
|Description |The maximum CPU resources of the node.|
|Addional Notes| |
kube_node_status_capacity_memory_bytes
|Prometheus ID |kube_node_status_capacity_memory_bytes |
|Legacy ID |kubernetes.node.capacity.memBytes |
|Metric Type |gauge |
|Unit |data |
|Description |The maximum memory resources of the node.|
|Addional Notes| |
kube_node_status_capacity_pods
|Prometheus ID |kube_node_status_capacity_pods |
|Legacy ID |kubernetes.node.capacity.pods |
|Metric Type |gauge |
|Unit |number |
|Description |The maximum number of Pods of the node.|
|Addional Notes| |
kube_node_status_condition
|Prometheus ID |kube_node_status_condition|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_sysdig_disk_pressure
|Prometheus ID |kube_node_sysdig_disk_pressure |
|Legacy ID |kubernetes.node.diskPressure |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes with disk pressure.|
|Addional Notes| |
kube_node_sysdig_host
|Prometheus ID |kube_node_sysdig_host|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_node_sysdig_memory_pressure
|Prometheus ID |kube_node_sysdig_memory_pressure |
|Legacy ID |kubernetes.node.memoryPressure |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes with memory pressure.|
|Addional Notes| |
kube_node_sysdig_network_unavailable
|Prometheus ID |kube_node_sysdig_network_unavailable |
|Legacy ID |kubernetes.node.networkUnavailable |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes with network unavailable.|
|Addional Notes| |
kube_node_sysdig_ready
|Prometheus ID |kube_node_sysdig_ready |
|Legacy ID |kubernetes.node.ready |
|Metric Type |gauge |
|Unit |number |
|Description |The number of nodes that are ready.|
|Addional Notes| |
kube_persistentvolume_capacity_bytes
|Prometheus ID |kube_persistentvolume_capacity_bytes|
|Legacy ID |kubernetes.persistentvolume.storage |
|Metric Type |gauge |
|Unit |number |
|Description |The persistent volume’s capacity. |
|Addional Notes| |
kube_persistentvolume_claim_ref
|Prometheus ID |kube_persistentvolume_claim_ref|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolume_info
|Prometheus ID |kube_persistentvolume_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolume_labels
|Prometheus ID |kube_persistentvolume_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolume_status_phase
|Prometheus ID |kube_persistentvolume_status_phase|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_access_mode
|Prometheus ID |kube_persistentvolumeclaim_access_mode|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_info
|Prometheus ID |kube_persistentvolumeclaim_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_labels
|Prometheus ID |kube_persistentvolumeclaim_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_resource_requests_storage_bytes
|Prometheus ID |kube_persistentvolumeclaim_resource_requests_storage_bytes|
|Legacy ID |kubernetes.persistentvolumeclaim.requests.storage |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_status_phase
|Prometheus ID |kube_persistentvolumeclaim_status_phase|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_persistentvolumeclaim_sysdig_storage
|Prometheus ID |kube_persistentvolumeclaim_sysdig_storage |
|Legacy ID |kubernetes.persistentvolumeclaim.storage |
|Metric Type |gauge |
|Unit |number |
|Description |The actual resources of the underlying volume.|
|Addional Notes| |
kube_pod_container_info
|Prometheus ID |kube_pod_container_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_resource_limits
|Prometheus ID |kube_pod_container_resource_limits|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_resource_requests
|Prometheus ID |kube_pod_container_resource_requests|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_last_terminated_reason
|Prometheus ID |kube_pod_container_status_last_terminated_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_ready
|Prometheus ID |kube_pod_container_status_ready|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_restarts_total
|Prometheus ID |kube_pod_container_status_restarts_total|
|Legacy ID | |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_running
|Prometheus ID |kube_pod_container_status_running|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_terminated
|Prometheus ID |kube_pod_container_status_terminated|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_terminated_reason
|Prometheus ID |kube_pod_container_status_terminated_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_waiting
|Prometheus ID |kube_pod_container_status_waiting|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_container_status_waiting_reason
|Prometheus ID |kube_pod_container_status_waiting_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_info
|Prometheus ID |kube_pod_info |
|Legacy ID |kubernetes.pod.info|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_resource_limits
|Prometheus ID |kube_pod_init_container_resource_limits|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_resource_requests
|Prometheus ID |kube_pod_init_container_resource_requests|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_last_terminated_reason
|Prometheus ID |kube_pod_init_container_status_last_terminated_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_ready
|Prometheus ID |kube_pod_init_container_status_ready|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_restarts_total
|Prometheus ID |kube_pod_init_container_status_restarts_total|
|Legacy ID | |
|Metric Type |counter |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_running
|Prometheus ID |kube_pod_init_container_status_running|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_terminated
|Prometheus ID |kube_pod_init_container_status_terminated|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_terminated_reason
|Prometheus ID |kube_pod_init_container_status_terminated_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_waiting
|Prometheus ID |kube_pod_init_container_status_waiting|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_init_container_status_waiting_reason
|Prometheus ID |kube_pod_init_container_status_waiting_reason|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_labels
|Prometheus ID |kube_pod_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_owner
|Prometheus ID |kube_pod_owner|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_spec_volumes_persistentvolumeclaims_info
|Prometheus ID |kube_pod_spec_volumes_persistentvolumeclaims_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_spec_volumes_persistentvolumeclaims_readonly
|Prometheus ID |kube_pod_spec_volumes_persistentvolumeclaims_readonly|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_sysdig_containers_waiting
|Prometheus ID |kube_pod_sysdig_containers_waiting |
|Legacy ID |kubernetes.pod.containers.waiting |
|Metric Type |gauge |
|Unit |number |
|Description |The number of containers in the waiting state for the Pod.|
|Addional Notes| |
kube_pod_sysdig_resource_limits_cpu_cores
|Prometheus ID |kube_pod_sysdig_resource_limits_cpu_cores |
|Legacy ID |kubernetes.pod.resourceLimits.cpuCores |
|Metric Type |gauge |
|Unit |number |
|Description |The limit on CPU cores to be used by a container.|
|Addional Notes| |
kube_pod_sysdig_resource_limits_memory_bytes
|Prometheus ID |kube_pod_sysdig_resource_limits_memory_bytes |
|Legacy ID |kubernetes.pod.resourceLimits.memBytes |
|Metric Type |gauge |
|Unit |data |
|Description |The limit on memory to be used by a container in bytes.|
|Addional Notes| |
kube_pod_sysdig_resource_requests_cpu_cores
|Prometheus ID |kube_pod_sysdig_resource_requests_cpu_cores |
|Legacy ID |kubernetes.pod.resourceRequests.cpuCores |
|Metric Type |gauge |
|Unit |number |
|Description |The number of CPU cores requested by containers in the Pod.|
|Addional Notes| |
kube_pod_sysdig_resource_requests_memory_bytes
|Prometheus ID |kube_pod_sysdig_resource_requests_memory_bytes |
|Legacy ID |kubernetes.pod.resourceRequests.memBytes |
|Metric Type |gauge |
|Unit |data |
|Description |The number of memory bytes requested by containers in the Pod.|
|Addional Notes| |
kube_pod_sysdig_restart_count
|Prometheus ID |kube_pod_sysdig_restart_count |
|Legacy ID |kubernetes.pod.restart.count |
|Metric Type |gauge |
|Unit |number |
|Description |The number of container restarts for the Pod.|
|Addional Notes| |
kube_pod_sysdig_restart_rate
|Prometheus ID |kube_pod_sysdig_restart_rate|
|Legacy ID |kubernetes.pod.restart.rate |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_pod_sysdig_status_ready
|Prometheus ID |kube_pod_sysdig_status_ready |
|Legacy ID |kubernetes.pod.status.ready |
|Metric Type |gauge |
|Unit |number |
|Description |The number of pods ready to serve requests.|
|Addional Notes| |
kube_replicaset_labels
|Prometheus ID |kube_replicaset_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_replicaset_owner
|Prometheus ID |kube_replicaset_owner|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_replicaset_spec_replicas
|Prometheus ID |kube_replicaset_spec_replicas |
|Legacy ID |kubernetes.replicaSet.replicas.desired |
|Metric Type |gauge |
|Unit |number |
|Description |The number of desired Pods per replicaSet.|
|Addional Notes| |
kube_replicaset_status_fully_labeled_replicas
|Prometheus ID |kube_replicaset_status_fully_labeled_replicas |
|Legacy ID |kubernetes.replicaSet.replicas.fullyLabeled |
|Metric Type |gauge |
|Unit |number |
|Description |The number of fully labeled Pods per replicaSet.|
|Addional Notes| |
kube_replicaset_status_ready_replicas
|Prometheus ID |kube_replicaset_status_ready_replicas |
|Legacy ID |kubernetes.replicaSet.replicas.ready |
|Metric Type |gauge |
|Unit |number |
|Description |The number of ready Pods per replicaSet.|
|Addional Notes| |
kube_replicaset_status_replicas
|Prometheus ID |kube_replicaset_status_replicas |
|Legacy ID |kubernetes.replicaSet.replicas.running |
|Metric Type |gauge |
|Unit |number |
|Description |The number of running Pods per replicaSet.|
|Addional Notes| |
kube_resourcequota
|Prometheus ID |kube_resourcequota|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_resourcequota_sysdig_limits_cpu_hard
|Prometheus ID |kube_resourcequota_sysdig_limits_cpu_hard|
|Legacy ID |kubernetes.resourcequota.limits.cpu.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced CPU Limit quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_limits_cpu_used
|Prometheus ID |kube_resourcequota_sysdig_limits_cpu_used |
|Legacy ID |kubernetes.resourcequota.limits.cpu.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed CPU limit usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_limits_memory_hard
|Prometheus ID |kube_resourcequota_sysdig_limits_memory_hard|
|Legacy ID |kubernetes.resourcequota.limits.memory.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced memory limit quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_limits_memory_used
|Prometheus ID |kube_resourcequota_sysdig_limits_memory_used |
|Legacy ID |kubernetes.resourcequota.limits.memory.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed memory limit usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_persistentvolumeclaims_hard
|Prometheus ID |kube_resourcequota_sysdig_persistentvolumeclaims_hard|
|Legacy ID |kubernetes.resourcequota.persistentvolumeclaims.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced Persistentvolumeclaim quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_persistentvolumeclaims_used
|Prometheus ID |kube_resourcequota_sysdig_persistentvolumeclaims_used |
|Legacy ID |kubernetes.resourcequota.persistentvolumeclaims.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed Persistentvolumeclaim usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_pods_hard
|Prometheus ID |kube_resourcequota_sysdig_pods_hard|
|Legacy ID |kubernetes.resourcequota.pods.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced Pod quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_pods_used
|Prometheus ID |kube_resourcequota_sysdig_pods_used |
|Legacy ID |kubernetes.resourcequota.pods.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed Pod usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_requests_cpu_hard
|Prometheus ID |kube_resourcequota_sysdig_requests_cpu_hard|
|Legacy ID |kubernetes.resourcequota.requests.cpu.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced CPU request quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_requests_cpu_used
|Prometheus ID |kube_resourcequota_sysdig_requests_cpu_used |
|Legacy ID |kubernetes.resourcequota.requests.cpu.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed CPU request usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_requests_memory_hard
|Prometheus ID |kube_resourcequota_sysdig_requests_memory_hard|
|Legacy ID |kubernetes.resourcequota.requests.memory.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced memory request quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_requests_memory_used
|Prometheus ID |kube_resourcequota_sysdig_requests_memory_used |
|Legacy ID |kubernetes.resourcequota.requests.memory.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed memory request usage per namespace.|
|Addional Notes| |
kube_resourcequota_sysdig_services_hard
|Prometheus ID |kube_resourcequota_sysdig_services_hard|
|Legacy ID |kubernetes.resourcequota.services.hard |
|Metric Type |gauge |
|Unit |number |
|Description |Enforced service quota per namespace. |
|Addional Notes| |
kube_resourcequota_sysdig_services_used
|Prometheus ID |kube_resourcequota_sysdig_services_used |
|Legacy ID |kubernetes.resourcequota.services.used |
|Metric Type |gauge |
|Unit |number |
|Description |Current observed service usage per namespace.|
|Addional Notes| |
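The resourcequota metrics above come in hard/used pairs, so quota utilization for a namespace is simply the used value divided by the enforced (hard) value. A minimal sketch, assuming you already have the two values; the sample numbers are invented:

```python
# Illustrative only: percentage of an enforced quota currently consumed.
def quota_utilization_percent(used: float, hard: float) -> float:
    if hard == 0:
        return 0.0
    return 100.0 * used / hard

# Example: kube_resourcequota_sysdig_pods_used = 45, kube_resourcequota_sysdig_pods_hard = 60 -> 75.0
print(quota_utilization_percent(45, 60))
```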
kube_service_info
|Prometheus ID |kube_service_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_service_labels
|Prometheus ID |kube_service_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_statefulset_labels
|Prometheus ID |kube_statefulset_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_statefulset_replicas
|Prometheus ID |kube_statefulset_replicas |
|Legacy ID |kubernetes.statefulSet.replicas |
|Metric Type |gauge |
|Unit |number |
|Description |Desired number of replicas of the given Template.|
|Addional Notes| |
kube_statefulset_status_replicas
|Prometheus ID |kube_statefulset_status_replicas |
|Legacy ID |kubernetes.statefulSet.status.replicas |
|Metric Type |gauge |
|Unit |number |
|Description |Number of Pods created by the StatefulSet controller.|
|Addional Notes| |
kube_statefulset_status_replicas_current
|Prometheus ID |kube_statefulset_status_replicas_current |
|Legacy ID |kubernetes.statefulSet.status.replicas.current |
|Metric Type |gauge |
|Unit |number |
|Description |The number of Pods created by the StatefulSet controller from the StatefulSet version indicated by currentRevision.|
|Addional Notes| |
kube_statefulset_status_replicas_ready
|Prometheus ID |kube_statefulset_status_replicas_ready |
|Legacy ID |kubernetes.statefulSet.status.replicas.ready |
|Metric Type |gauge |
|Unit |number |
|Description |Number of Pods created by the StatefulSet controller that have a Ready Condition.|
|Addional Notes| |
kube_statefulset_status_replicas_updated
|Prometheus ID |kube_statefulset_status_replicas_updated |
|Legacy ID |kubernetes.statefulSet.status.replicas.updated |
|Metric Type |gauge |
|Unit |number |
|Description |Number of Pods created by the StatefulSet controller from the StatefulSet version indicated by updateRevision.|
|Addional Notes| |
kube_storageclass_created
|Prometheus ID |kube_storageclass_created|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_storageclass_info
|Prometheus ID |kube_storageclass_info|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_storageclass_labels
|Prometheus ID |kube_storageclass_labels|
|Legacy ID | |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_pods_status_phase
|Prometheus ID |kube_workload_pods_status_phase |
|Legacy ID |kubernetes.workload.pods.status.phase|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_status_replicas_misscheduled
|Prometheus ID |kube_workload_status_replicas_misscheduled |
|Legacy ID |kubernetes.workload.status.replicas.misscheduled|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_status_replicas_scheduled
|Prometheus ID |kube_workload_status_replicas_scheduled |
|Legacy ID |kubernetes.workload.status.replicas.scheduled|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_status_replicas_updated
|Prometheus ID |kube_workload_status_replicas_updated |
|Legacy ID |kubernetes.workload.status.replicas.updated|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_status_running
|Prometheus ID |kube_workload_status_running |
|Legacy ID |kubernetes.workload.status.running|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
kube_workload_status_unavailable
|Prometheus ID |kube_workload_status_unavailable |
|Legacy ID |kubernetes.workload.status.unavailable|
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
10.8 - Network
sysdig_connection_net_connection_in_count
|Prometheus ID |sysdig_connection_net_connection_in_count |
|Legacy ID |net.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established client (inbound) connections. |
|Addional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_connection_net_connection_out_count
|Prometheus ID |sysdig_connection_net_connection_out_count |
|Legacy ID |net.connection.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established server (outbound) connections. |
|Addional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_connection_net_connection_total_count
|Prometheus ID |sysdig_connection_net_connection_total_count |
|Legacy ID |net.connection.count.total |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.|
|Addional Notes|This metric is especially useful when segmented by protocol, port or process. |
sysdig_connection_net_in_bytes
|Prometheus ID |sysdig_connection_net_in_bytes |
|Legacy ID |net.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |The number of inbound network bytes. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_connection_net_out_bytes
|Prometheus ID |sysdig_connection_net_out_bytes |
|Legacy ID |net.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |The number of outbound network bytes. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_connection_net_request_count
|Prometheus ID |sysdig_connection_net_request_count |
|Legacy ID |net.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.|
|Addional Notes| |
sysdig_connection_net_request_in_count
|Prometheus ID |sysdig_connection_net_request_in_count |
|Legacy ID |net.request.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of inbound network requests.|
|Addional Notes| |
sysdig_connection_net_request_in_time
|Prometheus ID |sysdig_connection_net_request_in_time |
|Legacy ID |net.request.time.in |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve an inbound request.|
|Addional Notes| |
sysdig_connection_net_request_out_count
|Prometheus ID |sysdig_connection_net_request_out_count |
|Legacy ID |net.request.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of outbound network requests.|
|Addional Notes| |
sysdig_connection_net_request_out_time
|Prometheus ID |sysdig_connection_net_request_out_time |
|Legacy ID |net.request.time.out |
|Metric Type |counter |
|Unit |time |
|Description |The average time spent waiting for an outbound request.|
|Addional Notes| |
sysdig_connection_net_request_time
|Prometheus ID |sysdig_connection_net_request_time |
|Legacy ID |net.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve a network request.|
|Addional Notes| |
sysdig_connection_net_total_bytes
|Prometheus ID |sysdig_connection_net_total_bytes |
|Legacy ID |net.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |The total network bytes, including both inbound and outbound connections. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
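The "Segment by" notes above refer to the Sysdig Monitor UI. If these Prometheus IDs are also scraped into a Prometheus-compatible store, the same segmentation can be expressed as a query; the endpoint URL and the host label name below are assumptions for illustration only, not something this reference guarantees:

```python
# Hypothetical example: per-host inbound byte rate from a Prometheus-compatible HTTP API.
# The URL and the "host" label are assumptions; adjust them to your environment.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed endpoint
query = 'sum by (host) (rate(sysdig_connection_net_in_bytes[5m]))'

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("host", "<unknown>"), series["value"][1])
```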
10.9 - Program
sysdig_program_cpu_cores_used
|Prometheus ID |sysdig_program_cpu_cores_used |
|Legacy ID |cpu.cores.used |
|Metric Type |gauge |
|Unit |number |
|Description |The CPU core usage of each program is obtained from cgroups, and is equal to the number of cores used by the program. For example, if a program uses two of an available four cores, the value of sysdig_program_cpu_cores_used
will be two.|
|Addional Notes| |
sysdig_program_cpu_cores_used_percent
|Prometheus ID |sysdig_program_cpu_cores_used_percent |
|Legacy ID |cpu.cores.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The CPU core usage percent for each program is obtained from cgroups, and is equal to the number of cores multiplied by 100. For example, if a program uses three cores, the value of sysdig_program_cpu_cores_used_percent
would be 300%.|
|Addional Notes| |
sysdig_program_cpu_used_percent
|Prometheus ID |sysdig_program_cpu_used_percent |
|Legacy ID |cpu.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The CPU usage for each program is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and containers.|
|Addional Notes| |
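The three program CPU metrics above differ only in how the raw per-program core usage from cgroups is scaled. The sketch below simply restates the worked examples from their descriptions (two of four cores, three cores, two of six cores); the function names are ours, not the agent's:

```python
# Illustrative restatement of the three program CPU metrics described above.
def cpu_cores_used(cores_used: float) -> float:
    # sysdig_program_cpu_cores_used: the raw number of cores consumed.
    return cores_used

def cpu_cores_used_percent(cores_used: float) -> float:
    # sysdig_program_cpu_cores_used_percent: cores consumed multiplied by 100.
    return cores_used * 100.0

def cpu_used_percent(cores_used: float, cores_on_host: int) -> float:
    # sysdig_program_cpu_used_percent: normalized by the number of host cores.
    return 100.0 * cores_used / cores_on_host

print(cpu_cores_used(2.0))                  # 2 of 4 cores -> 2.0
print(cpu_cores_used_percent(3.0))          # 3 cores -> 300.0
print(round(cpu_used_percent(2.0, 6), 2))   # 2 of 6 cores -> 33.33
```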
sysdig_program_fd_used_percent
|Prometheus ID |sysdig_program_fd_used_percent |
|Legacy ID |fd.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of used file descriptors out of the maximum available. |
|Addional Notes|Usually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better use for alerts.|
sysdig_program_file_error_open_count
|Prometheus ID |sysdig_program_file_error_open_count |
|Legacy ID |file.error.open.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of errors caused by opening files. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_error_total_count
|Prometheus ID |sysdig_program_file_error_total_count |
|Legacy ID |file.error.total.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of errors caused by file access. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_in_bytes
|Prometheus ID |sysdig_program_file_in_bytes |
|Legacy ID |file.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes read from file. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_in_iops
|Prometheus ID |sysdig_program_file_in_iops |
|Legacy ID |file.iops.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of file read operations per second. |
|Addional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_program_file_in_time
|Prometheus ID |sysdig_program_file_in_time |
|Legacy ID |file.time.in |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file reading. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_open_count
|Prometheus ID |sysdig_program_file_open_count |
|Legacy ID |file.open.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of times the file has been opened.|
|Addional Notes| |
sysdig_program_file_out_bytes
|Prometheus ID |sysdig_program_file_out_bytes |
|Legacy ID |file.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes written to file. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_out_iops
|Prometheus ID |sysdig_program_file_out_iops |
|Legacy ID |file.iops.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of file write operations per second. |
|Addional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_program_file_out_time
|Prometheus ID |sysdig_program_file_out_time |
|Legacy ID |file.time.out |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file writing. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_total_bytes
|Prometheus ID |sysdig_program_file_total_bytes |
|Legacy ID |file.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |The number of bytes read from and written to file. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_file_total_iops
|Prometheus ID |sysdig_program_file_total_iops |
|Legacy ID |file.iops.total |
|Metric Type |counter |
|Unit |number |
|Description |The number of read and write file operations per second. |
|Addional Notes|This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.|
sysdig_program_file_total_time
|Prometheus ID |sysdig_program_file_total_time |
|Legacy ID |file.time.total |
|Metric Type |counter |
|Unit |time |
|Description |The time spent in file I/O. |
|Addional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_info
|Prometheus ID |sysdig_program_info|
|Legacy ID |info |
|Metric Type |gauge |
|Unit |number |
|Description | |
|Addional Notes| |
sysdig_program_memory_used_bytes
|Prometheus ID |sysdig_program_memory_used_bytes |
|Legacy ID |memory.bytes.used |
|Metric Type |gauge |
|Unit |data |
|Description |The amount of physical memory currently in use. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.|
sysdig_program_memory_used_percent
|Prometheus ID |sysdig_program_memory_used_percent |
|Legacy ID |memory.used.percent |
|Metric Type |gauge |
|Unit |percent |
|Description |The percentage of physical memory in use. |
|Addional Notes|By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_net_connection_in_count
|Prometheus ID |sysdig_program_net_connection_in_count |
|Legacy ID |net.connection.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established client (inbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_program_net_connection_out_count
|Prometheus ID |sysdig_program_net_connection_out_count |
|Legacy ID |net.connection.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established server (outbound) connections. |
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_program_net_connection_total_count
|Prometheus ID |sysdig_program_net_connection_total_count |
|Legacy ID |net.connection.count.total |
|Metric Type |counter |
|Unit |number |
|Description |The number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.|
|Additional Notes|This metric is especially useful when segmented by protocol, port or process.|
sysdig_program_net_error_count
|Prometheus ID |sysdig_program_net_error_count |
|Legacy ID |net.error.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of network errors that occurred per second. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_net_in_bytes
|Prometheus ID |sysdig_program_net_in_bytes |
|Legacy ID |net.bytes.in |
|Metric Type |counter |
|Unit |data |
|Description |The number of inbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_net_out_bytes
|Prometheus ID |sysdig_program_net_out_bytes |
|Legacy ID |net.bytes.out |
|Metric Type |counter |
|Unit |data |
|Description |The number of outbound network bytes. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_net_request_count
|Prometheus ID |sysdig_program_net_request_count |
|Legacy ID |net.request.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.|
|Additional Notes| |
sysdig_program_net_request_in_count
|Prometheus ID |sysdig_program_net_request_in_count |
|Legacy ID |net.request.count.in |
|Metric Type |counter |
|Unit |number |
|Description |The number of inbound network requests.|
|Additional Notes| |
sysdig_program_net_request_in_time
|Prometheus ID |sysdig_program_net_request_in_time |
|Legacy ID |net.request.time.in |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve an inbound request.|
|Additional Notes| |
sysdig_program_net_request_out_count
|Prometheus ID |sysdig_program_net_request_out_count |
|Legacy ID |net.request.count.out |
|Metric Type |counter |
|Unit |number |
|Description |The number of outbound network requests.|
|Additional Notes| |
sysdig_program_net_request_out_time
|Prometheus ID |sysdig_program_net_request_out_time |
|Legacy ID |net.request.time.out |
|Metric Type |counter |
|Unit |time |
|Description |The average time spent waiting for an outbound request.|
|Additional Notes| |
sysdig_program_net_request_time
|Prometheus ID |sysdig_program_net_request_time |
|Legacy ID |net.request.time |
|Metric Type |counter |
|Unit |time |
|Description |The average time to serve a network request.|
|Additional Notes| |
sysdig_program_net_tcp_queue_len
|Prometheus ID |sysdig_program_net_tcp_queue_len |
|Legacy ID |net.tcp.queue.len |
|Metric Type |counter |
|Unit |number |
|Description |The length of the TCP request queue.|
|Additional Notes| |
sysdig_program_net_total_bytes
|Prometheus ID |sysdig_program_net_total_bytes |
|Legacy ID |net.bytes.total |
|Metric Type |counter |
|Unit |data |
|Description |The total number of network bytes, across both inbound and outbound connections, for the program. |
|Additional Notes|By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.|
sysdig_program_proc_count
|Prometheus ID |sysdig_program_proc_count |
|Legacy ID |proc.count |
|Metric Type |counter |
|Unit |number |
|Description |The number of processes on a host or container.|
|Additional Notes| |
sysdig_program_syscall_count
|Prometheus ID |sysdig_program_syscall_count |
|Legacy ID |syscall.count |
|Metric Type |gauge |
|Unit |number |
|Description |The total number of syscalls seen. |
|Additional Notes|Syscalls are resource intensive. This metric tracks how many have been made by a given process or container.|
sysdig_program_thread_count
|Prometheus ID |sysdig_program_thread_count |
|Legacy ID |thread.count |
|Metric Type |counter |
|Unit |number |
|Description |The total number of threads running in a program.|
|Additional Notes| |
sysdig_program_timeseries_count_appcheck
|Prometheus ID |sysdig_program_timeseries_count_appcheck|
|Legacy ID |metricCount.appCheck |
|Metric Type |gauge |
|Unit |number |
|Description |The number of app check custom metrics. |
|Additional Notes| |
sysdig_program_timeseries_count_jmx
|Prometheus ID |sysdig_program_timeseries_count_jmx|
|Legacy ID |metricCount.jmx |
|Metric Type |gauge |
|Unit |number |
|Description |The number of JMX custom metrics. |
|Additional Notes| |
sysdig_program_timeseries_count_prometheus
|Prometheus ID |sysdig_program_timeseries_count_prometheus|
|Legacy ID |metricCount.prometheus |
|Metric Type |gauge |
|Unit |number |
|Description |The number of Prometheus custom metrics. |
|Additional Notes| |
sysdig_program_up
|Prometheus ID |sysdig_program_up |
|Legacy ID |uptime |
|Metric Type |gauge |
|Unit |number |
|Description |The percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.|
|Additional Notes| |
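The value described above is a percentage of the visualized window during which the entity was not observed. As a rough, illustrative sketch (not how the backend computes it), the same idea can be expressed over a series of per-interval liveness samples:

```python
# Illustrative sketch: percentage of a time window during which an entity
# was down, given one liveness sample per collection interval.
def down_percent(samples: list[bool]) -> float:
    if not samples:
        return 0.0
    down = sum(1 for alive in samples if not alive)
    return 100.0 * down / len(samples)

print(down_percent([True, True, False, True]))  # 25.0 -> down for 1 of 4 samples
```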
sysdig_program_cpu_used_percent
|Prometheus ID |sysdig_program_cpu_used_percent |
|Legacy ID |cpu.used.percent |
|Metric Type |- |
|Unit |- |
|Description |The CPU usage for each program is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage.|
|Additional Notes| |
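A minimal sketch of the normalization described above, assuming the program's CPU time over the sample interval is already known (for example, from the cgroup's CPU accounting); this is illustrative only, not the agent's code:

```python
# Illustrative sketch: normalize a program's CPU usage by the number of host cores.
def cpu_used_percent(cpu_seconds: float, interval_seconds: float, host_cores: int) -> float:
    cores_used = cpu_seconds / interval_seconds   # e.g. 2.0 cores kept busy
    return 100.0 * cores_used / host_cores        # e.g. 2/6 * 100 = 33.33%

print(round(cpu_used_percent(cpu_seconds=20.0, interval_seconds=10.0, host_cores=6), 2))  # 33.33
```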
sysdig_program_memory_used_percent
|Prometheus ID |sysdig_program_memory_used_percent |
|Legacy ID |memory.used.percent |
|Metric Type |- |
|Unit |- |
|Description |The percentage of swap memory used. By default, this metric displays the average value for the defined scope. For example, if the scope is set to a group of machines, the metric value will be the average value for the whole group.|
|Additional Notes| |
10.9.1 Program
sysdig_program_cpu_cores_used
Metadata | Value |
---|---|
publicId | sysdig_program_cpu_cores_used |
legacyId | cpu.cores.used |
description | The CPU core usage of each program is obtained from cgroups, and is equal to the number of cores used by the program. For example, if a program uses two of an available four cores, the value of sysdig_program_cpu_cores_used will be two. |
notes | |
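One way to picture the "number of cores used" figure described above is as CPU time consumed per unit of wall-clock time over the sample interval. A sketch under that assumption (not the agent's implementation):

```python
# Illustrative sketch: "cores used" is CPU time consumed per unit of wall time.
def cpu_cores_used(cpu_seconds: float, interval_seconds: float) -> float:
    return cpu_seconds / interval_seconds

print(cpu_cores_used(cpu_seconds=20.0, interval_seconds=10.0))  # 2.0 -> two cores kept busy
```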
sysdig_program_cpu_cores_used_percent
Metadata | Value |
---|---|
publicId | sysdig_program_cpu_cores_used_percent |
legacyId | cpu.cores.used.percent |
description | The CPU core usage percent for each program is obtained from cgroups, and is equal to the number of cores used by the program multiplied by 100. For example, if a program uses three cores, the value of sysdig_program_cpu_cores_used_percent would be 300%. |
notes | |
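Following the description above, the percentage form is simply the cores-used value scaled by 100, so it can exceed 100% for programs that keep more than one core busy (illustrative sketch):

```python
# Illustrative sketch: cores used, expressed as a percentage (100% per core).
def cpu_cores_used_percent(cores_used: float) -> float:
    return 100.0 * cores_used

print(cpu_cores_used_percent(3.0))  # 300.0 -> a program keeping three cores busy
```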
sysdig_program_cpu_used_percent
Metadata | Value |
---|---|
publicId | sysdig_program_cpu_used_percent |
legacyId | cpu.used.percent |
description | The CPU usage for each program is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and containers. |
notes | |
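Putting the three cgroup-derived CPU metrics side by side for the six-core example above; this worked sketch just restates the arithmetic from the descriptions in this section:

```python
# Worked sketch for the six-core host example above: a program that keeps
# two cores busy over a 10-second interval.
cpu_seconds, interval, host_cores = 20.0, 10.0, 6

cores_used = cpu_seconds / interval             # sysdig_program_cpu_cores_used         -> 2.0
cores_used_percent = 100.0 * cores_used         # sysdig_program_cpu_cores_used_percent -> 200.0
used_percent = 100.0 * cores_used / host_cores  # sysdig_program_cpu_used_percent       -> 33.33

print(cores_used, cores_used_percent, round(used_percent, 2))
```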
sysdig_program_fd_used_percent
Metadata | Value |
---|---|
publicId | sysdig_program_fd_used_percent |
legacyId | fd.used.percent |
description | The percentage of used file descriptors out of the maximum available. |
notes | Usually, when a process reaches its FD limit, it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or, even better, use for alerts. |
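For intuition, a per-process file-descriptor percentage like the one described above can be approximated on Linux from /proc and the process's RLIMIT_NOFILE soft limit. This is an illustrative sketch, not how the agent measures it:

```python
import os
import resource

# Illustrative sketch: percentage of file descriptors in use for this process.
# Assumes a finite RLIMIT_NOFILE soft limit.
def fd_used_percent() -> float:
    open_fds = len(os.listdir("/proc/self/fd"))             # currently open descriptors
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return 100.0 * open_fds / soft_limit

print(round(fd_used_percent(), 2))
```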
sysdig_program_file_error_open_count
Metadata | Value |
---|---|
publicId | sysdig_program_file_error_open_count |
legacyId | file.error.open.count |
description | The number of errors caused by opening files. |
notes | By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI. |
sysdig_program_file_error_total_count
Metadata | Value |
---|---|
publicId | sysdig_program_file_error_total_count |