Sysdig Monitor

Sysdig Monitor is part of Sysdig’s container intelligence platform. Sysdig uses a unified platform to deliver security, monitoring, and forensics in a container- and microservices-friendly architecture. Sysdig Monitor is a monitoring, troubleshooting, and alerting suite offering deep, process-level visibility into dynamic, distributed production environments. Sysdig Monitor captures, correlates, and visualizes full-stack data, and provides dashboards for monitoring.

In the background, the Sysdig agent lives on the hosts being monitored and collects the appropriate metrics and events. Out of the box, the agent reports on a wide variety of pre-defined metrics. Additional metrics and custom parameters are available via agent configuration files. For more information, see the Sysdig Agent Documentation.

Major Benefits

  • Explore and monitor application performance at any level of the infrastructure stack

  • Correlate metrics and events, and compare with past performance

  • Observe platform state and health

  • Auto-detect anomalies

  • Visualize and share performance metrics with out-of-the-box and custom dashboards

  • Powerful, tuned, and flexible alerts

  • Proactively alert on incidents across services, hosts, containers and so on

  • Trigger system captures for offline troubleshooting and forensics

  • Analyze system call activity to accelerate problem resolution

Key Components

Monitor Interface

Log into the Sysdig Monitor interface, and get started with the basics.

Advisor

Operate and troubleshoot Kubernetes infrastructure easily with a curated and unified view of metrics, alerts, and events.

Explore the Infrastructure

Dive into Sysdig Monitor with a deeper understanding of the Explore module, data aggregation, and how to break down data.

This feature is available in the Enterprise tier of the Sysdig product. See https://sysdig.com/pricing for details, or contact sales@sysdig.com.

Metrics

The backbone of monitoring: learn more about metrics, integrate external platforms, and explore the complete metrics dictionary.

Alerts

Learn how to build alerts to notify users of infrastructure events, changes in behavior, and unauthorized access.

Dashboards

Learn how to build a custom dashboard, configure the default ones, or reconfigure panels to best suit your infrastructure.

Integrations

Integrate inbound and outbound data sources, ranging from platforms and orchestrators to a wide range of applications.

Events

Integrate Docker and Kubernetes events, customize event notifications, and review infrastructure history.

Captures

Create capture files containing system calls and other OS events to assist monitoring and troubleshooting the infrastructure.

1 - Getting Started with Sysdig Monitor

Sysdig Monitor allows you to maximize the visibility of your Kubernetes environments with native Prometheus support. You can troubleshoot issues faster with Sysdig’s eBPF-derived metrics, out-of-the-box dashboards, and alerts.

You can start a Sysdig Monitor Free Trial to quickly connect a single cloud account to Sysdig and begin with Prometheus-compatible Kubernetes and cloud monitoring.

Once connected, the Get Started page shows a subset of the options available in the 30-day trial or Enterprise.

Get Started Page

The Get Started page targets the key steps to ensure users are getting the most value out of Sysdig Monitor. The page is updated with new steps as users complete tasks and Sysdig adds new features to the product.

The Get Started page also serves as a linking page for

  • Documentation

  • Release Notes

  • The Sysdig Blog

  • Self-Paced Training

  • Support

Users can access the Get Started page at any time by clicking the rocketship in the side menu.

Install the Agent

Installing the agent on your infrastructure allows Sysdig to collect data for monitoring and security purposes. For more information, see Quick Install Sysdig Agent on Kubernetes.
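
If you install the agent with Helm, a values file similar to the minimal sketch below is typically all you need. The key names shown (sysdig.accessKey, clusterName) are assumptions based on the sysdig.settings convention used later in this document; confirm them against the chart’s documented values and the linked installation guide before use.

sysdig:
  accessKey: <YOUR_SYSDIG_ACCESS_KEY>   # agent access key from the Sysdig UI (placeholder)
clusterName: my-cluster                 # hypothetical label used to group hosts by cluster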

(Optional) Connect Your Prometheus Servers

Connecting your Prometheus servers to the Sysdig-managed Prometheus service lets you leverage Sysdig for scalable long-term storage of your Prometheus metrics, PromQL dashboards, centralized querying, and PromQL-based alerting. For more information, see Collect Prometheus Metrics.
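
For illustration, a Prometheus server is typically connected through its standard remote_write configuration. The endpoint URL and authorization shown below are placeholders and assumptions; see Collect Prometheus Metrics for the exact values for your region.

# prometheus.yml (excerpt) -- endpoint URL and token are placeholders
remote_write:
  - url: https://<sysdig-region-endpoint>/prometheus/remote/write
    bearer_token: <SYSDIG_API_TOKEN>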

Invite Your Team

Invite members of your team to use this Sysdig Monitor account. They are notified by email, a user account is created for them, and they are added to the default team with the Advanced User role automatically assigned.

Monitor Your Kubernetes Clusters

Get a unified view of the health, risk, and capacity of your Kubernetes infrastructure in a multi- and hybrid-cloud environment. For more information, see Dashboard Templates.

Workload Status & Performance

Get deep insight into your Kubernetes workloads faster with the Workload Status & Performance Dashboard.

Pod Status & Performance

Drill down to workload pods and monitor pod-level resource usage and troubleshoot performance issues with the Pod Status & Performance Dashboard.

Cluster Capacity Planning

Verify that your cluster is sized properly for the applications deployed on it, identify resource over-commitment that can lead to pod evictions, and discover unused requested resources or containers without limits, using the Cluster Capacity Planning Dashboard.

Cluster/Namespace Available Resources

Determine if your cluster has the capacity to deploy a new workload and ascertain if increasing CPU or memory requests or placing limits on an existing application is necessary with the Cluster/Namespace Available Resources Dashboard.

Pod Rightsizing & Workload Capacity Optimization

Identify resource-hogging workloads while optimizing your capacity with the Pod Rightsizing & Workload Capacity Optimization Dashboard.

Set Up Alert

Sysdig Monitor emits alerts to proactively notify you of events, anomalies, or any incident that requires attention. The alerting system provides out-of-the-box push gateways for regular email, Slack, cloud-provider notification queues, and custom webhooks, among others. See Alerts.

Configure a Notification Channel

Alerts fire in Sysdig Monitor when event thresholds are crossed and can be sent over a variety of supported notification channels. Integrate Sysdig with your notification dispatchers and incident management workflows. See Set Up Notification Channels.

Turn on Alerts

Turn on recommended alerts from our Alerts Library. Customize our recommendations or create your own alerts from scratch. See Alerts Library.

Monitor Your Services

Create a Dashboard

Create customized dashboards to display the most relevant views and metrics for the infrastructure in a single location. Each dashboard is composed of a series of panels configured to display specific data in a number of different formats. See Dashboards.

Get Started with PromQL

Write PromQL queries more easily with the form-based querying available in Sysdig Monitor. All metrics are enriched with cloud and Kubernetes metadata, avoiding complicated PromQL joins. See Using PromQL.

Monitoring Integrations

Sysdig discovers services running in your infrastructure and recommends appropriate Monitoring Integrations that allow you to collect service-specific metrics. The integration bundle includes out-of-the-box dashboards and default alerts. See Configure Monitoring Integrations.

Advanced Actions

Integrate development tools:

2 - Advisor

Advisor brings your metrics, alerts, and events into a focused and curated view to help you operate and troubleshoot Kubernetes infrastructure.
Advisor is available only to SaaS users. The feature is not currently available for on-prem environments.

Advisor presents your infrastructure grouped by cluster, namespace, workload, and pod. You cannot currently configure a custom grouping. Depending on the selection, you will see different curated views and you can switch between the following:

  • Advisories
  • Triggered alerts
  • Events from Kubernetes, container engines, and custom user events
  • Cluster usage and capacity
  • Key golden signals (requests, latency, errors) derived from system calls
  • Kubernetes metrics about the health and status of Kubernetes objects
  • Container live logs
  • Process and network telemetry (CPU, memory, network connections, etc.)
  • Monitoring Integrations

The time window of metrics displayed on Advisor is the last 1 hour of collected data. To see historical values for a metric, drill down to a related dashboard or explore a metric using the Explore UI.

Advisories

Advisories evaluate the thousands of data points sent by the Sysdig agent and display a prioritized view of key problems in your infrastructure that affect the health and availability of your clusters and the workloads running on them.

When you select an advisory, relevant information related to the issue is surfaced, such as metrics, events, live logs, and remediation guidance, enabling you to pinpoint and resolve problems faster. Following SRE best practices, Advisories focus on underlying causes rather than symptoms, and you may not necessarily want to alert on them.

Example Issues Detected

CrashLoopBackOff

A CrashLoopBackOff means that a pod is starting, crashing, starting again, and then crashing again. This can leave applications degraded or unavailable.

Container Error

A persistent application error results in containers being terminated. An application error, or exit code 1, means the container was terminated due to an application problem.

CPU Throttling

Containers are hitting their CPU limit and being throttled. CPU throttling does not kill the container, but the container is starved of CPU, resulting in application slowdown.

OOM Kill

When a container reaches its memory limit, it is terminated with an OOMKilled status, or exit code 137. This can lead to application instability or unavailability.

Image Pull Error

A container is failing to start because it cannot pull its image.

Advisories are automatically resolved when the problem is no longer detected. You cannot customize the Advisories evaluated. These are fully managed by Sysdig.
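
Several of these issues trace back to container resource settings. The hypothetical pod spec below shows where the relevant limits live: usage above the memory limit ends in an OOM kill (exit code 137), hitting the CPU limit causes throttling rather than termination, and a bad image reference surfaces as an image pull error.

apiVersion: v1
kind: Pod
metadata:
  name: example-app                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0    # a wrong tag or registry here causes an image pull error
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m                          # usage above this limit is throttled, not killed
          memory: 512Mi                      # usage above this limit triggers an OOM kill (exit code 137)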

Live Logs

Advisor can display live logs for a container, which is the equivalent of running kubectl logs. This is useful for troubleshooting application errors or problems such as pods in a CrashLoopBackOff state.

When you select a Pod, a Logs tab appears. If there are multiple containers within a pod, you can select the container whose logs you want to view. Once requested, logs are streamed for 3 minutes before the session is automatically closed (you can simply restart streaming if necessary).

Live logs are tailed on-demand and thus not persisted. After a session is closed they are no longer accessible.

Manage Access to Live Logs

By default, live logs are available to users within the scope of their Sysdig Team. Use Custom Roles to manage live logs permissions.

Configure Agent for Live Logs

Live logs are enabled by default in agent 12.7.0 or newer. Agent 12.6.0 supports live logs, but the feature must be manually enabled by setting enabled: true. Older versions of the Sysdig Agent do not support live logs.

Live logs can be enabled or disabled within the agent configuration.

To turn live logs off globally for a cluster, add the following in the dragent.yaml file:

live_logs:
  enabled: false

If using Helm, this is configured via sysdig.settings. For example:

sysdig:
  # Advanced settings. Any option in here will be directly translated into dragent.yaml in the Configmap
  settings:
    live_logs:
      enabled: false

Agent Errors

Live Logs reports the following agent errors:

Error Code: 401
Cause: The kubelet doesn’t have the bearer token authorization enabled.

Error Code: 403
Cause: The agent ClusterRole (sysdig-agent) doesn’t have the node/proxy permission.
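
For reference, the 403 case corresponds to a missing Kubernetes RBAC rule. A minimal sketch of that rule, assuming the ClusterRole is named sysdig-agent as in the error description, looks like this; the actual manifest shipped with the agent may contain additional rules.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sysdig-agent
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy"]   # required to stream container logs through the kubelet
    verbs: ["get"]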

2.1 - Overview

Overview leverages Sysdig’s unified data platform to monitor, secure, and troubleshoot your hosts and Kubernetes clusters and workloads.

The module provides a unified view of the health, risk, and capacity of your Kubernetes infrastructure— a single pane of glass for host machines as well as Kubernetes Clusters, Nodes, Namespaces, and Workloads across a multi- and hybrid-cloud environment. You can easily filter by any of these entities and view associated events and health data.

Overview shows metrics prioritized by event count and severity, allowing you to get to the root cause of the problem faster. Sysdig Monitor polls the infrastructure data every 10 minutes and refreshes the metrics and events on the Overview page with the system health.

Key Benefits

Overview provides the following benefits:

  • Show a unified view of the health, risk, resource use, and capacity of your infrastructure environment at scale

    • Render metrics, security events, compliance CIS benchmark results, and contextual events in a single location

    • Eliminate the need for stand-alone security, monitoring, and forensics tools

    • View data on-the-fly by workload or by infrastructure

  • Display contextual live event stream from alerts, Kubernetes, containers, policies, and image scanning results

  • Surface entities intelligently based on event count and severity

  • Drill down from Clusters to Nodes and Namespaces

  • Support infrastructure monitoring of multi- and hybrid-cloud environments

  • Expose relevant information based on core operational users:

    • DevOps / Platform Ops

    • Security Analyst

    • Service Owner

Accessing the Overview User Interface

You can access and set the scope of Overview in the Sysdig Monitor UI or with the URL.

Click Overview in the left navigation, then select one of the Kubernetes entities:

About the Overview User Interface

The Overview interface opens to the Clusters Overview page. This section describes the major components of the interface and the navigation options.

Though the default landing page is Clusters Overview, when you have no Kubernetes clusters configured, the Overview tab opens to the Hosts view. In addition, when you reopen the Overview menu, the default view will be your last visited Overview page as it retains the visit history.

Overview Rows

Each row represents a Kubernetes entity: a cluster, node, namespace, or workload. In the screenshot above, each row shows a Kubernetes cluster.

  • Navigating rows is easy

    Click on the Overview icon in the left navigation and choose an Overview page, or drill down into the next Overview page to explore the next granular level of data. Each Overview page shows 10 rows by default and a maximum of 100 rows. Click Load More to display additional rows if there are more than 10 rows per page.

  • Ability to select a specific row in an Overview page

    Each row contains the scope of the relevant entity that it is showing data for. Clicking a specific row deselects the rest of the rows (for instance, selecting staging deselects all other rows in the screenshot above), letting you focus on the scope of the selected entity, including the events scoped to that row. Pausing to focus on a single row provides a snapshot of what is currently going on with the entity under purview.

  • Entities are ranked according to the severity and the number of events detected in them

    Rows are sorted by the count and severity level of the events associated with the entity and are displayed in descending order. The items with the highest number of high severity events are shown first, followed by medium, low, and info. This organization helps to highlight events demanding immediate attention and to streamline troubleshooting efforts, in environments that may include thousands of entities.

Scope Editor

The Scope Editor allows you to target a specific entity, such as a particular workload or namespace, in environments that may include thousands of entities. The levels of scope, determined by the Kubernetes hierarchy, progress from Workload up to Cluster, with Cluster at the top level. In smaller environments, using the Scope Editor is equivalent to clicking a single row in an Overview page where no scope has been applied.

Cluster: The highest level in the hierarchy. The only scope applied to the page is Cluster. It allows you to select a specific cluster from a list of available ones.

Node: The second level in the hierarchy. The scope is determined by Cluster and Node. Selection is narrowed down to a specific node in a selected cluster.

Namespace: The third level in the hierarchy. The scope is determined by Cluster and Namespace. Selection is narrowed down to a specific namespace in a selected cluster.

Workloads: The last entity in the hierarchy. The scope is initially determined by Cluster and Namespace, then the selection is narrowed to a specific Deployment, Service, or StatefulSet. You cannot choose all three options at once.

Time Navigation

The Overview feature is organized around time. Sysdig Monitor polls the infrastructure data every 10 minutes and refreshes the metrics and events on the Overview page with the system health. The time range is fixed at 12 hours. However, the gauge and compliance score widgets display the latest data sample, not an aggregation over the entire 12-hour time range.

The Overview feed is always live and cannot be paused.

Unified Stream of Events

The right panel of Overview provides a context-sensitive events feed.

Click an overview row to see relevant Events on the right. Each event is intelligently populated with end-to-end metadata to give context and enable troubleshooting.

Event Types

Overview renders the following event types:

  • Alert: See Alerts.

  • Custom: Ensure that Custom labels are enabled to view this type of event.

  • Containers: Events associated with containers.

  • Kubernetes: Events associated with Kubernetes infrastructure.

  • Scanning: See Image Scanning.

  • Policy: See Policies.

Event Statuses

Overview renders the following alert-generated event statuses:

  • Triggered: The alert condition has been met and still persists.

  • Resolved: A previously existing alert condition no longer persists.

  • Acknowledged: The event has been acknowledged by the intended recipient.

  • Un-acknowledged: The event has not been acknowledged by an intended recipient. All events are by default marked as Un-acknowledged.

  • Silenced: The alert event has been silenced for a specified scope. No alert notification will be sent out to the channels during the silenced window.

General Guidelines

First-Time Usage

  • If the environment is created for the first time, Sysdig Monitor fetches data and generates the associated pages. The Overview feature is enabled immediately; however, it can take up to 1 hour for the Overview pages to show the necessary data.

  • Overview uses time windows of 1H, 6H, and 1D; you therefore need to wait 1H, 6H, and 1D respectively before data appears for each window on the Overview pages.

  • If enough data is not yet available, the “No Data Available” page is presented until the first 1 hour passes.

Tuning Overview Data

Sysdig Monitor leverages a caching mechanism to fetch pre-computed data for the Overview screens.

If pre-computed data is unavailable, Overview fetches non-computed data, which must be calculated before it is displayed; this additional computation adds delays. Caching is enabled for Overview, but for optimum performance you must wait for the 1H, 6H, and 1D windows the first time you use Overview. After the specified time has passed, the data is automatically cached with every passing minute.

Enabling Overview for On-Prem Deployments

The Overview feature is not available by default on On-Prem deployments. Use the following API to enable it:

  1. Get the Beta settings as follows:

    curl -X GET 'https://<Sysdig URL>/api/on-prem/settings/overviews' \
    -H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
    -H 'X-Sysdig-Product: SDC' -k
    

    Replace <Sysdig URL> with the Sysdig URL associated with your deployment and <GLOBAL_SUPER_ADMIN_SDC_TOKEN> with the SDC token associated with your deployment.

  2. Copy the payload and change the desired values in the settings.

  3. Update the settings as follows:

    curl -X PUT 'https://<Sysdig URL>/api/on-prem/settings/overview' \
    -H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
    -H 'X-Sysdig-Product: SDC' \
    -d '{  "overviews": true,  "eventScopeExpansion": true}'
    

Feature Flags

  • overviews: Set overviews to true to enable the backend components and the UI.

  • eventScopeExpansion: Set eventScopeExpansion to true to enable scope expansion for all the Event types.

2.1.1 - Clusters Data

This topic discusses the Clusters Overview page and helps you understand its gauge charts and the data displayed on them.

About Clusters Overview

In Kubernetes, a pool of nodes combines its resources to form a more powerful machine: a cluster. The Clusters Overview page provides key metrics indicating the health, risk, capacity, and compliance of each cluster. Your clusters can reside in any cloud or multi-cloud environment of your choice.

Each row in the Clusters page represents a cluster. Clusters are sorted by the severity of corresponding events in order to highlight the area that needs attention. For example, a cluster with high severity events is bubbled up to the top of the page to highlight the issue. You can further drill down to the Nodes or Namespaces Overview page for investigating at each level.

In environments where Sysdig Secure is not enabled, Network I/O is shown instead of the Compliance score.

Interpret the Cluster Data

This topic gives insight into the metrics displayed on the Clusters Overview screen.

Node Ready Status

The chart shows the latest value returned by avg(min(kubernetes.node.ready)).

What Is It?

The number shows the readiness for nodes to accept pods across the entire cluster. The numeric availability indicates the percentage of time the nodes are reported as ready by Kubernetes. For example:

  • 100% is displayed when 10 out of 10 nodes are ready for the entire time window, say, for the last one hour.

  • 95% is displayed when 9 out of 10 nodes are ready for the entire time window and one node is ready only for 50% of the time.

The bar chart displays the trend across the selected time window, and each bar represents a time slice. For example, selecting the last 1-hour window displays 6 bars, each indicating a 10-minute time slice. Each bar represents the availability across the time slice (green) or the unavailability (red).

For instance, the following image shows an average availability of 80% across the last 1-hour, and each 10-minute time slice shows a constant availability for the same time window:

What to Expect?

Expect a constant 100% at all times.

What to Do Otherwise?

If the value is less than 100%, determine whether a node is not available at all, or one or more nodes are partially available.

  • Drill down either to the Nodes screen in Overview or to the “Kubernetes Cluster Overview” in Explore to see the list of nodes and their availability.

  • Check the Kubernetes Node Overview dashboard in Explore to identify the problem that Kubernetes reports.

Pods Available vs Desired

The chart shows the latest value returned by sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count)).

What Is It?

The chart displays the ratio between available and desired pods, averaged across the selected time window, for all the pods in a given Cluster. The upper bound shows the number of desired pods in the Cluster.

For instance, the following image shows 42 desired pods are available to use:

What to Expect?

You should typically expect 100%.

If certain pods take a long time to become available, you might temporarily see a value that is less than 100%. Pulling images, pod initialization, readiness probes, and so on cause such delays.

What to Do Otherwise?

Identify one or more Namespaces that have lower availability. To do so, drill down to the Namespaces screen, then drill down to the Workloads screen to identify the unavailable pods.

If the number of unavailable pods is considerably higher (the ratio is significantly low), check the status of the Nodes. A Node failure will cause several pods to become unavailable across most of the Namespaces.

Several factors could cause the pods to get stuck in the Pending state:

  • Pods make requests for resources that exceed what’s available across the nodes (the remaining allocatable resources).

  • Pods make requests higher than the availability of every single node. For example, you have 8-core Nodes and you create a pod with a 16-core request. These pods might require reconfiguration and specific setup related to Node affinity and anti-affinity constraints.

  • The quota set at the Namespace level is reached. A minimal example of such a quota is sketched after this list.

    If a quota is enforced at the Namespace level, you may hit the limit independent of the resource availability across the Nodes.
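
The two hypothetical manifests below sketch the last two causes: a ResourceQuota that caps total requests in a Namespace, and a pod whose 16-core CPU request can never fit on an 8-core node.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                 # hypothetical
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"             # total CPU requests allowed in the namespace
    requests.memory: 64Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: oversized-pod              # hypothetical
  namespace: team-a
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0
      resources:
        requests:
          cpu: "16"                # no 8-core node can ever satisfy this request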

CPU Requests vs Allocatable

The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores)).

What Is It?

The chart displays the ratio between CPU requests configured for all the pods in a selected Cluster and allocatable CPUs across all the nodes.

The upper bound shows the number of allocatable CPU cores across all the nodes in the Cluster.

For instance, the image below shows that out of 620 available CPU cores across all the nodes (allocatable CPUs), 71% is requested by the pods:

What to Expect?

Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Keeping requests within this bound protects you against a node becoming unavailable.

What to Do Otherwise?

A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

  • Applications cannot be scheduled to be run.

  • Pods might not start and remain in a Pending/Unscheduled state.

To triage, do the following:

  • Drill down to the Nodes screen to get insights into how resources are utilized across all nodes.

  • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

  • Drill down to Explore and refer to the following dashboards:

    • Kubernetes CPU Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

    • Kubernetes Workloads CPU Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

Can the Value Be Higher than 100%?

Currently, the ratio accounts only for scheduled pods, while pending pods are excluded from the calculation. This means the pods counted have been scheduled to run on Nodes out of the allocatable resources available. Consequently, the ratio cannot be higher than 100%.

In the case of over-commitment (pods requesting for more resources than what’s available), you can expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. What it indicates is that most of the available resources are being used, and what’s left is not enough to schedule additional pods. Therefore, the Available vs Desired ratio for pods will decrease.

When your environment has pods that are updated often or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they are at any given time. Consequently, the ratio becomes higher across the selected time window, and you might see a value that is higher than 100%. This error is rendered due to how the data engine calculates the aggregated ratio.

Drill down to Kubernetes Cluster Overview to see the CPU Cores Usage vs Requests vs Allocatable time series to correctly evaluate the trend of the request commitments.

Listed below are some of the factors that could cause the pods to get stuck in a Pending state:

  • Pods make requests that exceed what’s available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

  • Pods make requests that are higher than the availability of every single Node. For example, you have 8-core Nodes and you create a pod with a 16-core request. These pods might require reconfiguration and specific setup related to Node affinity and anti-affinity constraints.

  • The quota set at the Namespace level is reached, even though resources are still available across the Nodes. The Requests vs Allocatable ratio may not suggest the problem, but the Pods Available vs Desired ratio would decrease, especially for the affected Namespaces. See the Namespaces screen in Overview.

Memory Requests vs Allocatable

The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes)).

What Is It?

The chart displays the ratio between memory requests configured for all the pods in the Cluster and allocatable memory available across all the Nodes.

The upper bound shows the allocatable memory available across all Nodes. The value is expressed in bytes, displayed in a specified unit.

For instance, the image below shows that out of 29.7 GiB available across all Nodes (allocatable memory), 35% is requested by the pods:

What to Expect?

Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Keeping requests within this bound protects your system against a node becoming unavailable.

What to Do Otherwise?

A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

  • Applications cannot be scheduled to be run.

  • Pods might not start and remain in a Pending/Unscheduled state.

To troubleshoot, do the following:

  • Drill down to the Nodes screen to get insights into how resources are utilized across all the Nodes.

  • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

  • Drill down to Explore and refer to the following dashboards:

    • Kubernetes Memory Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

    • Kubernetes Workloads Memory Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

Can the Value be Higher than 100%?

The ratio currently accounts only for scheduled pods, while pending pods are excluded from the calculation. What this implies is that pods have been scheduled to run on Nodes out of the allocatable resources available. Consequently, the ratio cannot be higher than 100%.

In the case of over-commitment (pods requesting for more resources than what’s available), expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. What it indicates is that most of the available resources have been used and what’s left is not enough to schedule additional pods. Therefore, the Pods Available vs Desired ratio will decrease.

When your environment has pods that are updated often or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they are at any given time. Consequently, the ratio becomes higher across the selected time window, and you might see a value that is higher than 100%. This error is rendered due to how the data engine calculates the aggregated ratio.

Drill down to Kubernetes Cluster Overview to see the Memory Requests vs Allocatable time series to correctly evaluate the trend for the request commitments.

Listed below are some of the factors that could cause your pods to get stuck in a Pending state:

  • Pods make requests that exceed what’s available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

  • Pods make requests that are higher than the availability of every single Node. For example, you have 8-core nodes and you create a pod with a 16-core request. These pods might require configuration changes and specific setup related to node affinity and anti-affinity factors.

  • The quota set at the Namespace level is reached, even though resources are still available across the Nodes. The Requests vs Allocatable ratio might not suggest the problem, but the Pods Available vs Desired ratio would decrease, especially for the affected Namespaces. See the Namespaces screen in Overview.

Compliance Score

Docker: The latest value returned by avg(avg(compliance.docker-bench.pass_pct)).

Kubernetes: The latest value returned by avg(avg(compliance.k8s-bench.pass_pct)).

What Is It?

The numbers show the percentage of benchmarks that succeeded in the selected time window, respectively for Docker and Kubernetes entities.

What to Expect?

If you do not have Sysdig Secure enabled, or you do not have benchmarks scheduled, then you should expect no data available.

Otherwise, the higher the score, the more compliant your infrastructure is.

What to Do Otherwise?

If the score is lower than expected, drill down to Docker Compliance Report or Kubernetes Compliance Report to see further details about benchmark checks and their results.

You may also want to use the Benchmarks / Results page in Sysdig Secure to see the history of checks.

2.1.2 - Nodes Data

This topic discusses the Nodes Overview page and helps you understand its gauge charts and the data displayed on them.

About Nodes Overview

A node refers to a worker machine in Kubernetes. A physical machine or VM can represent a node. The Nodes Overview page provides key metrics indicating the health, capacity, and compliance of each node in your cluster.

In environments where Sysdig Secure is not enabled, Network I/O is shown instead of the Compliance score.

Interpret the Nodes Data

This topic gives insight into the metrics displayed on the Nodes Overview page.

Node Ready Status

The chart shows the latest value returned by avg(min(kubernetes.node.ready)).

What Is It?

The number expresses the Node readiness to accept pods across the Cluster. The numeric availability indicates the percentage of time the Node is reported ready by Kubernetes. For example:

  • 100% is displayed when a Node is ready for the entire time window, say, for the last one hour.

  • 95% when the Node is ready for 95% of the time window, say, 57 out of 60 minutes.

The bar chart displays the trend across the selected time window, and each bar represents a time slice. For example, selecting “last 1 hour” displays 6 bars, each indicating a 10-minute time slice. Each bar shows the availability across the time slice (green) and the unavailability (red).

For instance, the image below indicates the Node has not been ready for the entire last 1-hour time window:

What to Expect?

The chart should show a constant 100% at all times.

What to Do Otherwise?

If the number is less than 100%, review the status reported by Kubernetes. Drill down to the Kubernetes Node Overview Dashboard in Explore to see details about the Node readiness:

If the Node Ready Status alternates, as shown in the image, the node is flapping. Flapping indicates that the kubelet is not healthy. Review the specific conditions reported by Kubernetes to determine why the Node is not ready; such conditions include network issues and memory pressure.
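
For illustration, these conditions appear in the node’s status (for example, in the output of kubectl get node <name> -o yaml). The excerpt below is a hypothetical sketch of what a not-ready, memory-pressured node can report.

status:
  conditions:
    - type: Ready
      status: "False"
      reason: KubeletNotReady
    - type: MemoryPressure
      status: "True"
      reason: KubeletHasInsufficientMemory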

Pods Ready vs Allocatable

The chart reports the latest value of sum(avg(kubernetes.pod.status.ready)) / avg(avg(kubernetes.node.allocatable.pods)).

What Is It?

It is the ratio between ready pods and allocatable pods configured on the node, averaged across the selected time window.

The Clusters page includes a similar chart named Pods Available vs Desired. However, the meaning is different:

  • The Pods Available vs Desired chart for Clusters highlights how many pods you expect and how many are actually available. See IsPodAvailable for a detailed definition.

  • The Pods Ready vs Allocatable chart for Nodes indicates how many pods can be scheduled on each Node and how many are actually ready.

The upper bound shows the number of pods you can allocate in the node. See node configuration.

For instance, the image below indicates that you can allocate 110 pods in the Node (default configuration), but only 11 pods are ready:

What to Expect?

The ratio does not relate to resource utilization; rather, it measures the pod density on each node. The more pods you have on a single node, the more effort the kubelet has to put in to manage the pods, the routing mechanism, and Kubernetes overall.

Provided the allocatable value is properly set, values lower than 80% indicate a healthy status.

What to Do Otherwise?

  • Review the default maximum pods configuration of the kubelet to allow more pods, especially if CPU and memory utilization is healthy (see the sketch after this list).

  • Add more nodes to allow more pods to be scheduled.

  • Review kubelet process performance and Node resource utilization in general. A higher ratio indicates high pressure on the operating system and on Kubernetes itself.
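
A minimal sketch of raising the kubelet pod limit through a KubeletConfiguration file is shown below; how this file is delivered to the kubelet depends on your cluster setup, and managed Kubernetes services expose the setting differently.

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 150        # raise the default per-node limit of 110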

CPU Requests vs Allocatable

The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores)).

What Is It?

The chart shows the ratio between the number of CPU cores requested by the pods scheduled on the Node and the number of cores available to pods. The upper bound shows the CPU cores available to pods, which corresponds to the user-defined configuration for allocatable CPU.

For instance, the image below shows that the Node has 16 CPU cores available, out of which, 84% are requested by the pods scheduled on the Node:

What to Expect?

Expect a value up to 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Keeping requests within this bound protects your system against a Node becoming unavailable.

What to Do Otherwise?

  • A low ratio indicates the Node is underutilized. Drill up to the corresponding cluster in the Clusters page to determine whether the number of pods currently running is lower than expected, or whether the pods cannot run for other reasons.

  • A high ratio indicates a potential risk of being unable to schedule additional pods on the Node.

    Drill down to the Kubernetes Node Overview Dashboard to evaluate what Namespaces, Workloads, and pods are running. Additionally, drill up in the Clusters page to evaluate whether you are over-committing the CPU resource. You might not have enough resources to fulfill requests, and consequently, pods might not be able to run on the Node. Consider adding Nodes or replacing Nodes with ones that have additional CPU cores.

Can the Value Be Higher than 100%?

Kubernetes schedules pods on Nodes where sufficient allocatable resources are available to fulfill the pod request. This means Kubernetes does not allow having a total request higher than the allocatable. Consequently, the ratio cannot be higher than 100%.

Over-committing (pods requesting resources higher than the capacity) results in a high Requests vs Allocatable ratio and a low Pods Available vs Desired ratio at the Cluster level. What it indicates is that most of the available resources are being used, consequently, what’s available is not sufficient to schedule additional pods. Therefore, Pods Available vs Desired ratio will also decrease.

Memory Requests vs Allocatable

The chart highlights the latest value returned by sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes)).

What Is It?

The ratio between the number of bytes of memory requested by the pods scheduled on the node and the number of bytes of memory available. The upper bound shows the memory available to pods, which corresponds to the user-defined allocatable memory configuration.

For instance, the image below indicates the node has 62.8 GiB of memory available, out of which, 37% is requested by the pods scheduled on the Node:

What to Expect?

A healthy ratio falls under 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Keeping requests within this bound protects your system against a node becoming unavailable.

What to Do Otherwise?

  • A low ratio indicates that the Node is underutilized. Drill up to the corresponding cluster in the Clusters page to determine whether the number of pods running is lower than expected, or whether pods cannot run for other reasons.

  • A high ratio indicates a potential risk of being unable to schedule additional pods on the node.

    • Drill down to the  Kubernetes Node Overview dashboard to evaluate what Namespaces, Workloads, and pods are running.

    • Additionally, drill up in the Clusters page to evaluate whether you are over-committing the memory resource. If so, you don’t have enough resources to fulfill requests, and pods might not be able to run. Consider adding nodes or replacing nodes with more memory.

Can the Value be Higher than 100%?

Kubernetes schedules pods on nodes where sufficient allocatable resources are available to fulfill the pod request. This means Kubernetes does not allow having a total request higher than the allocatable. Consequently, the ratio cannot be higher than 100%.

Over-committing (pods requesting more resources than are available) results in a high Requests vs Allocatable ratio at the Nodes level and a low Pods Available vs Desired ratio at the Cluster level. What it indicates is that most of the resources are being used; consequently, what’s available is not sufficient to schedule additional pods. Therefore, the Pods Available vs Desired ratio will also decrease.

Network I/O

The chart shows the latest value returned by avg(avg(net.bytes.total)).

What Is It?

The sparkline shows the trend of network traffic (inbound and outbound) for a Node. The number indicates the most recent traffic rate, expressed in bytes per second.

For reference, the sparklines show the following number of steps (sampling):

  • Last hour: 6 steps, each for a 10-minute time slice

  • Last 6 hours: 12 steps, each for a 30-minute time slice

  • Last day: 12 steps, each for a 2-hour time slice

What to Expect?

The metric depends heavily on the type of applications running on the Node. You should expect some network activity for Kubernetes-related operations.

Drilling down to the Kubernetes Node Overview Dashboard in Explore will provide additional details, such as network activity across pods.

2.1.3 - Namespaces Data

This topic discusses the Namespaces Overview page and helps you understand its gauge charts and the data displayed on them.

About Namespaces Overview

Namespaces are virtual clusters on a physical cluster. They provide logical separation between the teams and their environments. The Namespaces Overview page provides key metrics indicating the health, capacity, and performance of each Namespace in your cluster.

Interpret the Namespaces Data

This topic gives insight into the metrics displayed on the Namespaces Overview screen.

Pod Restarts

The chart highlights the latest value returned by avg(timeAvg(kubernetes.pod.restart.rate)).

What Is It?

The sparkline shows the trend of pod restarts rate across all the pods in a selected Namespace. The number shows the most recent rate of restarts per second.

For instance, the image shows a rate of 0.04 restarts per second for the last 2 hours, given that the selected time window is one day. The trend also suggests a non-flat pattern (periodic crashes).

For reference, the sparklines show the following number of steps (sampling):

  • Last hour: 6 steps, each for a 10-minute time slice

  • Last 6 hours: 12 steps, each for a 30-minute time slice

  • Last day: 12 steps, each for a 2-hour time slice

What to Expect?

Expect 0 restarts for any pod.

What to Do Otherwise?

A few restarts across the last one hour or larger time windows might not indicate a serious problem. In the event of a restart loop, identify the root cause as follows:

  • Drill down to the Workloads page in Overview to identify the Workloads that have been stuck at a restart loop.

  • Drill down to the Kubernetes Namespace Overview to see a detailed trend broken down by pods:

Pods Available vs Desired

The chart shows the latest value returned by sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count)).

What Is It?

The chart displays the ratio between available and desired pods, averaged across the selected time window, in a given Namespace.

The upper bound shows the number of desired pods in the namespace.

For instance, the image below shows 42 desired pods that are available:

What to Expect?

Expect 100% on the chart.

If certain pods take a significant amount of time to become available due to delays (image pull time, pod initialization, readiness probe) you might temporarily see a ratio lower than 100%.

What to Do Otherwise?

  • Identify one or more Workloads that have low availability by drilling down to the Workloads page.

  • Once you identify the Workload, drill down to the related dashboard in Explore. For example, Kubernetes Deployment Overview to determine the trend and the state of the pods.

    For instance, in the following image, the ratio is 98% (3.93 / 4 x 100). The decline is due to an update that caused pods to be terminated and consequently to be started with a newer version.

CPU Used vs Requests

The chart shows the latest value returned by sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores)).

What Is It?

The chart shows the ratio between the total CPU usage across all the pods in the Namespace and the total CPU requested by all the pods.

The upper bound shows the total CPU requested by all the pods. The value is expressed as the number of CPU cores.

For instance, the image below shows that the pods in a Namespace request 40 CPU cores, of which only 43% is being used (about 17 cores):

What to Expect?

The value you see depends on the type of Workloads running in the Namespace.

Typically, values that fall between 80% and 120% are considered healthy. Values higher than 100% are considered healthy only for a relatively short amount of time.

For applications whose resource usage is constant (such as background processes), expect the ratio to be close to 100%.

For “bursty” applications, such as an API server, expect the ratio to be less than 100%. Note that this value is averaged for the selected time window, therefore, a usage spike would be compensated by an idle period.

What to Do Otherwise?

A low usage indicates that the application is not properly running (not executing the expected functions) or the Workload configuration is not accurate (requests are too high compared to what the pods actually need).

A high usage indicates that the application is operating with a heavy load or the workload configuration is not accurate (requests are too low compared to what pods actually need).

In either case, drill down to the Workloads page to determine the workload that requires a deeper analysis.

Can the Value Be Higher than 100%?

Yes, it can.

  • You can configure requests without limits, or requests lower than the limits. In either case, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

  • Consider a Namespace with two Workloads of one pod each. Say one Workload is configured to request 1 CPU core and uses 1 CPU core (its Used vs Requests ratio is 100%), while the other Workload is configured without any request and uses 1 CPU core. In this example, the Namespace-level ratio is 2 CPU cores used to 1 CPU core requested, that is, 200% (sketched below).
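
A sketch of that two-workload example, with hypothetical names: the first container requests 1 CPU core, the second declares no request at all, so its usage inflates the Namespace-level Used vs Requests ratio.

apiVersion: v1
kind: Pod
metadata:
  name: workload-a                 # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      resources:
        requests:
          cpu: "1"                 # requested and fully used: 100% for this workload
---
apiVersion: v1
kind: Pod
metadata:
  name: workload-b                 # hypothetical
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0
      # no requests declared; this container's usage still counts in the numerator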

Memory Used vs Requests

The chart shows the latest value returned by sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes)).

What Is It?

The chart shows the ratio between the total memory usage across all pods of the Namespace and the total memory requested by all pods.

The upper bound shows the total memory requested by all the pods, expressed in a specified unit for bytes.

For instance, the image below shows that all the pods in the Namespace request 120 GiB, of which only 24% is being used (about 29 GiB):

What to Expect?

It depends on the type of Workloads you run in the Namespace. Typically, values that fall between 80% and 120% are considered healthy.

Values higher than 100% are considered normal for a relatively short amount of time.

What to Do Otherwise?

A low usage indicates the application is not properly running (not executing the expected functions) or the workload configuration is not accurate (high requests compared to what the pods actually need).

A high usage indicates the application is operating with a high load or the Workload configuration is not accurate (Fewer requests compared to what the pods actually need).

Given the configured limits for the Workloads and the memory pressure on the nodes, if the Workloads use more memory than what’s requested they are at risk of eviction. See Exceed a Container’s Limit for more information.

In both cases, you may want to drill down to the Workloads page to determine which Workload requires a deeper analysis.

Can the Value Be Higher than 100%?

Yes, it can.

  • You can configure requests without limits, or requests lower than the limits. In either case, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

  • Consider a Namespace with two Workloads of one pod each. Say one Workload is configured to request 1 GiB of memory and uses 1 GiB (its Used vs Requests ratio is 100%), while the other Workload is configured without any request and uses 1 GiB. In this example, the Namespace-level ratio is 2 GiB of memory used to 1 GiB requested, that is, 200%.

Network I/O

The chart shows the latest value returned by avg(avg(net.bytes.total)).

What Is It?

The sparkline shows the trend of network traffic (inbound and outbound) for all the pods in the Namespace. The number shows the most recent rate, expressed in bytes per second.

For reference, the sparklines show the following number of steps (sampling):

  • Last hour: 6 steps, each for a 10-minute time slice

  • Last 6 hours: 12 steps, each for a 30-minute time slice

  • Last day: 12 steps, each for a 2-hour time slice

What to Expect?

The type of applications running in the Namespace determines the metric values. Drilling down to the Kubernetes Namespace Overview Dashboard in Explore provides additional details, such as network activity across pods.

2.1.4 - Workloads Data

This topic discusses the Workloads Overview page and helps you understand its gauge charts and the data displayed on them.

About Workloads Overview

Workloads, in Kubernetes terminology, refer to your containerized applications. Workloads comprise Deployments, StatefulSets, and DaemonSets within a Namespace.

In a Cluster, worker nodes run your application workloads, whereas the master node provides the core Kubernetes services and orchestration for application workloads. The Workloads Overview page provides the key metrics indicating health, capacity, and compliance.

Interpret the Workloads Data

This topic gives insight into the metrics displayed on the Workloads Overview page.

Pod Restarts

The chart displays the latest value returned by sum(timeAvg(kubernetes.pod.restart.rate)).

What Is It?

The sparkline shows the trend of Pod Restarts rate across all the pods in a selected Workload. The number shows the most recent rate, expressed in Restarts per Second.

For instance, the image below shows the trend for the last hour. The number indicates that the rate of pod restarts is less than 0.01 for the last 10 minutes.

For reference, the sparklines show the following number of steps (sampling):

  • Last hour: 6 steps, each for a 10-minute time slice.

  • Last 6 hours: 12 steps, each for a 20-minute time slice.

  • Last day: 12 steps, each for a 2-hour time slice.

What to Expect?

A healthy pod will have 0 restarts at any given time.

What to Do Otherwise?

In most cases, a few restarts in the last hour (or larger time windows) do not indicate a serious problem. Drill down to the Kubernetes Overview Dashboard related to the Workload in Explore. For example, Kubernetes StatefulSet Overview provides a detailed trend broken down by pods.

In this example, the number of restarts is constant (roughly every 5 minutes) and no pods are ready. This might indicate a CrashLoopBackOff.

Pods Available vs Desired

The chart shows the latest value returned by sum(avg(kubernetes.deployment.replicas.available)) / sum(avg(kubernetes.deployment.replicas.desired)).

What Is It?

The chart displays the ratio between available and desired pods, averaged across the selected time window, for all the pods in a given Workload.

The upper bound shows the number of desired pods in the Workload.

For instance, the image below shows that all 42 desired pods are available.

What to Expect?

You should typically expect 100%.

If certain pods take a significant amount of time to become available (image pull time, pod initialization, readiness probe), then you may temporarily see a ratio lower than 100%.

What to Do Otherwise?

Determine the Workloads that have low availability by drilling down to the related Dashboard in Explore. For example, the Kubernetes Deployment Overview helps understand the trend and the state of the pods.

For instance, the image above shows that the ratio is 98% (3.93 / 4 x 100). The slight decline is due to an update that caused pods to be terminated and consequently to be started with a newer version.

CPU Used vs Requests

The chart shows the latest value returned by sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores)).

What Is It?

The chart shows the ratio between the total CPU usage across all pods of a selected Workload and the total CPU requested by all the pods.

The upper bound shows the total CPU requested by all the pods. The value denotes the number of CPU cores.

In this image, the pods in the Workload request 40 CPU cores, of which 43% is actually used (about 17 cores).

What to Expect?

It depends on the type of workload.

For applications (background processes) whose resource usage is constant, expect the ratio to be around 100%.

For “bursty” applications, such as an API server, expect the ratio to be lower than 100%. Note that the value is averaged for the selected time window, therefore, a usage spike would be compensated by an idle period.

Generally, values between 80% and 120% are considered normal. Values higher than 100% are deemed normal if observed only for a relatively short time.

What to Do Otherwise?

  • A low usage indicates that the application is not properly running (not executing the expected functions) or the Workload configuration is not accurate (requests are too high compared to what the pods actually need).

  • A high usage indicates that the load is high for applications or the Workload configuration is not accurate (low requests compared to what the pods actually need).

In either case, drill down to the Kubernetes Overview Dashboard corresponding to the Workload in Explore. For example, the Kubernetes Deployment Overview Dashboard provides insight into resource usage and configuration.

Can the Value Be Higher than 100%?

Yes, it can.

  • Configuring CPU requests without limits or requests lower than limits is permissible. In these cases, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

  • Consider a Workload with two containers. Say one container is configured to request 1 CPU core and uses 1 CPU core (its Used vs Requests ratio is 100%), while the other is configured without any request and uses 1 CPU core. In this example, the 2 CPU cores used to 1 CPU core requested ratio is 200% at the Workload level.

What Does “No Data” Mean?

If the Workload is configured with no requests and limits, then the Usage vs Requests ratio cannot be computed. In this case, the chart will show “no data”. Drill down to the Dashboard in Explore to evaluate the actual usage.

You must always configure requests. Setting requests helps to detect Workloads that require reconfiguration.

Kubernetes itself might expose Workloads with no requests or limits configured. For example, the kube-system Namespace can have Workloads without requests configured.

Memory Used vs Requests

The chart shows the latest value returned by sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes)).

What Is It?

The chart shows the ratio between the total memory usage across all the pods in a Workload and the total memory requested by the Workload.

The upper bound shows the total memory requested by all the pods, expressed in the specified unit of bytes.

For instance, the image shows that the pods in the selected Workload requested 120 GiB, of which 24% is actually used (about 29 GiB).

What to Expect?

The type of Workload determines the ratio. Values between 80% and 120% are considered normal. Values higher than 100% are considered normal only if they are observed for a relatively short time.

What to Do Otherwise?

A low memory usage indicates that the application is not running properly (not executing the expected functions) or that the Workload configuration is inaccurate (requests are too high compared to what the pods actually need).

A high memory usage indicates that the application load is high or that the Workload configuration is inaccurate (requests are too low compared to what the pods actually need).

Depending on the configured limits for the Workloads and the memory pressure on the nodes, Workloads that use more memory than requested are at risk of eviction. For more information, see Container’s Memory Limit.

In either case, drill down to the Workloads page to determine the Workload that requires a deeper analysis.

Can the Value Be Higher than 100%?

Yes, it can.

  • Configuring memory requests without limits, or with requests lower than limits, is permissible. In these cases, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

  • Consider a Workload with two containers. Say, one container is configured to request 1 GiB of memory and uses 1 GiB (Used vs Request ratio is 100%), while the other is configured without any request and uses 1 GiB of memory. In this example, the 2 GiB of memory used to 1 GiB requested ratio is 200% at the Workload level.

What Does “No Data” Mean?

If the Workload is configured with no memory requests and limits, then the Usage vs Requests ratio cannot be computed. In this case, the chart will show “no data”. Drill down to the Dashboard in Explore to evaluate the actual usage.

You must configure requests. It helps to detect Workloads that require reconfiguration.

Kubernetes itself might expose Workloads with no requests or limits configured. For example, the kube-system Namespace can have Workloads without requests configured.

Network I/O

The chart shows the latest value returned by avg(avg(net.bytes.total)).

What Is It?

The sparkline shows the trend of network traffic (inbound and outbound) for the Workload. The number shows the most recent rate, expressed in bytes per second and scaled to an appropriate unit.

For reference, the sparklines show the following number of steps (sampling):

  • Last hour: 6 steps, each for a 10-minute time slice

  • Last 6 hours: 12 steps, each for a 30-minute time slice

  • Last day: 12 steps, each for a 2-hour time slice

What to Expect?

The type of application running in the Workload determines the expected values. Drill down to the Kubernetes Overview Dashboard corresponding to the Workload in Explore. For example, the Kubernetes Deployment Overview Dashboard provides additional details, such as network activity across pods.
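
A hedged PromQL sketch of the same view per Workload. The metric sysdig_container_net_total_bytes is assumed to be the container-level counterpart of the documented sysdig_host_net_total_bytes metric; verify the exact name in Metrics Explorer:

    # Recent network throughput per Workload, in bytes per second (metric name assumed)
    sum by (kubernetes_workload_name) (rate(sysdig_container_net_total_bytes[5m]))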

3 - Explore

About Explore

The Sysdig Monitor user interface centers around the Explore module, where you perform the majority of infrastructure monitoring operations. Sysdig Monitor automatically discovers your stack and presents pre-built views in Metrics Explorer. Explore provides you with the ability to view and troubleshoot key metrics and entities of your infrastructure stack. You can drill down into any layer of your infrastructure hierarchy and view granular-level data. Metrics Explorer allows you to run form-based queries and build infrastructure views by using interactive metric and label filtering.

Grouping controls how entities are organized in Explore. Grouping is fully customizable by logical layers, such as containers, Kubernetes clusters, or services.

In addition to the Explore interface, Sysdig provides a PromQL Query Explorer and PromQL Library. They help you understand metrics and corresponding labels and values clearly, to create queries faster, and to build Dashboard and Alerts easily.

Benefits of Using Explore

  • Explore gives insight into:

    • Metrics and labels associated with your infrastructure.
    • Scope of the metrics. View the list of metrics collected from different parts of the infrastructure, and easily understand the association between a metric and the infrastructure layer it belongs to.
  • Explore allows you to:

    • Access the Metrics Explorer view for All Workloads, Nodes, Containerized Apps, and Hosts & Containers in your environment with one click.
    • Access the PromQL Query Explorer and PromQL Library with one click.
    • Access available data sources with one click. These are the immutable Groupings; clicking one of them has the same effect as selecting a Grouping from the drop-down menu.
    • Use either a form-based query or PromQL to query metrics and build Dashboard panels or create alerts.
    • View the last selected Grouping.

Explore Interface

This section outlines the key areas of the interface and details basic navigation steps.

There are several key areas highlighted in the image above:

  • Switch Products: This allows you to switch between Sysdig products.

  • Grouping: Groupings are hierarchical organizations of tags, allowing users to organize their infrastructure views using the Grouping Wizard in a logical hierarchy. For more information on groupings, refer to Grouping, Scoping, and Segmenting Metrics.

  • Modules: Quick links for each of the main Sysdig Monitor modules: Explore, Dashboards, Alerts, Events, and Captures.

  • PromQL Query Explorer: Run PromQL queries to build your infrastructure views and get an in-depth insight into what’s going on. See PromQL Query Explorer.

  • PromQL Query Library: Provides a set of out-of-the-box PromQL queries. See PromQL Library.

  • Management: Quick links for Sysdig Spotlight, help material, and the user profile configuration settings.

  • Scope Filtering: This allows you to explore deep down the infrastructure stack and retrieve all the components in a certain category in a single organized element.

  • Search Metrics: Helps you select desired metrics and build a query with one click.

  • Time Navigation: Helps you customize the time window used for displaying data.

  • Key Page Actions: Quick links to create alerts and dashboards.

Learn More

Learn more about using Explore in the following sections:

3.1 - Metrics Explorer

Use the Metrics Explorer for advanced metric exploration and querying. In addition to the core functionalities (grouping, scope tree, metrics, and graphing) of Explore, Metrics Explorer provides you the ability to:

  • Graph multiple metrics simultaneously for correlation. For example, CPU usage vs CPU limits.
  • View ungrouped queries by default, showing the individual time series for a metric.
  • View context-specific metrics for a selected scope, so you no longer see empty results for a selected metric.
  • View metrics that are logically categorized with a metric namespace prefix.
  • Display metrics at high resolution. For example, a 1-hour view now shows data at 10-second resolution instead of 1-minute.

About the Metrics Explorer UI

The main components of the Metrics Explorer UI are widgets, time navigation, dashboard, and time series panel.

You’ll find Metrics Explorer on the Explore slider menu on the Sysdig Monitor UI. Click Explore to display the slider.

Use Metrics Explorer

This section helps you drill down into your infrastructure stack for troubleshooting views and create alerts and dashboards by using Metrics Explorer.

Switch Groupings

Sysdig Monitor detects and collects the metrics associated with your infrastructure once the agent is deployed in your environment. Use the Explore UI to search, group, and troubleshoot your infrastructure components.

To switch between available data sources:

  1. On the Metrics Explorer tab, click the My Groupings drop-down menu:

  2. Select the desired grouping from the drop-down list.

Groupings Editor

The Groupings Editor helps you create and manage your infrastructure groupings.

Filter Infrastructure (Scope Filtering)

You can drill down into the infrastructure stack and get insight into the numerous metrics available to you at each level of your stack. These displays can be found by selecting a top-level infrastructure object, then using scope filtering for relevant infrastructure objects and metrics filtering for desired metrics.

Sysdig Monitor displays only the metrics and dashboards that are relevant to the selected infrastructure object.

Metrics

You can view specific metrics for an infrastructure object by navigating the scope filtering and metrics filtering menus:

  1. On the Metrics Explorer tab, open the scope filtering menu.

  2. Select the infrastructure object you want to explore.

  3. Navigate to Filter metrics.

  4. Click the desired metrics.

    The metric will instantly be presented on the form query and on the dashboard. The scope of the metric, when viewed via the scope filtering menu, is set to the infrastructure object that you have selected.

  5. Optionally, click Add Query, then click a metric to add additional queries.

    You can do all the operations, such as setting Time Aggregation, Show Top 50 and Bottom 50 time series, Group Rollup, Segmentation, and Unit of Value Returned by Query, as you use form query. See Building a Form-Based Query for more information.

Create an Alert

  1. Build a form query as described in Metrics.

  2. Click Create Alert.

    If you have built multiple queries, you will be prompted to choose a single metric to be alerted on.

  3. Select the metric you want to create an alert for.

  4. Click Create Alert. The New Metric Alert page will be displayed.

    The group aggregation will be set to the default one for an alert that is created from a query with group aggregation set to none.

  5. Complete creating the alert as described in Metric Alerts.

Create a Dashboard Panel

  1. Build a form query as described in Metrics.

  2. Click Create dashboard panel.

  3. Select an existing dashboard or create a new dashboard by typing in a name.

  4. Click Copy and Open. The newly created dashboard will be displayed.

    The group aggregation will be set to the default one for a dashboard that is created from a query with group aggregation set to none.

  5. Optionally, continue with other operations as described in Managing Panels.

3.1.1 - Groupings Editor

Groupings are hierarchical organizations of labels, allowing you to organize your infrastructure views on the Explore UI in a logical hierarchy.

An example grouping is shown below:

The example above groups the infrastructure into four levels. This results in a tree view in the Groupings Editor with four levels, with rows for each infrastructure object applicable to each level.

As each label is selected, Sysdig Monitor automatically filters out labels for the next selection that no longer fit the hierarchy, to ensure that only logical groupings are created.

Sysdig Monitor automatically organizes all the configured groupings that are inapplicable to the current infrastructure under Inapplicable Groupings.

Manage Groupings

You can perform the following operations using the Groupings Editor:

  • Search existing groupings

  • Create a new grouping

  • Edit an existing grouping

  • Rename a grouping

  • Share a grouping with the active team

Search for a Grouping

  1. Do one of the following:

    • From Explore, click the Groupings drop-down.

      Either select the desired grouping, or search for it by scrolling down the list or by using the search bar, and then select it.

    • Click Manage Groupings and open the Groupings Editor.

      Either select the desired grouping, or search for it by scrolling down the list or by using the search bar, and then select it.

Create a New Grouping

  1. In the Explore tab, click the Groupings drop-down, then click Manage Groupings.

  2. Open the Groupings Editor.

  3. Click Add.

    The New Groupings page is displayed.

  4. Enter the following information:

    • Groupings Name: Set an appropriate name to identify the grouping that you are creating.

    • Shared with Team: Select if you want to share the grouping with the active team that you are part of.

    • Hierarchy: Determine the hierarchical representation of the grouping by choosing a top-level label and subsequent ones. Repeat adding the labels until there are no further layers available in the infrastructure label hierarchy.

      You can search for the label by entering the first few characters in the Select label drop-down or scrolling down. As you add labels, the preview displays associated components in your infrastructure.

  5. Check the preview to ensure that the label selection is correct.

  6. Click Save & Apply.

Rename a Grouping

Renaming is allowed only for groupings that are owned by you. To rename a shared grouping, create a copy of it and edit the name.

  1. Do one of the following:

    • On Explore, click the Groupings drop-down, search for the desired grouping, and click the Edit button next to it.

    • Open the Groupings Editor, select the desired grouping (either scroll down the list or use the search bar), and click Edit.

    The edit window is displayed on the screen.

  2. Specify the new grouping name, then click Save & Apply to save the changes.

Share a Grouping with Your Active Team

Custom groupings are owned by you, and therefore you can share them with all the members of your active team. To share a default grouping, create a custom grouping and use the Shared with Team option in the Grouping Editor.

  1. Click the Groupings drop-down and click Manage Groupings.

    The Grouping Editor screen appears.

  2. Highlight the relevant grouping and click Edit.

  3. Click Shared with Team.

  4. Click Save & Apply to save the changes.

3.1.2 - Time Windows

By default, Sysdig Monitor displays information in Live mode. This means that dashboards, panels, and the Explore views will be automatically updated with new data as time passes, and will display the most recent data available for the configured time window.

By default, time navigation will enter Live mode with an hour time window.

The time window navigation bar provides users with quick links to common time windows, as well as the ability to configure a custom time period in order to review historical data.

As shown in the image above, the navigation bar provides a number of pieces of information:

  • The state of the data (Live or Past).

  • The current time window.

  • The configured timezone.

In addition, the navigation bar provides:

  • Quick links for common time windows

    • Metrics Explorer: five minutes, ten minutes, one hour, six hours, twelve hours, one day, and two weeks.
    • Explore: ten seconds, five minutes, ten minutes, one hour, six hours, one day, and two weeks.
  • A custom time window configuration option.

  • A pause/play button to exit Live mode and freeze the data to a time window, and to return to Live mode.

  • Step back/forward buttons to jump through a frozen time window to review historical data.

  • Zoom in/out buttons to increase/decrease the time window (not applicable to Metrics Explorer).

Configure a Custom Time Period

The Time Navigation drop-down panel can be used to configure a specific time range. To configure a manual range:

Metrics Explorer

  1. On the Metrics Explorer tab, click the custom panel on the time navigation bar.

  2. Configure the start and end points, and click Save to save the changes.

Some limitations apply to custom time windows. Refer to Time Window Limitations for more information.

Explore

  1. On the Explore tab, click CUSTOM on the time navigation bar.

  2. Configure the start and end points, and click Adjust time to save the changes.

Some limitations apply to custom time windows. Refer to Time Window Limitations for more information.

Time Window Limitations

Some time window configurations may not be available in certain situations. In these instances, a modification to the time window is automatically applied, and a warning notification will be displayed:

There are two main reasons for a time window being unavailable. Both relate to data granularity and specificity:

  • The time window specifies the granularity of data that has expired and is no longer available. For example, a time window specifying a one-hour time range from six months ago would not be available, resulting in the time window being modified to a time range of at least one day.

  • The time window specifies a granularity of data that is too high given the size of the window, as a graph can only handle a certain number of data points. For example, a multi-hour time range would contain too many datapoints at one-minute granularity, and would automatically be modified to 10-minute granularity.

3.1.3 - Explore Workflows

While every user has unique needs from Sysdig Monitor, there are three main workflows that you can follow when building out the interface and monitoring your infrastructure.

Workflow One

This workflow assumes that an alert has not been triggered yet.

Start with Explore, identify a problem area, then drill down into the data. This workflow is the most basic approach, as it begins with a user monitoring the overall infrastructure, rather than with a specific alert notification. The workflow typically follows these steps:

  1. Organize the infrastructure with groupings.

  2. Define key signals with alerts and dashboards to detect a problem.

  3. Identify a problem area, and drill down into the data using dashboards, metrics, and by adjusting groupings and scope as necessary.

Workflow Two

Start with an event notification, and begin troubleshooting. This workflow begins with an already configured alert and event being triggered. Unlike workflow one, this workflow assumes that pre-determined data boundaries have already been set:

  1. Explore the event by adjusting time windows, scope, and segmentation.

  2. Identify the exact area of concern within the infrastructure.

  3. Drill down into the data to troubleshoot the issue.

Workflow Three

Customize default dashboard panels to troubleshoot a potential issue. This workflow assumes that an issue has been identified within one of the default dashboards, but alerts have not been set up for the problem area.

  1. Copy the displayed panel to a new dashboard.

  2. Create an alert based on the dashboard panel.

  3. Configure a Sysdig Capture on demand.

3.2 - PromQL Query Explorer

Use the PromQL Query Explorer to run PromQL queries and build infrastructure views. It allows you to:

  • Write PromQL queries faster by automatically identifying the common labels among different metrics.

    See Run PromQL Queries Faster with Extended Label Set.

  • Query metrics by leveraging advanced functions, operators, and boolean logic.

  • Interactively modify the PromQL results by using visual label filtering.

  • Use label filtering to visualize the common labels between metrics, which is key when combining multiple metrics.

About the PromQL Explorer UI

The main components of the PromQL Query Explorer UI include widgets, time navigation, and the dashboard and time series panel.

You’ll find the PromQL Query Explorer under the Explore tab on the Sysdig Monitor UI.

PromQL Query

The PromQL field supports manually building PromQL queries. You can manually enter simple or complex PromQL queries and build dashboards and create alerts. The PromQL Query Explorer allows running up to 5 queries simultaneously. With the query field, you can do the following:

  • Explore metrics and labels available in your infrastructure.

    For example, calculate the number of bytes received in a selected host:

    sysdig_host_net_total_bytes{host_mac="0a:e2:e8:b4:6c:1a"}
    

    Calculate the number of bytes received in all the hosts except one:

    sysdig_host_net_total_bytes{host_mac!="0a:a3:4b:3e:db:a2"}
    

    Compare current data with historical data:

    sysdig_host_net_total_bytes offset 7d
    
  • Use arithmetic operators to perform calculations on one or more metrics or labels.

    For example, calculate the rate of incoming bytes and convert it to bits:

    rate(sysdig_host_net_total_bytes[5m]) * 8
    
  • Build complex PromQL queries.

    For example, return the total network traffic across all the network interfaces, grouped by container:

    sum(rate(sysdig_host_net_total_bytes[5m])) by (container_id)
    

Label Filtering

Label filtering automatically identifies common labels between queries for vector matching. In the given example, you can see that the A and B metrics have only the host_mac label in common.

You can also filter by using the relational operators available in the time series table. Simply click the operator for it to be automatically applied to the queries. Run the queries again to visualize the metrics.

Filtering simultaneously applies to all the queries in the PromQL Query Explorer.
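
For example, the following sketch combines two metrics used elsewhere in this documentation (sysdig_host_net_total_bytes and sysdig_host_cpu_used_percent) on their shared host_mac label. The ratio itself is only illustrative of vector matching, not a recommended KPI:

    # Network bytes per second per CPU-percent point, matched on the common host_mac label
    sum by (host_mac) (rate(sysdig_host_net_total_bytes[5m]))
      / on (host_mac)
    sum by (host_mac) (sysdig_host_cpu_used_percent)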

Widgets

PromQL Query Explorer supports only time series (Timechart). You can run advanced (PromQL) queries and build dashboard panels. PromQL Explorer does not support building form-based queries.

Time Navigation

PromQL Query Explorer is designed around time. After a query has been executed, Sysdig Monitor polls the infrastructure data every 10 seconds and refreshes the metrics on the Dashboard panel. You select how to view this gathered data by choosing a Preset interval and a time Range. For more information, see Time Navigation.

Legend

The legend is positioned on the upper right corner of the panel. Each query will have associated legends listed in the same execution order.

Build a Query

  1. On the Explore tab, click PromQL Query.

  2. Enter a PromQL query manually.

    sysdig_host_cpu_used_percent
    

    Click Add Query to run multiple queries. You can run up to 5 queries at once.

    sysdig_container_cpu_used_percent
    
  3. Click Run Query or press command+Enter.

    A dashboard will appear on the screen. You can either Copy to a Dashboard or Create an Alert.

Copy to a Dashboard

  1. Run a PromQL query.

  2. Click Create > Create a Dashboard Panel.

  3. Either select an existing Dashboard or enter the Dashboard name to copy to a new Dashboard.

  4. Click Copy and Open.

    The new Dashboard panel with the given title will open to the Dashboard tab.

    You might want to continue with the Dashboard operations as given in Dashboards.

Create an Alert

  1. Run a PromQL query.

  2. Click Create > Create Alert.

  3. If you have multiple queries, select the query you want to create the alert for.

    A new PromQL Alert page for the selected query appears on the screen.

    Continue with PromQL Alerts.

Remove a Query

Click the three dots next to the query field to remove the query.

Toggle Query Results

Click the respective query buttons, for example, A or B, to show or hide query results.

3.3 - PromQL Library

PromQL is a powerful language for querying metrics, but it can be challenging for beginners. To ease the learning curve, Sysdig provides a set of curated examples called the PromQL Library. It helps you run complex queries against your metrics with one click and gain insight into infrastructure problems in ways that were not previously possible with Sysdig querying. For example, identifying containers above 90% of their limit, counting pods per namespace, and so on.
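
For instance, a minimal sketch of the "counting pods per namespace" idea, using the kube_pod_container_status_running metric listed later in this documentation. The exact shape of the curated query may differ; this is only an illustration:

    # Distinct pods per namespace, derived from container status series (illustrative)
    count by (kubernetes_namespace_name) (
      count by (kubernetes_namespace_name, kubernetes_pod_name) (kube_pod_container_status_running)
    )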

You have the following categories currently to experiment with PromQL:

  • Kubernetes

  • Infrastructure

  • Troubleshooting

  • PromQL 101

Access PromQL Library

  1. Log in to Sysdig Monitor.

  2. Click Explore from the left navigation pane.

  3. On the Explore tab, click PromQL Library.

    The tab opens to a list of PromQL examples.

Use PromQL Library

Click Try me to open the PromQL Query Explorer. A visualization corresponding to the query will be displayed. You can do the following with the query:

  • Create a dashboard panel

  • Create an alert

See PromQL Query Explorer for more information.

To copy a query, click the copy icon next to the query.

Filter PromQL Queries

Automatic tag filtering identifies common tags in the given examples. You can use the following to filter queries:

  • Visual label filtering: Simply click the desired color-coded label to filter queries based on tags.

  • Text search: Use the Text Search bar on the top-left navigation pane.

  • Label search: Use the Label drop-down list on the top-left navigation pane.

  • Filter using categories: Use the All Categories checkboxes.

3.4 - (Deprecated) Explore Interface

This section helps you navigate the Explore menu in the Sysdig Monitor UI.

Switch Groupings

Sysdig Monitor detects and collects the metrics associated with your infrastructure once the agent is deployed in your environment. Use the Explore UI to search, group, and troubleshoot your infrastructure components.

To switch between available data sources:

  1. On the Explore tab, click the My Groupings drop-down menu:

  2. Select the desired grouping from the drop-down list.

Groupings Editor

The Groupings Editor helps you create and manage your infrastructure groupings.

Use Drill-Down Menu

Sysdig Monitor users can drill down into the infrastructure by using the numerous dashboards and metrics available for display in the Explore UI. These displays can be found by selecting an infrastructure object, and opening the drill-down menu.

Sysdig Monitor only displays the metrics and dashboards that are relevant to the selected infrastructure object.

Metrics

Sysdig Monitor users can view specific metrics for an infrastructure object by navigating the drill-down menu:

  1. On the Explore tab, open the drill-down menu.

  2. Navigate to Search Metrics and Dashboard.

  3. Select the desired metrics.

    The metric will now be presented on the Explore UI, until the user navigates away from it.

    The scope of the metric, when viewed via the drill-down menu, is set to the infrastructure object that you have selected.

Troubleshooting Views

The drill-down menu displays all the default dashboard templates relevant to the selected infrastructure object. These Troubleshooting Views are broken into the following sections:

The scope of the Troubleshooting View, when viewed via the drill-down menu, is set to the infrastructure object that you have selected from the drill-down.

To navigate to the Troubleshooting Views:

  1. On the Explore tab, select an infrastructure object.

  2. Open the drill-down menu and select the desired infrastructure element.

  3. Navigate to Search Metrics and Dashboard.

  4. Select the desired troubleshooting view.

    The selected dashboard will now be presented on the screen, until you navigate away from it.

Pin and Unpin the Drill-Down Menu

  1. On the Explore tab, select an infrastructure object.

  2. Open the drill-down menu.

  3. Click Pin Menu to pin the menu to the Explore tab.

    To unpin the menu, click Unpin Menu at the bottom of the menu.

4 - Metrics

Metrics are quantitative values or measures that can be grouped/divided by labels.

Sysdig Monitor metrics are divided into two groups: default metrics (out-of-the-box metrics associated with the system, orchestrator, and network infrastructure), and custom metrics (JMX, StatsD, and multiple other integrated application metrics).

Sysdig automatically collects all types of metrics, and auto-labels them. Custom metrics can also have custom (user-defined) labels.

Out of the box, when an agent is deployed on a host, Sysdig Monitor automatically begins collecting and reporting on a wide array of metrics. The sections below describe how those metrics are conceptualized within the system.

In the following sections, you can also learn more about the metric types and the data aggregation techniques supported by Sysdig Monitor:

4.1 - Grouping, Scoping, and Segmenting Metrics

Data aggregation and filtering in Sysdig Monitor are done through the use of assigned labels. The sections below explain how labels work, the ways they can be used, and how to work with groupings, scopes, and segments.

Labels

Labels are used to identify and differentiate characteristics of a metric, allowing them to be aggregated or filtered for Explore module views, dashboards, alerts, and captures. Labels can be used in different ways:

  • To group infrastructure objects into logical hierarchies displayed on the Explore tab (called groupings). For more information, refer to Groupings .

  • To split aggregated data into segments. For more information, refer to Segments.

There are two types of labels:

  • Infrastructure labels

  • Metric descriptor labels

Infrastructure Labels

Infrastructure labels are used to identify objects or entities within the infrastructure that a metric is associated with, including hosts, containers, and processes. An example label is shown below:

Sysdig Notation

kubernetes.pod.name

Prometheus Notation

kubernetes_pod_name

The table below outlines what each part of the label represents:

Example Label Component | Description
kubernetes | The infrastructure type.
pod | The object.
name | The label key.

Infrastructure labels are obtained from the infrastructure (including from orchestrators, platforms, and the runtime processes), and Sysdig automatically builds a relationship model using the labels. This allows users to create logical hierarchical groupings to better aggregate the infrastructure objects in the Explore module.

For more information on groupings, refer to the Groupings.

Metric Descriptor Labels

Metric descriptor labels are custom descriptors or key-value pairs applied directly to metrics, obtained from integrations like StatsD, Prometheus, and JMX. Sysdig automatically collects custom metrics from these integrations, and parses the labels from them. Unlike infrastructure labels, these labels can be arbitrary, and do not necessarily map to any entity or object.

Metric descriptor labels can only be used for segmenting, not grouping or scoping.

An example metric descriptor label is shown below:

website_failedRequests:20|region='Asia', customer_ID='abc'

The table below outlines what each part of the label represents:

Example Label Component | Description
website_failedRequests | The metric name.
20 | The metric value.
region='Asia', customer_ID='abc' | The metric descriptor labels. Multiple key-value pairs can be assigned using a comma-separated list.

Sysdig recommends not using labels to store dimensions with high cardinalities (numerous different label values), such as user IDs, email addresses, URLs, or other unbounded sets of values. Each unique key-value label pair represents a new time series, which can dramatically increase the amount of data stored.

Groupings

Groupings are hierarchical organizations of labels, allowing users to organize their infrastructure views on the Explore tab in a logical hierarchy. An example grouping is shown below:

The example above groups the infrastructure into four levels. This results in a tree view in the Explore module with four levels, with rows for each infrastructure object applicable to each level.

As each label is selected, Sysdig Monitor automatically filters out labels for the next selection that no longer fit the hierarchy, to ensure that only logical groupings are created.

The example below shows the logical hierarchy structure for Kubernetes:

  • Clusters: Cluster > Namespace > Replicaset > Pod

  • Namespace: Cluster > Namespace > HorizontalPodAutoscaler > Deployment > Pod

  • Daemonsets : Cluster > Namespace > Daemonsets > Pod

  • Services: Cluster > Namespace > Service > StatefulSet > Pod

  • Job: Cluster > Namespace > Job > Pod

  • ReplicationController: Cluster > Namespace > ReplicationController > Pod

The default groupings are immutable: They cannot be modified or deleted. However, you can make a copy of them that you can modify.

Unified Workload Labels

Sysdig provides the following labels to help improve your infrastructure organization and make troubleshooting easier.

  • kubernetes_workload_name: Displays all the Kubernetes workloads and indicates what type and name of workload resource (deployment, daemonSet, replicaSet, and so on) it is.

  • kubernetes_workload_type: Indicates what type of workload resource (deployment, daemonSet, replicaSet, and so on) it is.

The availability of these labels also simplifies Groupings. You do not need different groupings for each type of workload resource; instead, you have a single grouping for workloads.

The labels allow you to segment metrics, such as sysdig_host_cpu_cores_used_percent, by kubernetes_workload_name to see CPU core usage for all the workloads, instead of having a separate query for segmenting by kubernetes_deployment_name, kubernetes_replicaset_name, and so on.
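
A minimal PromQL sketch of that segmentation, assuming the metric carries the unified workload labels through Sysdig’s label enrichment:

    # One CPU usage series per Workload, regardless of controller type (illustrative)
    avg by (kubernetes_workload_name, kubernetes_workload_type) (sysdig_host_cpu_cores_used_percent)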

Scopes

A scope is a collection of labels that are used to filter out or define the boundaries of a group of data points when creating dashboards, dashboard panels, alerts, and teams. An example scope is shown below:

In the example above, the scope is defined by two labels with operators and values defined. The table below defines each of the available operators.

Operator | Description
is | The value matches the defined label value exactly.
is not | The value does not match the defined label value exactly.
in | The value is among the comma-separated values entered.
not in | The value is not among the comma-separated values entered.
contains | The label value contains the defined value.
does not contain | The label value does not contain the defined value.
starts with | The label value starts with the defined value.

The scope editor provides dynamic filtering capabilities. It restricts the scope of the selection for subsequent filters by rendering valid values that are specific to the previously selected label. Expand the list to view unfiltered suggestions. At run time, users can also supply custom values to achieve more granular filtering. The custom values are preserved. Note that changing a label higher up in the hierarchy might render the subsequent labels incompatible. For example, changing the kubernetes_namespace_name > kubernetes_deployment_name hierarchy to swarm_service_name > kubernetes_deployment_name is invalid as these entities belong to different orchestrators and cannot be logically grouped.

Dashboards and Panels

Dashboard scopes define the criteria for what metric data will be included in the dashboard’s panels. The current dashboard’s scope can be seen at the top of the dashboard:

By default, all dashboard panels abide by the scope of the overall dashboard. However, an individual panel scope can be configured for a different scope than the rest of the dashboard.

For more information on Dashboards and Panels, refer to the Dashboards documentation.

Alerts

Alert scopes are defined during the creation process, and specify what areas within the infrastructure the alert is applicable for. In the example alerts below, the first alert has a scope defined, whereas the second alert does not have a custom scope defined. If no scope is defined, the alert is applicable to the entire infrastructure.

For more information on Alerts, refer to the Alerts documentation.

Teams

A team’s scope determines the highest level of data that team members have visibility for:

  • If a team’s scope is set to Host, team members can see all host-level and container-level information.

  • If a team’s scope is set to Container, team members can only see container-level information.

A team’s scope only applies to that team. Users that are members of multiple teams may have different visibility depending on which team is active.

For more information on teams and configuring team scope, refer to the Manage Teams and Roles documentation.

Segments

Aggregated data can be split into smaller sections by segmenting the data with labels. This allows for the creation of multi-series comparisons and multiple alerts. In the first image, the metric is not segmented:

In the second image, the same metric has been segmented by container_id:

Line and Area panels can display any number of segments for any given metric. The example image below displays the sysdig_connection_net_in_bytes metric segmented by both container_id and host_hostname:
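
In PromQL terms, that segmented view corresponds roughly to the following sketch (wrapping the metric in rate() mirrors how the other byte metrics are queried elsewhere in this documentation and is an assumption here):

    # Inbound connection bytes split by container and host (illustrative)
    sum by (container_id, host_hostname) (rate(sysdig_connection_net_in_bytes[5m]))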

For more information regarding segmentation in dashboard panels, refer to the Configure Panels documentation. For more information regarding configuring alerts, refer to the Alerts documentation.

The Meaning of n/a

Sysdig Monitor imports data related to entities such as hosts, containers, processes, and so on, and reports them in tables or panels on the Explore and Dashboards UI, as well as in events, so you see a variety of data across the UI. The term n/a can appear anywhere on the UI where some form of data is displayed.

n/a is a term that indicates data that is not available or that does not apply to a particular instance. In Sysdig parlance, the term signifies one or more entities defined by a particular label, such as hostname or Kubernetes service, for which the label is invalid. In other words, n/a collectively represents entities whose metadata is not relevant to aggregation and filtering techniques (Grouping, Scoping, and Segmenting). For instance, a list of Kubernetes services might display all the services as well as an n/a entry that includes all the containers without the metadata describing a Kubernetes service.

You might encounter n/a sporadically in Explore UI as well as in drill-down panels or dashboards, events, and likely elsewhere on the Sysdig Monitor UI when no relevant metadata is available for that particular display. How n/a should be treated depends on the nature of your deployment. The deployment will not be affected by the entities marked n/a.

The following are some of the cases that yield n/a on the UI:

  • Labels are partially available or not available. For example, a host has entities that are not associated with a monitored Kubernetes deployment, or a monitored host has an unmonitored Kubernetes deployment running.

  • Labels that do not apply to the grouping criteria or at the hierarchy level. For example:

    • Containers that are not managed by Kubernetes. The containers managed by Kubernetes are identified with their  container_name labels.

    • In a grouping by DaemonSet, Deployments render n/a, and vice versa. Not all containers belong to both DaemonSet and Deployment objects concurrently. Likewise, a Kubernetes ReplicaSet grouping with the kubernetes_replicaset_name label will not show StatefulSets.

    • In a kubernetes_cluster_name > kubernetes_namespace_name > kubernetes_deployment_name  grouping, the entities without the kubernetes_cluster_name label yield n/a.

  • Entities are incorrectly labeled in the infrastructure.

  • Kubernetes features that are not yet in sync with Sysdig Monitor.

  • The format is not applicable to a particular record in the database.

4.2 - Understanding Default, Custom, and Missing Metrics

Default Metrics

Default metrics include various kinds of metadata which Sysdig Monitor automatically knows how to label, segment, and display.

For example:

  • System metrics for hosts, containers, and processes (CPU used, etc.)

  • Orchestrator metrics (collected from Kubernetes, Mesos, etc.)

  • Network metrics (e.g. network traffic)

  • HTTP

  • Platform metrics (in some cases)

Default metrics are collected mainly from two sources: syscalls and Kubernetes.

Custom Metrics

About Custom Metrics

Custom metrics generally refer to any metrics that the Sysdig Agent collects from a third-party integration. The type of infrastructure and applications integrated determine the custom metrics that the Agent collects and reports to Sysdig Monitor. The supported custom metric types include JMX, StatsD, Prometheus, and app check metrics.

Each metric comes with a set of custom labels, and additional labels can be user-created. Sysdig Monitor simply collects and reports them with minimal or no internal processing. Use the metrics_filter option in the dragent.yaml file to remove unwanted metrics or to choose the metrics to report when hosts exceed the custom metrics limit. For more information on editing the dragent.yaml file, see Understanding the Agent Config Files.

Unit for Custom Metrics

Sysdig Monitor detects the default unit of custom metrics automatically from the delimiter suffix in the metric name. For example, custom_expvar_time_seconds results in a base unit set to seconds. The supported base units are byte, percent, and time. Custom metric names should carry one of the following delimiter suffixes for Sysdig Monitor to identify and configure the correct unit type.

  • second

  • seconds

  • byte

  • bytes

  • total (represents accumulating count)

  • percent

If this naming convention is not followed, the unit will not be auto-detected and may be incorrect. For instance, custom_byte_expvar will not yield the correct unit (MiB) because its suffix is not one of the supported delimiters.

Editing the Unit Scale

You have the flexibility to change the unit scale either by editing the panel on the Dashboard or in Explore.

Explore

From the Search Metrics and Dashboard drop-down, select the custom metrics you want to edit the unit selection for, then click More Options. Select the desired unit scale from the Metric Format drop-down and click Save.

Dashboard

Select the Dashboard Panel associated with the custom metrics you want to modify. Select the desired unit scale from the Metrics drop-down and click Save.

Display Missing Data

Data can be missing for a few different reasons:

  • Problems such as faulty network connectivity in the communication channel between your infrastructure and Sysdig metrics store.

  • Metrics or StatsD batch jobs are submitted sporadically.

Sysdig Monitor allows you to configure the behavior of missing data in Dashboards. Though metric type determines the default behavior, you can configure how to visualize missing data and define it at the per-query level. Use the No Data Display drop-down in the Options menu in the panel configuration, and the No Data Message text box under the Panel tab. See Create a New Panel for more information.

Consider the following guidelines:

  • Use the No Data Message text box under the Panel tab to enter a custom message when no data is available to render on the panels. This custom message, which could include links in markdown format and line breaks, is shown when queries return no data and reports no errors.

  • The No Data Display drop-down has only two options for the Stacked Area timechart: gap and show as zero.

  • For form-based timechart panels, the default option for a metrics selection that does not contain a StatsD metric is gap.

  • Adding a StatsD metric to a query in a form-based timechart panel will default the selected No Data Display type to show as zero, which is the default option for form-based StatsD metrics. You can change this selection to any other type.

  • The default display option is gap for PromQL Timechart panels.

The options for No Data Display are:

  • gap: The default option for form-based timechart panels, where a query metrics selection does not contain a StatsD metric. gap is the best visualization type for most use cases because gaps are easy to spot and clearly indicate a problem.

  • show as zero: The best option for StatsD metrics which are only submitted sporadically. For example, batch jobs and count of errors. This is the default display option for StatsD metrics in form-based panels.

    Outside of such cases, we do not recommend this option, as reporting zero could be misleading. For example, this setting will report the value for free disk space as 0% when the disk or host disappears, but in reality, the value is unknown.

    Prometheus best practices recommend avoiding missing metrics.

  • connect - solid: Use for measuring the value of a metric, typically a gauge, where you want to visualize the missing samples flattened.

    The leftmost and rightmost visible data points can be connected as Sysdig does not perform the interpolation.

  • connect - dotted: Use it for measuring the value of a metric, typically a gauge, where you want to visualize the missing samples flattened.

    The leftmost and rightmost visible data points can be connected as Sysdig does not perform the interpolation.

4.3 - Metric Limits

Metric limits determine the amount of custom time series ingested by the Sysdig agent. While this is primarily a tool to limit the total time series ingested and control cost exposure for each user, it also affects the total number of time series consumed and used in tracking metrics.

The Sysdig agent metric limit is different from the entitlement limit imposed on custom time series. Your time series entitlement could be lower than agent metric limits. For more information, see Time Series Billing.

View Metric Limits

The metric limits are automatically defined by Sysdig backend components based on your plan, agent version, and backend configuration. Metric limits are set per-category, and when aggregated the per-category limits define your overall metric limit per agent. Metric limits are global per account and the same limit will apply to each agent within a Sysdig account.

Use the Sysdig Agent Health & Status dashboard under Host Infrastructure templates to view per-category metric limits for your account, along with the current usage per host for each metric type.

Contact Sysdig Support to adjust metric limits for any category.

See the Sysdig Agent Health & Status dashboard to view the metric limits and current time series consumption for each agent.

Metrics | Description
statsd_dragent_metricCount_limit_appCheck | The maximum number of unique appCheck time series that are allowed in an individual sample from the agent per node.
statsd_dragent_metricCount_limit_statsd | The maximum number of unique StatsD time series that are allowed in an individual sample from the agent per node.
statsd_dragent_metricCount_limit_jmx | The maximum number of unique JMX time series that are allowed in an individual sample from the agent per node.
statsd_dragent_metricCount_limit_prometheus | The maximum number of unique Prometheus time series that are allowed in an individual sample from the agent per node.

4.4 - Sysdig Info Metrics

Sysdig provides Prometheus-compatible info metrics that expose infrastructure (sysdig_*_info) and Kubernetes (kube_*_info) labels. Info metrics are gauges with a value of 1 and carry the _info suffix.

For example, querying sysdig_host_info in PromQL Query will provide all labels associated with the host, such as:

  • agent_id
  • agent_tag_cluster
  • host_hostname
  • domain
  • host
  • host_domain
  • host_mac
  • instance_id

Although info metrics are available, all the metrics ingested by Sysdig agents are automatically enriched with this metadata, so you don’t need to do PromQL joins. For more information, see Run PromQL Queries Faster with Extended Label Set.
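
For reference, the classic info-metric join pattern that this enrichment makes unnecessary looks roughly like the following sketch, built only from the metric and labels listed above:

    # Copy the agent_tag_cluster label from the host info metric onto a host metric
    sysdig_host_cpu_used_percent
      * on (host_mac) group_left (agent_tag_cluster)
    sysdig_host_info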

4.5 - Manage Metric Scale

Sysdig provides several knobs for managing metric scale.

There are three primary ways in which you can include or exclude metrics, should you encounter unwanted metrics or reach metric limits.

  1. Include/exclude custom metrics by name filters.

    See Include/Exclude Custom Metrics.

  2. Include/exclude metrics emitted by certain containers, Kubernetes annotations, or any other container label at collection time.

    See Prioritize/Include/Exclude Designated Containers.

  3. Exclude metrics from unwanted ports.

    See Blacklist Ports.

4.6 - Data Aggregation

Sysdig Monitor allows users to adjust the aggregation settings when graphing or creating alerts for a metric, informing how Sysdig rolls up the available data samples in order to create the chart or evaluate the alert. There are two forms of aggregation used for metrics in Sysdig: time aggregation and group aggregation.

Time aggregation is always performed before group aggregation.

Time Aggregation

Time aggregation comes into effect in two overlapping situations:

  • Charts can only render a limited number of data points. To look at a wide range of data, Sysdig Monitor may need to aggregate granular data into larger samples for visualization.

  • Sysdig Monitor rolls up historical data over time.

    Sysdig retains rollups based on each aggregation type, to allow users to choose which data points to utilize when evaluating older data.

Sysdig agents collect 1-second samples and report data at 10-second resolution. The data is stored and reported every 10 seconds with the available aggregations (average, rate, min, max, sum) to make it available via the Sysdig Monitor UI and the API. For time series charts covering five minutes or less, data points are drawn at this 10-second resolution, and any time aggregation selections will have no effect. When an amount of time greater than five minutes is displayed, data points are drawn as an aggregate for an appropriate time interval. For example, for a chart covering one hour, each data point would reflect a one-minute interval.

At time intervals of one minute and above, charts can be configured to display different aggregates for the 10-second metrics used to calculate each datapoint.

Aggregation Type | Description
average | The average of the retrieved metric values across the time period.
rate | The average value of the metric across the time period evaluated.
maximum | The highest value during the time period evaluated.
minimum | The lowest value during the time period evaluated.
sum | The combined sum of the metric across the time period evaluated.

In the example images below, the kubernetes_deployment_replicas_available metric first uses average for time aggregation:

Then uses the sum for time aggregation:

  • Rate and average are very similar and often provide the same result. However, the calculation of each is different.

    • If time aggregation is set to one minute, the agent is supposed to retrieve six samples (one every 10 seconds).

    • In some cases, samples may be missing due to disconnections or other circumstances. Suppose only four samples are available: the average is calculated by dividing by four, while the rate is calculated by dividing by six (see the worked numbers after this list).

  • Most metrics are sampled once for each time interval, resulting in average and rate returning the same value. However, there will be a distinction for any metrics not reported at every time interval. For example, some custom statsd metrics.

  • Rate is currently referred to as timeAvg in the Sysdig Monitor API and advanced alerting language.

  • By default, average is used when displaying data points for a time interval.
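
To make the distinction concrete, here is a hypothetical illustration (the sample values are invented): if only four of the expected six 10-second samples are reported, with values 2, 4, 6, and 8, the average for that minute is (2 + 4 + 6 + 8) / 4 = 5, while the rate is (2 + 4 + 6 + 8) / 6 ≈ 3.3, because the rate always divides by the number of expected samples.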

Group Aggregation

Metrics applied to a group of items (for example, several containers, hosts, or nodes) are averaged between the members of the group by default. For example, three hosts report different CPU usage for one sample interval. The three values will be averaged, and reported on the chart as a single datapoint for that metric.

There are several different types of group aggregation:

Aggregation Type | Description
average | The average value of the interval’s samples.
maximum | The maximum value of the interval’s samples.
minimum | The minimum value of the interval’s samples.
sum | The combined value of all of the interval’s samples.

If a chart or alert is segmented, the group aggregation settings will be utilized for both aggregations across the whole group, and aggregation within each individual segmentation.

For example, the image below shows a chart for CPU% across the infrastructure:

When segmented by proc_name, the chart shows one CPU% line for each process:

Each line provides the average value for every process with the same name. To see the difference, change the group aggregation type to sum:

The metric aggregation value shown beside the metric name is for the time aggregation. While the screenshot shows AVG, the group aggregation is set to SUM.

Aggregation Examples

The tables below provide an example of how each type of aggregation works. The first table provides the metric data, while the second displays the resulting value for each type of aggregation.

In the example below, the CPU% metric is applied to a group of servers called webserver. The first chart shows metrics using average aggregation for both time and group. The second chart shows the metrics using maximum aggregation for both time and group.

For each one minute interval, the second chart renders the highest CPU usage value found from the servers in the webserver group and from all of the samples reported during the one minute interval. This view can be useful when searching for transient spikes in metrics over long periods of time, that would otherwise be missed with average aggregation.

The group aggregation type is dependent on the segmentation. For a view showing metrics for a group of items, the current group aggregation setting will revert to the default setting, if the Segment By selection is changed.

4.7 - Deprecated Metrics and Labels

Below is the list of metrics and labels that are discontinued with the introduction of the new metric store. We made an effort to not deprecate any metrics or labels that are used in existing alerts, but in case you encounter any issues, contact Sysdig Support.

We have applied automatic mapping of all net.*.request.time.worst metrics to net.*.request.time, because the maximum aggregation gives equivalent results and it was almost exclusively used in combination with these metrics.

Deprecated Metrics

The following metrics are no longer supported.

  • net.request.time.file
  • net.request.time.file.percent
  • net.request.time.local
  • net.request.time.local.percent
  • net.request.time.net
  • net.request.time.net.percent
  • net.request.time.nextTiers
  • net.request.time.nextTiers.percent
  • net.request.time.processing
  • net.request.time.processing.percent
  • net.request.time.worst.in
  • net.request.time.worst.out
  • net.incomplete.connection.count.total
  • net.http.request.time.worst
  • net.mongodb.request.time.worst
  • net.sql.request.time.worst
  • net.link.clientServer.bytes
  • net.link.delay.perRequest
  • net.link.serverClient.bytes
  • capacity.estimated.request.stolen.count
  • capacity.estimated.request.total.count
  • capacity.stolen.percent
  • capacity.total.percent
  • capacity.used.percent

Deprecated Labels

The following labels are no longer supported:

  • net.connection.client
  • net.connection.client.pid
  • net.connection.direction
  • net.connection.endpoint.tcp
  • net.connection.udp.inverted
  • net.connection.errorCode
  • net.connection.l4proto
  • net.connection.server
  • net.connection.server.pid
  • net.connection.state
  • net.role
  • cloudProvider.resource.endPoint
  • host.container.mappings
  • host.ip.all
  • host.ip.private
  • host.ip.public
  • host.server.port
  • host.isClientServer
  • host.isInstrumented
  • host.isInternal
  • host.procList.main
  • proc.id
  • proc.name.client
  • proc.name.server
  • program.environment
  • program.usernames
  • mesos_cluster
  • mesos_node
  • mesos_pid

In addition to this list, composite labels ending with the ‘.label’ string are no longer supported. For example, kubernetes.service.label is deprecated, but kubernetes.service.label.* labels are still supported.

4.8 - Troubleshooting Metrics

Troubleshooting Metrics

Troubleshooting metrics include program metrics, connection-level network metrics, Kubernetes troubleshooting metrics, HTTP URL metrics, and some SQL metrics. They are reported at a granular 10-second resolution and are stored for 4 days. Below is the list of troubleshooting metrics and the labels that you can use to segment them.

Program Level Metrics

  • sysdig_program_cpu_cores_used
  • sysdig_program_cpu_cores_used_percent
  • sysdig_program_cpu_used_percent
  • sysdig_program_memory_used_bytes
  • sysdig_program_net_in_bytes
  • sysdig_program_net_out_bytes
  • sysdig_program_net_connection_in_count
  • sysdig_program_net_connection_out_count
  • sysdig_program_net_connection_total_count
  • sysdig_program_net_error_count
  • sysdig_program_net_request_count
  • sysdig_program_net_request_in_count
  • sysdig_program_net_request_out_count
  • sysdig_program_net_request_time
  • sysdig_program_net_request_in_time
  • sysdig_program_net_tcp_queue_len
  • sysdig_program_proc_count
  • sysdig_program_thread_count
  • sysdig_program_up

In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the following labels to segment program metrics: program_cmd_line and program_name.
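For example, a minimal sketch that ranks programs by CPU usage, segmented with the program_name label listed above:

# Top 10 programs by average CPU usage across the infrastructure
topk(10, avg by (program_name) (sysdig_program_cpu_used_percent))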

Connection-Level Network Metrics

  • sysdig_connection_net_in_bytes
  • sysdig_connection_net_out_bytes
  • sysdig_connection_net_total_bytes
  • sysdig_connection_net_connection_in_count
  • sysdig_connection_net_connection_out_count
  • sysdig_connection_net_connection_total_count
  • sysdig_connection_net_request_in_count
  • sysdig_connection_net_request_out_count
  • sysdig_connection_net_request_count
  • sysdig_connection_net_request_in_time
  • sysdig_connection_net_request_out_time
  • sysdig_connection_net_request_time

In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the following labels to segment connection-level metrics: net_local_service, net_remote_service, net_local_endpoint, net_remote_endpoint, net_client_ip, net_server_ip, and net_protocol.
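As an illustration, a minimal sketch that segments inbound connection traffic by remote endpoint (the label comes from the list above; wrap the metric in rate() if your use case needs a per-second value):

# Inbound bytes segmented by remote endpoint
sum by (net_remote_endpoint) (sysdig_connection_net_in_bytes)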

Kubernetes Troubleshooting Metrics

  • kube_workload_status_replicas_misscheduled
  • kube_workload_status_replicas_scheduled
  • kube_workload_status_replicas_updated
  • kube_pod_container_status_last_terminated_reason
  • kube_pod_container_status_ready
  • kube_pod_container_status_restarts_total
  • kube_pod_container_status_running
  • kube_pod_container_status_terminated
  • kube_pod_container_status_terminated_reason
  • kube_pod_container_status_waiting
  • kube_pod_container_status_waiting_reason
  • kube_pod_init_container_status_last_terminated_reason
  • kube_pod_init_container_status_ready
  • kube_pod_init_container_status_restarts_total
  • kube_pod_init_container_status_running
  • kube_pod_init_container_status_terminated
  • kube_pod_init_container_status_terminated_reason
  • kube_pod_init_container_status_waiting
  • kube_pod_init_container_status_waiting_reason

HTTP URL Metrics

  • sysdig_host_net_http_url_error_count
  • sysdig_host_net_http_url_request_count
  • sysdig_host_net_http_url_request_time
  • sysdig_container_net_http_url_error_count
  • sysdig_container_net_http_url_request_count
  • sysdig_container_net_http_url_request_time

In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the net_http_url label to segment HTTP URL metrics.
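For example, a hedged sketch that surfaces the slowest URLs by request time, segmented with the net_http_url label:

# Top 10 slowest URLs by average request time
topk(10, avg by (net_http_url) (sysdig_container_net_http_url_request_time))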

SQL Query Metrics

  • sysdig_host_net_sql_query_error_count
  • sysdig_host_net_sql_query_request_count
  • sysdig_host_net_sql_query_request_time
  • sysdig_host_net_sql_querytype_error_count
  • sysdig_host_net_sql_querytype_request_count
  • sysdig_host_net_sql_querytype_request_time
  • sysdig_container_net_sql_query_error_count
  • sysdig_container_net_sql_query_request_count
  • sysdig_container_net_sql_query_request_time
  • sysdig_container_net_sql_querytype_error_count
  • sysdig_container_net_sql_querytype_request_count
  • sysdig_container_net_sql_querytype_request_time

In addition to the user-defined labels and the standard set of labels Sysdig provides, you can use the net_sql_querytype label to segment SQL metrics by query type.
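For example, a minimal sketch that breaks SQL errors down by query type using the net_sql_querytype label:

# SQL errors segmented by query type
sum by (net_sql_querytype) (sysdig_container_net_sql_querytype_error_count)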

4.9 - Heuristic and Deprecated Metrics

Various network-related metrics reported by Sysdig, including response times, are calculated at the kernel level by measuring the latency between system calls.

Additional heuristic metric details are listed below:

Metric | Metric in Legacy Format

  • host_net_http_request_time, container_net_http_request_time | net.http.request.time
  • host_net_http_request_count, container_net_http_request_count | net.http.request.count
  • host_net_http_error_count, container_net_http_error_count | net.http.error.count
  • host_net_sql_request_time, container_net_sql_request_time | net.sql.request.time
  • host_net_sql_request_count, container_net_sql_request_count | net.sql.request.count
  • host_net_sql_error_count, container_net_sql_error_count | net.sql.error.count
  • host_net_mongodb_request_time, container_net_mongodb_request_time | net.mongodb.request.time
  • host_net_mongodb_request_count, container_net_mongodb_request_count | net.mongodb.request.count
  • host_net_mongodb_error_count, container_net_mongodb_error_count | net.mongodb.error.count
  • sysdig_host_net_request_time, sysdig_container_net_request_time, sysdig_program_net_request_time, sysdig_connection_net_request_time | net.request.time
  • sysdig_host_net_request_in_time, sysdig_container_net_request_in_time, sysdig_program_net_request_in_time, sysdig_connection_net_request_in_time | net.request.time.in
  • sysdig_host_net_request_out_time, sysdig_container_net_request_out_time, sysdig_program_net_request_out_time, sysdig_connection_net_request_out_time | net.request.time.out
  • sysdig_host_net_request_count, sysdig_container_net_request_count, sysdig_program_net_request_count, sysdig_connection_net_request_count | net.request.count
  • sysdig_host_net_request_in_count, sysdig_container_net_request_in_count, sysdig_program_net_request_in_count, sysdig_connection_net_request_in_count | net.request.count.in

4.10 - Metrics Library

The Sysdig metrics dictionary lists all the metrics, both in Sysdig legacy and Prometheus-compatible notation, supported by the Sysdig product suite, as well as kube state and cloud provider metrics. The Metrics Dictionary is a living document and is updated as new metrics are added to the product.

4.10.1 - Metrics and Labels Mapping

This topic outlines the mapping between the metric and label naming conventions in the Sysdig legacy datastore and the new Sysdig datastore.

4.10.1.1 - Mapping Classic Metrics with Context-Specific PromQL Metrics

Sysdig classic metrics such as cpu.used.percent previously returned values from a process, container, or host depending on the query segmentation or scope. You can now use context-explicit metrics, which align with the flat model and resource-specific semantics of the Prometheus naming schema. Your existing dashboards and alerts are automatically migrated to the new naming convention.
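For example, where a classic query on cpu.used.percent relied on scope to decide whether it referred to a host or a container, the context is now explicit in the metric name. A minimal sketch (the host and container label names are assumptions drawn from the label mappings later in this section):

# Host-level CPU usage
avg by (host) (sysdig_host_cpu_used_percent)

# Container-level CPU usage
avg by (container) (sysdig_container_cpu_used_percent)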

Sysdig Classic Metric | Context-Specific Metrics in Prometheus Notation

  • cpu.cores.used | sysdig_container_cpu_cores_used, sysdig_host_cpu_cores_used, sysdig_program_cpu_cores_used
  • cpu.cores.used.percent | sysdig_container_cpu_cores_used_percent, sysdig_host_cpu_cores_used_percent, sysdig_program_cpu_cores_used_percent
  • cpu.used.percent | sysdig_container_cpu_used_percent, sysdig_host_cpu_used_percent, sysdig_program_cpu_used_percent
  • fd.used.percent | sysdig_container_fd_used_percent, sysdig_host_fd_used_percent, sysdig_program_fd_used_percent
  • file.bytes.in | sysdig_container_file_in_bytes, sysdig_host_file_in_bytes, sysdig_program_file_in_bytes
  • file.bytes.out | sysdig_container_file_out_bytes, sysdig_host_file_out_bytes, sysdig_program_file_out_bytes
  • file.bytes.total | sysdig_container_file_total_bytes, sysdig_host_file_total_bytes, sysdig_program_file_total_bytes
  • file.error.open.count | sysdig_container_file_error_open_count, sysdig_host_file_error_open_count, sysdig_program_file_error_open_count
  • file.error.total.count | sysdig_container_file_error_total_count, sysdig_host_file_error_total_count, sysdig_program_file_error_total_count
  • file.iops.in | sysdig_container_file_in_iops, sysdig_host_file_in_iops, sysdig_program_file_in_iops
  • file.iops.out | sysdig_container_file_out_iops, sysdig_host_file_out_iops, sysdig_program_file_out_iops
  • file.iops.total | sysdig_container_file_total_iops, sysdig_host_file_total_iops, sysdig_program_file_total_iops
  • file.open.count | sysdig_container_file_open_count, sysdig_host_file_open_count, sysdig_program_file_open_count
  • file.time.in | sysdig_container_file_in_time, sysdig_host_file_in_time, sysdig_program_file_in_time
  • file.time.out | sysdig_container_file_out_time, sysdig_host_file_out_time, sysdig_program_file_out_time
  • file.time.total | sysdig_container_file_total_time, sysdig_host_file_total_time, sysdig_program_file_total_time
  • fs.bytes.free | sysdig_container_fs_free_bytes, sysdig_fs_free_bytes, sysdig_host_fs_free_bytes
  • fs.bytes.total | sysdig_container_fs_total_bytes, sysdig_fs_total_bytes, sysdig_host_fs_total_bytes
  • fs.bytes.used | sysdig_container_fs_used_bytes, sysdig_fs_used_bytes, sysdig_host_fs_used_bytes
  • fs.free.percent | sysdig_container_fs_free_percent, sysdig_fs_free_percent, sysdig_host_fs_free_percent
  • fs.inodes.total.count | sysdig_container_fs_inodes_total_count, sysdig_fs_inodes_total_count, sysdig_host_fs_inodes_total_count
  • fs.inodes.used.count | sysdig_container_fs_inodes_used_count, sysdig_fs_inodes_used_count, sysdig_host_fs_inodes_used_count
  • fs.inodes.used.percent | sysdig_container_fs_inodes_used_percent, sysdig_fs_inodes_used_percent, sysdig_host_fs_inodes_used_percent
  • fs.largest.used.percent | sysdig_container_fs_largest_used_percent, sysdig_host_fs_largest_used_percent
  • fs.root.used.percent | sysdig_container_fs_root_used_percent, sysdig_host_fs_root_used_percent
  • fs.used.percent | sysdig_container_fs_used_percent, sysdig_fs_used_percent, sysdig_host_fs_used_percent
  • host.error.count | sysdig_container_syscall_error_count, sysdig_host_syscall_error_count
  • info | sysdig_agent_info, sysdig_container_info, sysdig_host_info
  • memory.bytes.total | sysdig_host_memory_total_bytes
  • memory.bytes.used | sysdig_container_memory_used_bytes, sysdig_host_memory_used_bytes, sysdig_program_memory_used_bytes
  • memory.bytes.virtual | sysdig_container_memory_virtual_bytes, sysdig_host_memory_virtual_bytes
  • memory.swap.bytes.used | sysdig_container_memory_swap_used_bytes, sysdig_host_memory_swap_used_bytes
  • memory.used.percent | sysdig_container_memory_used_percent, sysdig_host_memory_used_percent
  • net.bytes.in | sysdig_connection_net_in_bytes, sysdig_container_net_in_bytes, sysdig_host_net_in_bytes, sysdig_program_net_in_bytes
  • net.bytes.out | sysdig_connection_net_out_bytes, sysdig_container_net_out_bytes, sysdig_host_net_out_bytes, sysdig_program_net_out_bytes
  • net.bytes.total | sysdig_connection_net_total_bytes, sysdig_container_net_total_bytes, sysdig_host_net_total_bytes, sysdig_program_net_total_bytes
  • net.connection.count.in | sysdig_connection_net_connection_in_count, sysdig_container_net_connection_in_count, sysdig_host_net_connection_in_count, sysdig_program_net_connection_in_count
  • net.connection.count.out | sysdig_connection_net_connection_out_count, sysdig_container_net_connection_out_count, sysdig_host_net_connection_out_count, sysdig_program_net_connection_out_count
  • net.connection.count.total | sysdig_connection_net_connection_total_count, sysdig_container_net_connection_total_count, sysdig_host_net_connection_total_count, sysdig_program_net_connection_total_count
  • net.request.count | sysdig_connection_net_request_count, sysdig_container_net_request_count, sysdig_host_net_request_count, sysdig_program_net_request_count
  • net.error.count | sysdig_container_net_error_count, sysdig_host_net_error_count, sysdig_program_net_error_count
  • net.request.count.in | sysdig_connection_net_request_in_count, sysdig_container_net_request_in_count, sysdig_host_net_request_in_count, sysdig_program_net_request_in_count
  • net.request.count.out | sysdig_connection_net_request_out_count, sysdig_container_net_request_out_count, sysdig_host_net_request_out_count, sysdig_program_net_request_out_count
  • net.request.time | sysdig_connection_net_request_time, sysdig_container_net_request_time, sysdig_host_net_request_time, sysdig_program_net_request_time
  • net.request.time.in | sysdig_connection_net_request_in_time, sysdig_container_net_request_in_time, sysdig_host_net_request_in_time, sysdig_program_net_request_in_time
  • net.request.time.out | sysdig_connection_net_request_out_time, sysdig_container_net_request_out_time, sysdig_host_net_request_out_time, sysdig_program_net_request_out_time
  • net.server.bytes.in | sysdig_container_net_server_in_bytes, sysdig_host_net_server_in_bytes
  • net.server.bytes.out | sysdig_container_net_server_out_bytes, sysdig_host_net_server_out_bytes
  • net.server.bytes.total | sysdig_container_net_server_total_bytes, sysdig_host_net_server_total_bytes
  • net.sql.error.count | sysdig_container_net_sql_error_count, sysdig_host_net_sql_error_count
  • net.sql.request.count | sysdig_container_net_sql_request_count, sysdig_host_net_sql_request_count
  • net.tcp.queue.len | sysdig_container_net_tcp_queue_len, sysdig_host_net_tcp_queue_len, sysdig_program_net_tcp_queue_len
  • proc.count | sysdig_container_proc_count, sysdig_host_proc_count, sysdig_program_proc_count
  • thread.count | sysdig_container_thread_count, sysdig_host_thread_count, sysdig_program_thread_count
  • uptime | sysdig_container_up, sysdig_host_up, sysdig_program_up

4.10.1.2 - Mapping Classic Metrics with PromQL Metrics

Starting with SaaS v3.2.6, Sysdig classic metrics and labels have been renamed to align with the Prometheus naming convention. For example, Sysdig classic metrics use a dot-oriented hierarchy, whereas Prometheus uses label-based metric organization. The table below helps you identify the Prometheus metrics and labels and the corresponding ones in the Sysdig classic system.

Table columns: Entity | Type | PromQL Metric Name | Classic Metric Name | Label | Classic Label

host

info

sysdig_host_info

Not exposed

  • host_mac

  • host

  • instance_id

  • agent_tag_{*}

  • host.mac

  • host.hostName

  • host.instanceId

  • agent.tag.{*}

sysdig_cloud_provider_info

  • host_mac

  • provider_id

  • account_id

  • region

  • availability_zone

  • instance_type

  • tag_{*}

  • security_groups

  • host_ip_public

  • host_ip_private

  • host_name

  • name

  • host.mac

  • cloudProvider.id

  • cloudProvider.account.id

  • cloudProvider.region

  • cloudProvider.availabilityZone

  • cloudProvider.instance.type

  • cloudProvider.tag.{*}

  • cloudProvider.securityGroups

  • cloudProvider.host.ip.public

  • cloudProvider.host.ip.private

  • cloudProvider.host.name

  • cloudProvider.name

data

sysdig_host_cpu_used_percent

cpu.used.percent

  • host_mac

  • host

  • host.mac

  • host.hostname

sysdig_host_cpu_cores_used

cpu.cores.used

sysdig_host_cpu_user_percent

cpu.user.percent

sysdig_host_cpu_idle_percent

cpu.idle.percent

sysdig_host_cpu_iowait_percent

cpu.iowait.percent

sysdig_host_cpu_nice_percent

cpu.nice.percent

sysdig_host_cpu_stolen_percent

cpu.stolen.percent

sysdig_host_cpu_system_percent

cpu.system.percent

sysdig_host_fd_used_percent

fd.used.percent

sysdig_host_file_error_open_count

file.error.open.count

sysdig_host_file_error_total_count

file.error.total.count

sysdig_host_file_in_bytes

file.bytes.in

sysdig_host_file_in_iops

file.iops.in

sysdig_host_file_in_time

file.time.in

sysdig_host_file_open_count

file.open.count

sysdig_host_file_out_bytes

file.bytes.out

sysdig_host_file_out_iops

file.iops.out

sysdig_host_file_out_time

file.time.out

sysdig_host_load_average_15m

load.average.15m

sysdig_host_load_average_1m

load.average.1m

sysdig_host_load_average_5m

load.average.5m

sysdig_host_memory_available_bytes

memory.bytes.available

sysdig_host_memory_total_bytes

memory.bytes.total

sysdig_host_memory_used_bytes

memory.bytes.used

sysdig_host_memory_swap_available_bytes

memory.swap.bytes.available

sysdig_host_memory_swap_total_bytes

memory.swap.bytes.total

sysdig_host_memory_swap_used_bytes

memory.swap.bytes.used

sysdig_host_memory_virtual_bytes

memory.bytes.virtual

sysdig_host_net_connection_in_count

net.connection.count.in

sysdig_host_net_connection_out_count

net.connection.count.out

sysdig_host_net_error_count

net.error.count

sysdig_host_net_in_bytes

net.bytes.in

sysdig_host_net_out_bytes

net.bytes.out

sysdig_host_net_tcp_queue_len

net.tcp.queue.len

sysdig_host_proc_count

proc.count

sysdig_host_system_uptime

system.uptime

sysdig_host_thread_count

thread.count

container

info

sysdig_container_info

Not exposed

container_id

container_id

container_full_id

none

host_mac

host.mac

container

container.name

container_type

container.type

image

container.image

image_id

container.image.id

mesos_task_id

container.mesosTaskId

Only available in Mesos orchestrator.

cluster

kubernetes.cluster.name

Present only if the container is part of Kubernetes.

pod

kubernetes.pod.name

Present only if the container is part of Kubernetes.

namespace

kubernetes.namespace.name

Present only if the container is part of Kubernetes.

data

sysdig_container_cpu_used_percent

cpu.used.percent

  • host_mac

  • container_id

  • container_type

  • container

  • host.mac

  • container.id

  • container.type

  • container.name

sysdig_container_cpu_cores_used

cpu.cores.used

sysdig_container_cpu_cores_used_percent

cpu.cores.used.percent

sysdig_container_cpu_quota_used_percent

cpu.quota.used.percent

sysdig_container_cpu_shares

cpu.shares.count

sysdig_container_cpu_shares_used_percent

cpu.shares.used.percent

sysdig_container_fd_used_percent

fd.used.percent

sysdig_container_file_error_open_count

file.error.open.count

sysdig_container_file_error_total_count

file.error.total.count

sysdig_container_file_in_bytes

file.bytes.in

sysdig_container_file_in_iops

file.iops.in

sysdig_container_file_in_time

file.time.in

sysdig_container_file_open_count

file.open.count

sysdig_container_file_out_bytes

file.bytes.out

sysdig_container_file_out_iops

file.iops.out

sysdig_container_file_out_time

file.time.out

sysdig_container_memory_limit_bytes

memory.limit.bytes

sysdig_container_memory_limit_used_percent

memory.limit.used.percent

sysdig_container_memory_swap_available_bytes

memory.swap.bytes.available

sysdig_container_memory_swap_total_bytes

memory.swap.bytes.total

sysdig_container_memory_swap_used_bytes

memory.swap.bytes.used

sysdig_container_memory_used_bytes

memory.bytes.used

sysdig_container_memory_virtual_bytes

memory.bytes.virtual

sysdig_container_net_connection_in_count

net.connection.count.in

sysdig_container_net_connection_out_count

net.connection.count.out

sysdig_container_net_error_count

net.error.count

sysdig_container_net_in_bytes

net.bytes.in

sysdig_container_net_out_bytes

net.bytes.out

sysdig_container_net_tcp_queue_len

net.tcp.queue.len

sysdig_container_proc_count

proc.count

sysdig_container_swap_limit_bytes

swap.limit.bytes

sysdig_container_thread_count

thread.count

Process / Program

Info

sysdig_program_info

not exposed

program

proc.name

cmd_line

proc.commandLine

host_mac

host.mac

container_id

container.id

container_type

container.type

data

sysdig_program_cpu_used_percent

cpu.used.percent

host_mac

host.mac

container_id

container.id

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

sysdig_program_memory_used_bytes

memory.bytes.used

host_mac

host.mac

container_id

container.id

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

sysdig_program_net_in_bytes

net.bytes.in

container_id

container.id

host_mac

host.mac

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

sysdig_program_net_out_bytes

net.bytes.out

host_mac

host.mac

container_id

container.id

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

sysdig_program_proc_count

proc.count

host_mac

host.mac

container_id

container.id

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

sysdig_program_thread_count

thread.count

host_mac

host.mac

container_id

container.id

container_type

container.type

program

proc.name

cmd_line

proc.commandLine

fs

info

sysdig_fs_info

not exposed

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

mount_dir

fs.mountDir

type

fs.type

data

sysdig_fs_free_bytes

fs.bytes.free

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

sysdig_fs_inodes_total_count

fs.inodes.total.count

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

sysdig_fs_inodes_used_count

fs.inodes.used.count

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

sysdig_fs_total_bytes

fs.bytes.total

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

sysdig_fs_used_bytes

fs.bytes.used

host_mac

host.mac

container_id

container.id

container_type

container.type

device

fs.device

4.10.1.3 - Mapping Legacy Sysdig Kubernetes Metrics with Prometheus Metrics

In Kubernetes parlance, these Prometheus metrics are Kube State Metrics. They are available in Sysdig PromQL and can be mapped to existing Sysdig Kubernetes metrics.

For descriptions of Kubernetes State Metrics, see Kubernetes State Metrics.

Table columns: Resource | Sysdig Metrics | Kubernetes State Metrics | Label | Example / More Information

Pod

kubernetes.pod.containers.waiting

kube_pod_container_status_waiting

  • container=<container-name>

  • pod=<pod-name>

  • namespace=<pod-namespace>

kubernetes.pod.resourceLimits.cpuCores

kubernetes.pod.resourceLimits.memBytes

kube_pod_container_resource_limits

kube_pod_sysdig_resource_limits_memory_bytes

kube_pod_sysdig_resource_limits_cpu_cores

  • resource=<resource-name>

  • unit=<resource-unit>

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • node=<node-name>

{namespace="default",pod="pod0",container="pod1_con1",resource="cpu",unit="core"}

{namespace="default",pod="pod0",container="pod1_con1",resource="memory",unit="byte"}

kubernetes.pod.resourceRequests.cpuCores

kubernetes.pod.resourceRequests.memBytes

kube_pod_container_resource_requests

kube_pod_sysdig_resource_requests_cpu_cores

kube_pod_sysdig_resource_requests_memory_bytes

  • resource=<resource-name>

  • unit=<resource-unit>

  • container=<container-name>

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • node=<node-name>

{namespace="default",pod="pod0",container="pod1_con1",resource="cpu",unit="core"}

{namespace="default",pod="pod0",container="pod1_con1",resource="memory",unit="byte"}

kubernetes.pod.status.ready

kube_pod_status_ready

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • condition=<true|false|unknown>

kube_pod_info

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • host_ip=<host-ip>

  • pod_ip=<pod-ip>

  • node=<node-name>

  • uid=<pod-uid>

{namespace="default",pod="pod0",host_ip="1.1.1.1",pod_ip="1.2.3.4",uid="abc-0",node="node1",created_by_kind="<none>",created_by_name="<none>",priority_class=""}

kube_pod_owner

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • owner_kind=<owner kind>

  • owner_name=<owner name>

{namespace="default",pod="pod0",owner_kind="<none>",owner_name="<none>;",owner_is_controller="<none>"}

kube_pod_labels

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • label_POD_LABEL=<POD_LABEL>

{namespace="default",pod="pod0", label_app="myApp"}

kube_pod_container_info

  • pod=<pod-name>

  • namespace=<pod-namespace>

  • container_id=<containerid>

{namespace="default",pod="pod0",container="container2",image="k8s.gcr.io/hyperkube2",image_id="docker://sha256:bbb",container_id="docker://cd456"}

node

kubernetes.node.allocatable.cpuCores

kube_node_status_allocatable_cpu_cores

  • node=<node-address>

  • resource=<resource-name>

  • unit=<resource-unit>

  • node=<node-address>

resource/unit have one of the values: (cpu, core); (memory, byte); (pods, integer). Sysdig currently supports only CPU, pods, and memory resources for kube_node_status_capacity metrics.

"# HELP kube_node_status_capacity The capacity for different resources of a node.
kube_node_status_capacity{node=""k8s-master"",resource=""hugepages_1Gi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-master"",resource=""hugepages_2Mi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-master"",resource=""memory"",unit=""byte""} 4.16342016e+09
kube_node_status_capacity{node=""k8s-master"",resource=""pods"",unit=""integer""} 110
kube_node_status_capacity{node=""k8s-node1"",resource=""pods"",unit=""integer""} 110
kube_node_status_capacity{node=""k8s-node1"",resource=""cpu"",unit=""core""} 2
kube_node_status_capacity{node=""k8s-node1"",resource=""hugepages_1Gi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-node1"",resource=""hugepages_2Mi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-node1"",resource=""memory"",unit=""byte""} 6.274154496e+09
kube_node_status_capacity{node=""k8s-node2"",resource=""hugepages_1Gi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-node2"",resource=""hugepages_2Mi"",unit=""byte""} 0
kube_node_status_capacity{node=""k8s-node2"",resource=""memory"",unit=""byte""} 6.274154496e+09
kube_node_status_capacity{node=""k8s-node2"",resource=""pods"",unit=""integer""} 110
kube_node_status_capacity{node=""k8s-node2"",resource=""cpu"",unit=""core""} 2

kubernetes.node.allocatable.memBytes

kube_node_status_allocatable_memory_bytes

kubernetes.node.allocatable.pods

kube_node_status_allocatable_pods

kubernetes.node.capacity.cpuCores

kube_node_status_capacity_cpu_cores

  • node=<node-address>

  • resource=<resource-name>

  • unit=<resource-unit>

  • node=<node-address>

kubernetes.node.capacity.memBytes

kube_node_status_capacity_memory_bytes

kubernetes.node.capacity.pod

kube_node_status_capacity_pods

kubernetes.node.diskPressure

kube_node_status_condition

  • node=<node-address

  • condition=<node-condition>

  • status=<true|false|unknown>

kubernetes.node.memoryPressure

kubernetes.node.networkUnavailable

kubernetes.node.outOfDisk

kubernetes.node.ready

kubernetes.node.unschedulable

kube_node_spec_unschedulable

  • node=<node-address>

kube_node_info

  • node=<node-address>

kube_node_labels

  • node=<node-address>

  • label_NODE_LABEL=<NODE_LABEL>

Deployment

kubernetes.deployment.replicas.available

kube_deployment_status_replicas_available

  • deployment=<deployment-name>

  • namespace=<deployment-namespace>

kubernetes.deployment.replicas.desired

kube_deployment_spec_replicas

kubernetes.deployment.replicas.paused

kube_deployment_spec_paused

kubernetes.deployment.replicas.running

kube_deployment_status_replicas

kubernetes.deployment.replicas.unavailable

kube_deployment_status_replicas_unavailable

kubernetes.deployment.replicas.updated

kube_deployment_status_replicas_updated

kube_deployment_labels

job

kubernetes.job.completions

kube_job_spec_completions

  • job_name=<job-name>

  • namespace=<job-namespace>

kubernetes.job.numFailed

kube_job_failed

kubernetes.job.numSucceeded

kube_job_complete

kubernetes.job.parallelism

kube_job_spec_parallelism

kube_job_status_active

kube_job_info

kube_job_owner

  • job_name=<job-name>

  • namespace=<job-namespace>

  • owner_kind=<owner kind>

  • owner_name=<owner name>

kube_job_labels

  • job_name=<job-name>

  • namespace=<job-namespace>

  • label_job_label=<job_label>

daemonSet

kubernetes.daemonSet.pods.desired

kube_daemonset_status_desired_number_scheduled

  • daemonset=<daemonset-name>

  • namespace=<daemonset-namespace>

kubernetes.daemonSet.pods.misscheduled

kube_daemonset_status_number_misscheduled

kubernetes.daemonSet.pods.ready

kube_daemonset_status_number_ready

kubernetes.daemonSet.pods.scheduled

kube_daemonset_status_current_number_scheduled

kube_daemonset_labels

  • daemonset=<daemonset-name>

  • namespace=<daemonset-namespace>

  • label_daemonset_label=<daemonset_label>

replicaSet

kubernetes.replicaSet.replicas.fullyLabeled

kube_replicaset_status_fully_labeled_replicas

  • replicaset=<replicaset-name>

  • namespace=<replicaset-namespace>

kubernetes.replicaSet.replicas.ready

kube_replicaset_status_ready_replicas

kubernetes.replicaSet.replicas.running

kube_replicaset_status_replicas

kubernetes.replicaSet.replicas.desired

kube_replicaset_spec_replicas

kube_replicaset_owner

  • replicaset=<replicaset-name>

  • namespace=<replicaset-namespace>

  • owner_kind=<owner kind>

  • owner_name=<owner name>

kube_replicaset_labels

  • label_replicaset_label=<replicaset_label>

  • replicaset=<replicaset-name>

  • namespace=<replicaset-namespace>

statefulset

kubernetes.statefulset.replicas

kube_statefulset_replicas

  • statefulset=<statefulset-name>

  • namespace=<statefulset-namespace>

kubernetes.statefulset.status.replicas

kube_statefulset_status_replicas

kubernetes.statefulset.status.replicas.current

kube_statefulset_status_replicas_current

kubernetes.statefulset.status.replicas.ready

kube_statefulset_status_replicas_ready

kubernetes.statefulset.status.replicas.updated

kube_statefulset_status_replicas_updated

kube_statefulset_labels

hpa

kubernetes.hpa.replicas.min

kube_horizontalpodautoscaler_spec_min_replicas

  • hpa=<hpa-name>

  • namespace=<hpa-namespace>

kubernetes.hpa.replicas.max

kube_horizontalpodautoscaler_spec_max_replicas

kubernetes.hpa.replicas.current

kube_horizontalpodautoscaler_status_current_replicas

kubernetes.hpa.replicas.desired

kube_horizontalpodautoscaler_status_desired_replicas

kube_horizontalpodautoscaler_labels

resourcequota

kubernetes.resourcequota.configmaps.hard

kubernetes.resourcequota.configmaps.used

kubernetes.resourcequota.limits.cpu.hard

kubernetes.resourcequota.limits.cpu.used

kubernetes.resourcequota.limits.memory.hard

kubernetes.resourcequota.limits.memory.used

kubernetes.resourcequota.persistentvolumeclaims.hard

kubernetes.resourcequota.persistentvolumeclaims.used

kubernetes.resourcequota.cpu.hard

kubernetes.resourcequota.memory.hard

kubernetes.resourcequota.pods.hard

kubernetes.resourcequota.pods.used

kubernetes.resourcequota.replicationcontrollers.hard

kubernetes.resourcequota.replicationcontrollers.used

kubernetes.resourcequota.requests.cpu.hard

kubernetes.resourcequota.requests.cpu.used

kubernetes.resourcequota.requests.memory.hard

kubernetes.resourcequota.requests.memory.used

kubernetes.resourcequota.requests.storage.hard

kubernetes.resourcequota.requests.storage.used

kubernetes.resourcequota.resourcequotas.hard

kubernetes.resourcequota.resourcequotas.used

kubernetes.resourcequota.secrets.hard

kubernetes.resourcequota.secrets.used

kubernetes.resourcequota.services.hard

kubernetes.resourcequota.services.used

kubernetes.resourcequota.services.loadbalancers.hard

kubernetes.resourcequota.services.loadbalancers.used

kubernetes.resourcequota.services.nodeports.hard

kubernetes.resourcequota.services.nodeports.used

kube_resourcequota

  • resourcequota=<quota-name>

  • namespace=<namespace>

  • resource=<ResourceName>

  • type=<quota-type>

namespace

kube_namespace_labels

  • namespace=<namespace-name>

  • label_ns_label=<ns_label>

replicationcontroller

kubernetes.replicationcontroller.replicas.desired

kube_replicationcontroller_spec_replicas

  • replicationcontroller=<replicationcontroller-name>

  • namespace=<replicationcontroller-namespace>

kubernetes.replicationcontroller.replicas.running

kube_replicationcontroller_status_replicas

kube_replicationcontroller_status_fully_labeled_replicas

kube_replicationcontroller_status_ready_replicas

kube_replicationcontroller_status_available_replicas

kube_replicationcontroller_status_observed_generation

kube_replicationcontroller_metadata_generation

kube_replicationcontroller_created

kube_replicationcontroller_owner

  • replicationcontroller=<replicationcontroller-name>

  • namespace=<replicationcontroller-namespace>

  • owner_kind=<owner kind>

  • owner_name=<owner name>

service

kube_service_info

  • service=<service-name>

  • namespace=<service-namespace>

  • cluster_ip=<service cluster ip>

  • external_name=<service external name>

  • load_balancer_ip=<service load balancer ip>

kube_service_labels

  • service=<service-name>

  • namespace=<service-namespace>

  • label_service_label=<service_label>

persistentvolume

kubernetes.persistentvolume.storage

kube_persistentvolume_capacity_bytes

  • persistentvolume=<pv-name>

kube_persistentvolume_info

  • persistentvolume=<pv-name>

kube_persistentvolume_labels

  • persistentvolume=<pv-name>

  • label_persistentvolume_label=<persistentvolume_label>

persistentvolumeclaim

kubernetes.persistentvolumeclaim.requests.storage

kube_persistentvolumeclaim_resource_requests_storage_bytes

  • namespace=<persistentvolumeclaim-namespace>

  • persistentvolumeclaim=<persistentvolumeclaim-name>

kube_persistentvolumeclaim_info

kube_persistentvolumeclaim_labels

  • persistentvolumeclaim=<persistentvolumeclaim-name>

  • namespace=<persistentvolumeclaim-namespace>

  • label_persistentvolumeclaim_label=<persistentvolumeclaim_label>

4.10.1.4 - Run PromQL Queries Faster with Extended Label Set

Sysdig allows you to run PromQL queries more smoothly and quickly with the extended label set. The extended label set is created by augmenting the incoming data with the rich metadata associated with your infrastructure and making it available in PromQL.

With this, you can troubleshoot a problem or build dashboards and alerts without writing complex queries. Sysdig automatically enriches your metrics with Kubernetes and application context without the need to instrument additional labels in your environment. This reduces operational complexity and cost, because the enrichment takes place in the Sysdig metric ingestion pipeline after time series have been sent to the backend.

Calculate Memory Usage by Deployment in a Cluster

Using vector matching, you could run the following query to calculate the memory usage by deployment in a cluster:

sum by(cluster,namespace,owner_name) ((sysdig_container_memory_used_bytes * on(container_id) group_left(pod,namespace,cluster) kube_pod_container_info) * on(pod,namespace,cluster) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~".+",cluster=~".+",namespace=~".+"})

To get the result, you need to write a query to perform a join (vector match) of various metrics, usually in the following order:

  • Grab a metric you need that is defined at the container level. For example, a Prometheus metric or one of the Sysdig-provided metrics, such as sysdig_container_memory_used_bytes.

  • Perform a vector match on container ID with the metric kube_pod_container_info to get the pod metadata.

  • Perform a vector match on the pod, namespace, and cluster with the kube_pod_owner metric.

With Sysdig’s extended label set for PromQL, all metrics inherit this metadata, so the necessary container, host, and Kubernetes labels are already set on every metric. This simplifies the query so you can build and run it quickly.

With the extended label set, the above query can be simplified as follows:

sum by (kube_cluster_name,kube_namespace_name,kube_deployment_name) (sysdig_container_memory_used_bytes{kube_cluster_name!="",kube_namespace_name!="",kube_deployment_name!=""})

The advantages of using the extended label set are:

  • Complex vector matching operations (the group_left and group_right operators) are no longer required. All the labels are already available on each metric, so any filtering can be performed directly on the metric itself.

  • The metrics now carry a much larger set of labels. You can use PromQL Explorer to navigate this rich metadata.

  • The metadata is distinguishable from user-defined labels. For example, Kubernetes metadata labels start with kube_; cluster becomes kube_cluster_name.

  • You can create a dashboard panel or an alert directly from a PromQL query you run in PromQL Explorer.

  • You can filter data by applying comparison operators on the label values given in the table.

Examples for Simplifying Queries

Below are some examples of using the extended label set to simplify complex query operations.

Memory Usage in a Kubernetes Cluster

Query with the core label set:

avg by (agent_tag_cluster) ((sysdig_host_memory_used_bytes/sysdig_host_memory_total_bytes) * on(host,agent_tag_cluster) sysdig_host_info{agent_tag_cluster=~".+"}) * 100

Query with the extended label set:

avg by (agent_tag_cluster) (sysdig_host_memory_used_bytes/sysdig_host_memory_total_bytes) * 100

CPU Usage in Containers

Query with the core label set:

sum by (cluster,namespace)(sysdig_container_cpu_cores_used * on (container_id) group_left(cluster,pod,namespace) kube_pod_container_info{cluster=~".+"})

Simplified query with the extended label set:

sum by (kube_cluster_name,kube_namespace_name)(sysdig_container_cpu_cores_used{kube_cluster_name=~".+"})

Memory Usage in Daemonset

Query with the core label set:

sum by(cluster,namespace,owner_name) (sum by(pod) (label_replace(sysdig_container_memory_used_bytes * on(container_id,host_mac) group_left(label_io_kubernetes_pod_namespace,label_io_kubernetes_pod_name,label_io_kubernetes_container_name) sysdig_container_info{label_io_kubernetes_pod_namespace=~".*",cluster=~".*"},"pod","$1","label_io_kubernetes_pod_name","(.*)"))  * on(pod) group_right sum by(cluster,namespace,owner_name,pod) (kube_pod_owner{owner_kind=~"DaemonSet",owner_name=~".*",cluster=~".*",namespace=~".*"}))

Simplified query with the extended label set:

sum by(kube_cluster_name,kube_namespace_name,kube_daemonset_name) (sysdig_container_memory_used_bytes{kube_daemonset_name=~".*",kube_cluster_name=~".*",kube_namespace_name=~".*"})

Pod Restarts in a Kubernetes Cluster

Query with the core label set:

sum by(cluster,namespace,owner_name)(changes(kube_pod_status_ready{condition="true",cluster=~$cluster,namespace=~$namespace}[$__interval]) * on(cluster,namespace,pod) group_left(owner_name) kube_pod_owner{owner_kind="Deployment",owner_name=~".+",cluster=~$cluster,namespace=~$namespace})

Simplified query with the extended label set:

sum by (kube_cluster_name,kube_namespace_name,kube_deployment_name)(changes(kube_pod_status_ready{condition="true",kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_deployment_name=~".+"}[$__interval]))

Containers per Image

Query with the core label set:

count by (owner_name,image,cluster,namespace)((sysdig_container_info{cluster=~$cluster,namespace=~$namespace})  * on(pod,namespace,cluster) group_left(owner_name) max by (pod,namespace,cluster,owner_name)(kube_pod_owner{owner_kind="Deployment",owner_name=~".+"}))

Simplified query with the extended label set:

count by (kube_deployment_name,image,kube_cluster_name,kube_namespace_name)(sysdig_container_info{kube_deployment_name=~".+",kube_cluster_name=~$cluster,kube_namespace_name=~$namespace})

Average TCP Queue per Node

Query with the core label set:

avg by (agent_tag_cluster,host)( sysdig_host_net_tcp_queue_len * on (host_mac) group_left(agent_tag_cluster,host) sysdig_host_info{agent_tag_cluster=~$cluster,host=~".+"})

Simplified query with the extended label set:

avg by (agent_tag_cluster,host_hostname) (sysdig_host_net_tcp_queue_len{agent_tag_cluster =~ $cluster})

4.10.2 - Metrics and Labels in Prometheus Format

The Prometheus metrics library lists the metrics in Prometheus format supported by the Sysdig product suite, as well as kube state and cloud provider metrics.

The metrics listed in this section follow the statsd-compatible Sysdig naming convention. To see a mapping between Prometheus notation and Sysdig notation, see Metrics and Label Mapping.

Overview

Each metric in the dictionary has several pieces of metadata listed to provide greater context for how the metric can be used within Sysdig products. An example layout is displayed below:

Metric Name

Metric definition. For some metrics, the equation for how the value is determined is provided.

Metadata

Definition

Metric Type

Metric type determines whether the metric value is a counter or a gauge. Sysdig Monitor offers two metric types:

Counter: A metric whose value only increases over time and builds on previous values. Counters record how many times something has happened, for example, a user login.

Gauge: A single numerical value that can arbitrarily fluctuate over time. Each value is an instantaneous measurement, for example, CPU usage.
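The distinction matters when writing PromQL: counters are typically wrapped in rate() or increase(), while gauges are read directly. A minimal sketch using two metrics from this library (whether rate() is appropriate depends on how a given counter is exposed in your environment):

# Counter: per-second rate of file I/O errors over the last 5 minutes
rate(sysdig_container_file_error_total_count[5m])

# Gauge: current filesystem usage, read directly
sysdig_container_fs_used_bytes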

Value Type

The type of value the metric can have. The possible values are:

  • Percent (%)

  • Byte

  • Date

  • Double

  • Integer (int)

  • relativeTime

  • String

Segment By

The levels within the infrastructure that the metric can be segmented at:

  • Host

  • Container

  • Process

  • Kubernetes

  • Mesos

  • Swarm

  • CloudProvider

Default Time Aggregation

The default time aggregation format for the metric.

Available Time Aggregation Formats

The time aggregation formats the metric can be aggregated by:

  • Average (Avg)

  • Rate

  • Sum

  • Minimum (Min)

  • Maximum (Max)

Default Group Aggregation

The default group aggregation format for the metric.

Available Group Aggregation Formats

The group aggregation formats the metric can be aggregated by:

  • Average (Avg)

  • Sum

  • Minimum (Min)

  • Maximum (Max)

4.10.2.1 - Agent

sysdig_agent_info

Prometheus ID: sysdig_agent_info
Legacy ID: info
Metric Type: gauge
Unit: number
Description: This metric always has the value of 1.
Additional Notes:

sysdig_agent_timeseries_count_appcheck

Prometheus ID: sysdig_agent_timeseries_count_appcheck
Legacy ID: metricCount.appCheck
Metric Type: gauge
Unit: number
Description: The total number of time series received from appcheck integrations.
Additional Notes:

sysdig_agent_timeseries_count_jmx

Prometheus ID: sysdig_agent_timeseries_count_jmx
Legacy ID: metricCount.jmx
Metric Type: gauge
Unit: number
Description: The total number of time series received from JMX integrations.
Additional Notes:

sysdig_agent_timeseries_count_prometheus

Prometheus ID: sysdig_agent_timeseries_count_prometheus
Legacy ID: metricCount.prometheus
Metric Type: gauge
Unit: number
Description: The total number of time series received from Prometheus integrations.
Additional Notes:

sysdig_agent_timeseries_count_statsd

Prometheus ID: sysdig_agent_timeseries_count_statsd
Legacy ID: metricCount.statsd
Metric Type: gauge
Unit: number
Description: The total number of time series received from StatsD integrations.
Additional Notes:
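These counters are useful for tracking where your time-series volume comes from. A minimal sketch (the host_hostname label is an assumption based on the extended label set described earlier):

# Hosts sending the most Prometheus time series through the agent
topk(10, sum by (host_hostname) (sysdig_agent_timeseries_count_prometheus))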

4.10.2.2 - Containers

sysdig_container_count

Prometheus ID: sysdig_container_count
Legacy ID: container.count
Metric Type: gauge
Unit: number
Description: The number of containers.
Additional Notes: This metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) containers of a certain type in a certain group or node - try segmenting by container.image, .id or .name. See also: host.count.
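For instance, a hedged sketch of an alert-style query that watches the number of containers per image (the image label is an assumption drawn from the label mapping section above):

# Flag images with fewer than 2 running containers
sum by (image) (sysdig_container_count) < 2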

sysdig_container_cpu_cgroup_used_percent

Prometheus ID: sysdig_container_cpu_cgroup_used_percent
Legacy ID: cpu.cgroup.used.percent
Metric Type: gauge
Unit: percent
Description: The percentage of a container’s cgroup limit that is actually used. This is the minimum usage for the underlying cgroup limits: cpuset.limit and quota.limit.
Additional Notes:

sysdig_container_cpu_cores_cgroup_limit

Prometheus ID: sysdig_container_cpu_cores_cgroup_limit
Legacy ID: cpu.cores.cgroup.limit
Metric Type: gauge
Unit: number
Description: The number of CPU cores assigned to a container. This is the minimum of the cgroup limits: cpuset.limit and quota.limit.
Additional Notes:

sysdig_container_cpu_cores_quota_limit

Prometheus ID: sysdig_container_cpu_cores_quota_limit
Legacy ID: cpu.cores.quota.limit
Metric Type: gauge
Unit: number
Description: The number of CPU cores assigned to a container. Technically, the container’s cgroup quota and period. This is a way of creating a CPU limit for a container.
Additional Notes:

sysdig_container_cpu_cores_used

Prometheus ID: sysdig_container_cpu_cores_used
Legacy ID: cpu.cores.used
Metric Type: gauge
Unit: number
Description: The CPU core usage of each container is obtained from cgroups, and is equal to the number of cores used by the container. For example, if a container uses two of an available four cores, the value of sysdig_container_cpu_cores_used will be two.
Additional Notes:

sysdig_container_cpu_cores_used_percent

Prometheus ID: sysdig_container_cpu_cores_used_percent
Legacy ID: cpu.cores.used.percent
Metric Type: gauge
Unit: percent
Description: The CPU core usage percent for each container is obtained from cgroups, and is equal to the number of cores used multiplied by 100. For example, if a container uses three cores, the value of sysdig_container_cpu_cores_used_percent would be 300%.
Additional Notes:

sysdig_container_cpu_quota_used_percent

Prometheus ID: sysdig_container_cpu_quota_used_percent
Legacy ID: cpu.quota.used.percent
Metric Type: gauge
Unit: percent
Description: The percentage of a container’s CPU Quota that is actually used. CPU Quotas are a common way of creating a CPU limit for a container. CPU Quotas are based on a percentage of time - a container can only spend its quota of time on CPU cycles across a given time period (default period is 100ms). Note that, unlike CPU Shares, CPU Quota is a hard limit to the amount of CPU the container can use - so this metric, CPU Quota %, should not exceed 100%.
Additional Notes:
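Because CPU quota is a hard limit, values approaching 100% indicate throttling risk. A minimal alerting sketch (the container label is an assumption based on the label mapping section above):

# Containers using more than 90% of their CPU quota
max by (container) (sysdig_container_cpu_quota_used_percent) > 90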

sysdig_container_cpu_shares_count

Prometheus ID: sysdig_container_cpu_shares_count
Legacy ID: cpu.shares.count
Metric Type: gauge
Unit: number
Description: The number of CPU shares assigned to a container (technically, the container’s cgroup) - this is a common way of creating a CPU limit for a container. CPU Shares represent a relative weight used by the kernel to distribute CPU cycles across different containers. The default value for a container is 1024. Each container receives its own allocation of CPU cycles, according to the ratio of its share count to the total number of shares claimed by all containers. For example, if you have three containers, each with 1024 shares, then each will receive 1/3 of the CPU cycles. Note that this is not a hard limit: a container can consume more than its allocation, if the CPU has cycles that aren’t being consumed by the container they were originally allocated to.
Additional Notes:

sysdig_container_cpu_shares_used_percent

Prometheus ID: sysdig_container_cpu_shares_used_percent
Legacy ID: cpu.shares.used.percent
Metric Type: gauge
Unit: percent
Description: The percentage of a container’s allocated CPU shares that are actually used. CPU Shares are a common way of creating a CPU limit for a container. CPU Shares represent a relative weight used by the kernel to distribute CPU cycles across different containers. The default value for a container is 1024. Each container receives its own allocation of CPU cycles, according to the ratio of its share count to the total number of shares claimed by all containers. For example, if you have three containers, each with 1024 shares, then each will receive 1/3 of the CPU cycles. Note that this is not a hard limit: a container can consume more than its allocation, if the CPU has cycles that aren’t being consumed by the container they were originally allocated to - so this metric, CPU Shares %, can actually exceed 100%.
Additional Notes:

sysdig_container_cpu_used_percent

Prometheus ID: sysdig_container_cpu_used_percent
Legacy ID: cpu.used.percent
Metric Type: gauge
Unit: percent
Description: The CPU usage for each container is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the container or processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and processes.
Additional Notes:

sysdig_container_fd_used_percent

Prometheus ID: sysdig_container_fd_used_percent
Legacy ID: fd.used.percent
Metric Type: gauge
Unit: percent
Description: The percentage of used file descriptors out of the maximum available.
Additional Notes: Usually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better use for alerts.

sysdig_container_file_error_open_count

Prometheus ID: sysdig_container_file_error_open_count
Legacy ID: file.error.open.count
Metric Type: counter
Unit: number
Description: The number of errors in opening files.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_error_total_count

Prometheus ID: sysdig_container_file_error_total_count
Legacy ID: file.error.total.count
Metric Type: counter
Unit: number
Description: The number of errors caused by file access.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_in_bytes

Prometheus ID: sysdig_container_file_in_bytes
Legacy ID: file.bytes.in
Metric Type: counter
Unit: data
Description: The number of bytes read from file.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_in_iops

Prometheus ID: sysdig_container_file_in_iops
Legacy ID: file.iops.in
Metric Type: counter
Unit: number
Description: The number of file read operations per second.
Additional Notes: This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_container_file_in_time

Prometheus ID: sysdig_container_file_in_time
Legacy ID: file.time.in
Metric Type: counter
Unit: time
Description: The time spent in file reading.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_open_count

Prometheus ID: sysdig_container_file_open_count
Legacy ID: file.open.count
Metric Type: counter
Unit: number
Description: The number of times files have been opened.
Additional Notes:

sysdig_container_file_out_bytes

Prometheus ID: sysdig_container_file_out_bytes
Legacy ID: file.bytes.out
Metric Type: counter
Unit: data
Description: The number of bytes written to file.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_out_iops

Prometheus ID: sysdig_container_file_out_iops
Legacy ID: file.iops.out
Metric Type: counter
Unit: number
Description: The number of file write operations per second.
Additional Notes: This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_container_file_out_time

Prometheus ID: sysdig_container_file_out_time
Legacy ID: file.time.out
Metric Type: counter
Unit: time
Description: The time spent in file writing.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_total_bytes

Prometheus ID: sysdig_container_file_total_bytes
Legacy ID: file.bytes.total
Metric Type: counter
Unit: data
Description: The number of bytes read from and written to file.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_file_total_iops

Prometheus ID: sysdig_container_file_total_iops
Legacy ID: file.iops.total
Metric Type: counter
Unit: number
Description: The number of read and write file operations per second.
Additional Notes: This is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_container_file_total_time

Prometheus ID: sysdig_container_file_total_time
Legacy ID: file.time.total
Metric Type: counter
Unit: time
Description: The time spent in file I/O.
Additional Notes: By default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_fs_free_bytes

Prometheus ID: sysdig_container_fs_free_bytes
Legacy ID: fs.bytes.free
Metric Type: gauge
Unit: data
Description: The available space in the filesystem.
Additional Notes: Container Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_free_percent

Prometheus ID: sysdig_container_fs_free_percent
Legacy ID: fs.free.percent
Metric Type: gauge
Unit: percent
Description: The percentage of free space in the filesystem.
Additional Notes: Container Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_inodes_total_count

Prometheus ID: sysdig_container_fs_inodes_total_count
Legacy ID: fs.inodes.total.count
Metric Type: gauge
Unit: number
Description: The total number of inodes in the filesystem.
Additional Notes: Container Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_inodes_used_count

Prometheus ID: sysdig_container_fs_inodes_used_count
Legacy ID: fs.inodes.used.count
Metric Type: gauge
Unit: number
Description: The number of inodes used in the filesystem.
Additional Notes: Container Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_inodes_used_percent

Prometheus ID: sysdig_container_fs_inodes_used_percent
Legacy ID: fs.inodes.used.percent
Metric Type: gauge
Unit: percent
Description: The percentage of inodes used in the filesystem.
Additional Notes: Container Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_largest_used_percent

Prometheus IDsysdig_container_fs_largest_used_percent
Legacy IDfs.largest.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of the largest filesystem in use.
Additional NotesContainer Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_root_used_percent

Prometheus IDsysdig_container_fs_root_used_percent
Legacy IDfs.root.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of the root filesystem in use in the container.
Additional NotesContainer Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_total_bytes

Prometheus IDsysdig_container_fs_total_bytes
Legacy IDfs.bytes.total
Metric Typegauge
Unitdata
DescriptionThe size of the container filesystem.
Additional NotesContainer Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_used_bytes

Prometheus IDsysdig_container_fs_used_bytes
Legacy IDfs.bytes.used
Metric Typegauge
Unitdata
DescriptionThe used space in the container filesystem.
Additional NotesContainer Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.

sysdig_container_fs_used_percent

Prometheus IDsysdig_container_fs_used_percent
Legacy IDfs.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of the sum of all filesystems in use in the container.
Additional NotesContainer Filesystem metrics report data on filesystems mounted to containers. These are the most useful metrics for stateful containers which have dedicated file storage mounted. Use these metrics with appropriate scoping. Care should be taken when aggregating filesystem metrics to ensure that there is no “double counting” of filesystems that are mounted to multiple containers. Additionally, the metrics from overlay type file systems are generally not reported, so these metrics typically will not show the actual space consumed by a container.
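
As a usage sketch only: when the same volume is mounted into several containers, summing these series double counts the underlying filesystem. One way to avoid that in PromQL is to deduplicate per device, for example by taking the maximum per host and device (this assumes the host_hostname and fs_device labels described in the Labels section are present on these series in your environment):
    max by (host_hostname, fs_device) (sysdig_container_fs_used_bytes)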

sysdig_container_info

Prometheus IDsysdig_container_info
Legacy IDinfo
Metric Typegauge
Unitnumber
DescriptionThe info metric always has a value of 1.
Additional Notes

sysdig_container_memory_limit_bytes

Prometheus IDsysdig_container_memory_limit_bytes
Legacy IDmemory.limit.bytes
Metric Typegauge
Unitdata
DescriptionThe memory limit in bytes assigned to a container.
Additional Notes

sysdig_container_memory_limit_used_percent

Prometheus IDsysdig_container_memory_limit_used_percent
Legacy IDmemory.limit.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of memory limit used by a container.
Additional Notes
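
As an illustrative PromQL sketch, a simple threshold over this metric can surface containers approaching their memory limit; the 90 percent threshold is an arbitrary example value, not part of the metric definition:
    sysdig_container_memory_limit_used_percent > 90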

sysdig_container_memory_used_bytes

Prometheus IDsysdig_container_memory_used_bytes
Legacy IDmemory.bytes.used
Metric Typegauge
Unitdata
DescriptionThe amount of physical memory currently in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_container_memory_used_percent

Prometheus IDsysdig_container_memory_used_percent
Legacy IDmemory.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of physical memory in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_memory_virtual_bytes

Prometheus IDsysdig_container_memory_virtual_bytes
Legacy IDmemory.bytes.virtual
Metric Typegauge
Unitdata
DescriptionThe virtual memory size of the process, in bytes. This value is obtained from Sysdig events.
Additional Notes

sysdig_container_net_connection_in_count

Prometheus IDsysdig_container_net_connection_in_count
Legacy IDnet.connection.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of currently established client (inbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_container_net_connection_out_count

Prometheus IDsysdig_container_net_connection_out_count
Legacy IDnet.connection.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of currently established server (outbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_container_net_connection_total_count

Prometheus IDsysdig_container_net_connection_total_count
Legacy IDnet.connection.count.total
Metric Typecounter
Unitnumber
DescriptionThe number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.
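
Since the note above suggests segmenting by protocol, port, or process, a PromQL sketch along these lines can highlight the busiest listeners; it assumes the container_name and net_server_port labels documented in the Labels section are attached to this metric in your environment:
    topk(5, sum by (container_name, net_server_port) (sysdig_container_net_connection_total_count))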

sysdig_container_net_error_count

Prometheus IDsysdig_container_net_error_count
Legacy IDnet.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of network errors.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_net_http_error_count

Prometheus IDsysdig_container_net_http_error_count
Legacy IDnet.http.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of failed HTTP requests as counted from 4xx/5xx status codes.
Additional Notes

sysdig_container_net_http_request_count

Prometheus IDsysdig_container_net_http_request_count
Legacy IDnet.http.request.count
Metric Typecounter
Unitnumber
DescriptionThe count of HTTP requests.
Additional Notes
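
Because both the HTTP error and request metrics above are counters, an approximate per-container HTTP error ratio can be derived with rate(); the 5m window and the container_name grouping are illustrative choices, not requirements:
    sum by (container_name) (rate(sysdig_container_net_http_error_count[5m]))
      / sum by (container_name) (rate(sysdig_container_net_http_request_count[5m]))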

sysdig_container_net_http_request_time

Prometheus IDsysdig_container_net_http_request_time
Legacy IDnet.http.request.time
Metric Typecounter
Unittime
DescriptionThe average time taken for HTTP requests.
Additional Notes

sysdig_container_net_http_statuscode_error_count

Prometheus IDsysdig_container_net_http_statuscode_error_count
Legacy IDnet.http.statuscode.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of HTTP error codes returned.
Additional Notes

sysdig_container_net_http_statuscode_request_count

Prometheus IDsysdig_container_net_http_statuscode_request_count
Legacy IDnet.http.statuscode.request.count
Metric Typecounter
Unitnumber
DescriptionThe number of HTTP requests per status code.
Additional Notes

sysdig_container_net_http_url_error_count

Prometheus IDsysdig_container_net_http_url_error_count
Legacy IDnet.http.url.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_http_url_request_count

Prometheus IDsysdig_container_net_http_url_request_count
Legacy IDnet.http.url.request.count
Metric Typecounter
Unitnumber
DescriptionThe number of HTTP requests per URL.
Additional Notes

sysdig_container_net_http_url_request_time

Prometheus IDsysdig_container_net_http_url_request_time
Legacy IDnet.http.url.request.time
Metric Typecounter
Unittime
DescriptionThe time taken for requesting HTTP URLs.
Additional Notes

sysdig_container_net_in_bytes

Prometheus IDsysdig_container_net_in_bytes
Legacy IDnet.bytes.in
Metric Typecounter
Unitdata
DescriptionThe number of inbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_net_mongodb_error_count

Prometheus IDsysdig_container_net_mongodb_error_count
Legacy IDnet.mongodb.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of failed MongoDB requests.
Additional Notes

sysdig_container_net_mongodb_request_count

Prometheus IDsysdig_container_net_mongodb_request_count
Legacy IDnet.mongodb.request.count
Metric Typecounter
Unitnumber
DescriptionThe total number of MongoDB requests.
Additional Notes

sysdig_container_net_out_bytes

Prometheus IDsysdig_container_net_out_bytes
Legacy IDnet.bytes.out
Metric Typecounter
Unitdata
DescriptionThe number of outbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_net_request_count

Prometheus IDsysdig_container_net_request_count
Legacy IDnet.request.count
Metric Typecounter
Unitnumber
DescriptionThe total number of network requests. Note that this value may exceed the sum of inbound and outbound requests because this count includes requests over internal connections.
Additional Notes

sysdig_container_net_request_in_count

Prometheus IDsysdig_container_net_request_in_count
Legacy IDnet.request.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of inbound network requests.
Additional Notes

sysdig_container_net_request_in_time

Prometheus IDsysdig_container_net_request_in_time
Legacy IDnet.request.time.in
Metric Typecounter
Unittime
DescriptionThe average time to serve an inbound request.
Additional Notes

sysdig_container_net_request_out_count

Prometheus IDsysdig_container_net_request_out_count
Legacy IDnet.request.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of outbound network requests.
Additional Notes

sysdig_container_net_request_out_time

Prometheus IDsysdig_container_net_request_out_time
Legacy IDnet.request.time.out
Metric Typecounter
Unittime
DescriptionThe average time spent waiting for an outbound request.
Additional Notes

sysdig_container_net_request_time

Prometheus IDsysdig_container_net_request_time
Legacy IDnet.request.time
Metric Typecounter
Unittime
DescriptionThe average time to serve a network request.
Additional Notes

sysdig_container_net_server_connection_in_count

Prometheus IDsysdig_container_net_server_connection_in_count
Legacy IDnet.server.connection.count.in
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_server_in_bytes

Prometheus IDsysdig_container_net_server_in_bytes
Legacy IDnet.server.bytes.in
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_container_net_server_out_bytes

Prometheus IDsysdig_container_net_server_out_bytes
Legacy IDnet.server.bytes.out
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_container_net_server_total_bytes

Prometheus IDsysdig_container_net_server_total_bytes
Legacy IDnet.server.bytes.total
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_container_net_sql_error_count

Prometheus IDsysdig_container_net_sql_error_count
Legacy IDnet.sql.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of failed SQL requests.
Additional Notes

sysdig_container_net_sql_query_error_count

Prometheus IDsysdig_container_net_sql_query_error_count
Legacy IDnet.sql.query.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_sql_query_request_count

Prometheus IDsysdig_container_net_sql_query_request_count
Legacy IDnet.sql.query.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_sql_query_request_time

Prometheus IDsysdig_container_net_sql_query_request_time
Legacy IDnet.sql.query.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_container_net_sql_querytype_error_count

Prometheus IDsysdig_container_net_sql_querytype_error_count
Legacy IDnet.sql.querytype.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_sql_querytype_request_count

Prometheus IDsysdig_container_net_sql_querytype_request_count
Legacy IDnet.sql.querytype.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_container_net_sql_querytype_request_time

Prometheus IDsysdig_container_net_sql_querytype_request_time
Legacy IDnet.sql.querytype.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_container_net_sql_request_count

Prometheus IDsysdig_container_net_sql_request_count
Legacy IDnet.sql.request.count
Metric Typecounter
Unitnumber
DescriptionThe number of SQL requests.
Additional Notes

sysdig_container_net_sql_request_time

Prometheus IDsysdig_container_net_sql_request_time
Legacy IDnet.sql.request.time
Metric Typecounter
Unittime
DescriptionThe average time to complete an SQL request.
Additional Notes

sysdig_container_net_sql_table_error_count

Prometheus IDsysdig_container_net_sql_table_error_count
Legacy IDnet.sql.table.error.count
Metric Typecounter
Unitnumber
DescriptionThe total number of SQL errors returned.
Additional Notes

sysdig_container_net_sql_table_request_count

Prometheus IDsysdig_container_net_sql_table_request_count
Legacy IDnet.sql.table.request.count
Metric Typecounter
Unitnumber
DescriptionThe total number of SQL table requests.
Additional Notes

sysdig_container_net_sql_table_request_time

Prometheus IDsysdig_container_net_sql_table_request_time
Legacy IDnet.sql.table.request.time
Metric Typecounter
Unittime
DescriptionThe average time to serve an SQL table request.
Additional Notes

sysdig_container_net_tcp_queue_len

Prometheus IDsysdig_container_net_tcp_queue_len
Legacy IDnet.tcp.queue.len
Metric Typecounter
Unitnumber
DescriptionThe length of the TCP request queue.
Additional Notes

sysdig_container_net_total_bytes

Prometheus IDsysdig_container_net_total_bytes
Legacy IDnet.bytes.total
Metric Typecounter
Unitdata
DescriptionThe total number of network bytes, including inbound and outbound connections.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.
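
As a sketch of how the inbound and outbound counters above are typically combined, total network throughput per container can be approximated with rate(); the 5m window and container_name grouping are example choices:
    sum by (container_name) (rate(sysdig_container_net_in_bytes[5m]) + rate(sysdig_container_net_out_bytes[5m]))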

sysdig_container_proc_count

Prometheus IDsysdig_container_proc_count
Legacy IDproc.count
Metric Typecounter
Unitnumber
DescriptionThe number of processes on the host or container.
Additional Notes

sysdig_container_swap_limit_bytes

Prometheus IDsysdig_container_swap_limit_bytes
Legacy IDswap.limit.bytes
Metric Typegauge
Unitdata
DescriptionThe swap limit in bytes assigned to a container.
Additional Notes

sysdig_container_swap_limit_used_percent

Prometheus IDsysdig_container_swap_limit_used_percent
Legacy IDswap.limit.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of swap limit used by the container.
Additional Notes

sysdig_container_syscall_count

Prometheus IDsysdig_container_syscall_count
Legacy IDsyscall.count
Metric Typegauge
Unitnumber
DescriptionThe total number of syscalls seen.
Additional NotesSyscalls are resource-intensive. This metric tracks how many have been made by a given process or container.
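
A minimal PromQL sketch for the note above, listing the containers generating the most syscalls (assuming the container_name label is present on this metric in your environment):
    topk(10, sum by (container_name) (sysdig_container_syscall_count))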

sysdig_container_syscall_error_count

Prometheus IDsysdig_container_syscall_error_count
Legacy IDhost.error.count
Metric Typecounter
Unitnumber
DescriptionThe number of system call errors.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_container_thread_count

Prometheus IDsysdig_container_thread_count
Legacy IDthread.count
Metric Typecounter
Unitnumber
DescriptionThe number of threads running in a container.
Additional Notes

sysdig_container_timeseries_count_appcheck

Prometheus IDsysdig_container_timeseries_count_appcheck
Legacy IDmetricCount.appCheck
Metric Typegauge
Unitnumber
DescriptionThe number of appcheck custom metrics.
Additional Notes

sysdig_container_timeseries_count_jmx

Prometheus IDsysdig_container_timeseries_count_jmx
Legacy IDmetricCount.jmx
Metric Typegauge
Unitnumber
DescriptionThe number of JMX custom metrics.
Additional Notes

sysdig_container_timeseries_count_prometheus

Prometheus IDsysdig_container_timeseries_count_prometheus
Legacy IDmetricCount.prometheus
Metric Typegauge
Unitnumber
DescriptionThe number of Prometheus custom metrics.
Additional Notes

sysdig_container_timeseries_count_statsd

Prometheus IDsysdig_container_timeseries_count_statsd
Legacy IDmetricCount.statsd
Metric Typegauge
Unitnumber
DescriptionThe number of StatsD custom metrics.
Additional Notes

sysdig_container_up

Prometheus IDsysdig_container_up
Legacy IDuptime
Metric Typegauge
Unitnumber
DescriptionThe percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.
Additional Notes

4.10.2.3 - Labels

_sysdig_datasource

Prometheus ID_sysdig_datasource
Legacy ID_sysdig_datasource
OSS KSM ID-
CategorySysdig
DescriptionIndicates the ingestion data source for the metric.
Additional Notes

agent_id

Prometheus IDagent_id
Legacy IDagent.id
OSS KSM ID-
CategoryAgent
DescriptionThe unique ID of the agent that sent the metric time series from the host.
Additional Notes

agent_mode

Prometheus IDagent_mode
Legacy IDagent.mode
OSS KSM ID-
CategoryAgent
Description
Additional Notes

agent_version

Prometheus IDagent_version
Legacy IDagent.version
OSS KSM ID-
CategoryAgent
DescriptionThe installed version of the Sysdig agent.
Additional Notes

cloud_provider_account_id

Prometheus IDcloud_provider_account_id
Legacy IDcloudProvider.account.id
OSS KSM ID-
CategoryCloud Provider
DescriptionThe account number related to your AWS account - useful when you have multiple AWS accounts linked with Sysdig Monitor.
Additional Notes

cloud_provider_availability_zone

Prometheus IDcloud_provider_availability_zone
Legacy IDcloudProvider.availabilityZone
OSS KSM ID-
CategoryCloud Provider
DescriptionThe AWS Availability Zone where the entity or entities are located. Each Availability zone is an isolated subsection of an AWS region (see cloudProvider.region).
Additional Notes

cloud_provider_host_ip_private

Prometheus IDcloud_provider_host_ip_private
Legacy IDcloudProvider.host.ip.private
OSS KSM ID-
CategoryCloud Provider
DescriptionThe private IP address allocated by the cloud provider for the instance. This address can be used for communication between instances in the same network.
Additional Notes

cloud_provider_host_ip_public

Prometheus IDcloud_provider_host_ip_public
Legacy IDcloudProvider.host.ip.public
OSS KSM ID-
CategoryCloud Provider
DescriptionPublic IP addresses of the selected host.
Additional Notes

cloud_provider_host_name

Prometheus IDcloud_provider_host_name
Legacy IDcloudProvider.host.name
OSS KSM ID-
CategoryCloud Provider
DescriptionThe name of the host as reported by the cloud provider (e.g. AWS).
Additional Notes

cloud_provider_id

Prometheus IDcloud_provider_id
Legacy IDcloudProvider.id
OSS KSM ID-
CategoryCloud Provider
DescriptionID number as assigned and reported by the cloud provider.
Additional Notes

cloud_provider_instance_type

Prometheus IDcloud_provider_instance_type
Legacy IDcloudProvider.instance.type
OSS KSM ID-
CategoryCloud Provider
DescriptionThe type of AWS instance.
Additional NotesThis label is extremely useful for segmenting instances and comparing their resource usage and saturation. You can use it as a grouping criterion in the Explore table to quickly review AWS usage on a per-instance-type basis. You can also use it to compare things like CPU usage, number of requests, or network utilization across instance types.

cloud_provider_name

Prometheus IDcloud_provider_name
Legacy IDcloudProvider.name
OSS KSM ID-
CategoryCloud Provider
DescriptionName of the cloud service provider (AWS, etc.).
Additional Notes

cloud_provider_region

Prometheus IDcloud_provider_region
Legacy IDcloudProvider.region
OSS KSM ID-
CategoryCloud Provider
DescriptionThe AWS region where the host (or group of hosts) is located.
Additional NotesUse this grouping criterion in conjunction with the host.count metric to easily create a report on how many instances you have in each region.

cloud_provider_resource_endpoint

Prometheus IDcloud_provider_resource_endpoint
Legacy IDcloudProvider.resource.endPoint
OSS KSM ID-
CategoryCloud Provider
DescriptionDNS name for which the resource can be accessed.
Additional Notes

cloud_provider_resource_name

Prometheus IDcloud_provider_resource_name
Legacy IDcloudProvider.resource.name
OSS KSM ID-
CategoryCloud Provider
DescriptionThe AWS service name (e.g. EC2, RDS, ELB).
Additional Notes

cloud_provider_resource_type

Prometheus IDcloud_provider_resource_type
Legacy IDcloudProvider.resource.type
OSS KSM ID-
CategoryCloud Provider
DescriptionThe service type (e.g. INSTANCE, LOAD_BALANCER, DATABASE).
Additional Notes

cloud_provider_security_groups

Prometheus IDcloud_provider_security_groups
Legacy IDcloudProvider.securityGroups
OSS KSM ID-
CategoryCloud Provider
DescriptionSecurity Groups Name.
Additional Notes

cloud_provider_status

Prometheus IDcloud_provider_status
Legacy IDcloudProvider.status
OSS KSM ID-
CategoryCloud Provider
DescriptionResource status.
Additional Notes

container_full_id

Prometheus IDcontainer_full_id
Legacy IDcontainer.full.id
OSS KSM ID-
CategoryContainer
DescriptionThe full UID of the running container as retrieved from the container runtime.
Additional Notes

container_id

Prometheus IDcontainer_id
Legacy IDcontainer.id
OSS KSM ID-
CategoryContainer
DescriptionThe short ID of the running container, obtained by truncating the full ID. In the case of Docker, this is a 12-digit hex number.
Additional Notes

container_image

Prometheus IDcontainer_image
Legacy IDcontainer.image
OSS KSM ID-
CategoryContainer
DescriptionThe name of the image used to run the container.
Additional Notes
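
As an illustrative use of this label together with the sysdig_container_info metric (which always has the value 1), the number of running containers per image can be counted with a PromQL sketch like:
    count by (container_image) (sysdig_container_info)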

container_image_digest

Prometheus IDcontainer_image_digest
Legacy IDcontainer.image.digest
OSS KSM ID-
CategoryContainer
DescriptionThe digest of the image used to run the container.
Additional Notes

container_image_id

Prometheus IDcontainer_image_id
Legacy IDcontainer.image.id
OSS KSM ID-
CategoryContainer
DescriptionThe ID of the image used to run the container.
Additional Notes

container_image_repo

Prometheus IDcontainer_image_repo
Legacy IDcontainer.image.repo
OSS KSM ID-
CategoryContainer
DescriptionThe repository from which the image used to run the container was retrieved. Empty if the image was not retrieved from a remote repository.
Additional Notes

container_image_tag

Prometheus IDcontainer_image_tag
Legacy IDcontainer.image.tag
OSS KSM ID-
CategoryContainer
DescriptionThe tag of the image used to run the container.
Additional Notes

container_label_io_kubernetes_container_name

Prometheus IDcontainer_label_io_kubernetes_container_name
Legacy IDcontainer.label.io.kubernetes.container.name
OSS KSM ID-
CategoryContainer
DescriptionLabel set on the container in the container runtime when running in a Kubernetes environment. This label will match the container name set in the Kubernetes manifest for the Pod.
Additional Notes

container_label_io_kubernetes_pod_name

Prometheus IDcontainer_label_io_kubernetes_pod_name
Legacy IDcontainer.label.io.kubernetes.pod.name
OSS KSM ID-
CategoryContainer
DescriptionLabel set on the container in the container runtime when running in a Kubernetes environment. This label will match the Pod name set in the Kubernetes manifest for the Pod.
Additional Notes

container_label_io_kubernetes_pod_namespace

Prometheus IDcontainer_label_io_kubernetes_pod_namespace
Legacy IDcontainer.label.io.kubernetes.pod.namespace
OSS KSM ID-
CategoryContainer
DescriptionLabel set on the container in the container runtime when running in a Kubernetes environment. This label will match the Pod namespace set in the Kubernetes manifest for the Pod.
Additional Notes

container_label_io_prometheus_path

Prometheus IDcontainer_label_io_prometheus_path
Legacy IDcontainer.label.io.prometheus.path
OSS KSM ID-
CategoryContainer
Description
Additional Notes

container_label_io_prometheus_port

Prometheus IDcontainer_label_io_prometheus_port
Legacy IDcontainer.label.io.prometheus.port
OSS KSM ID-
CategoryContainer
Description
Additional Notes

container_label_io_prometheus_scrape

Prometheus IDcontainer_label_io_prometheus_scrape
Legacy IDcontainer.label.io.prometheus.scrape
OSS KSM ID-
CategoryContainer
Description
Additional Notes

container_name

Prometheus IDcontainer_name
Legacy IDcontainer.name
OSS KSM ID-
CategoryContainer
DescriptionThe name of a running container.
Additional Notes

container_type

Prometheus IDcontainer_type
Legacy IDcontainer.type
OSS KSM ID-
CategoryContainer
Description
Additional Notes

cpu_core

Prometheus IDcpu_core
Legacy IDcpu.core
OSS KSM ID-
CategoryHost
DescriptionCPU core number
Additional Notes

ecs_cluster_name

Prometheus IDecs_cluster_name
Legacy IDecs.clusterName
OSS KSM ID-
CategoryECS
DescriptionAmazon ECS cluster name
Additional Notes

ecs_service_name

Prometheus IDecs_service_name
Legacy IDecs.serviceName
OSS KSM ID-
CategoryECS
DescriptionAmazon ECS service name
Additional Notes

ecs_task_family_name

Prometheus IDecs_task_family_name
Legacy IDecs.taskFamilyName
OSS KSM ID-
CategoryECS
DescriptionAmazon ECS task family name
Additional Notes

file_mount

Prometheus IDfile_mount
Legacy IDfile.mount
OSS KSM ID-
CategoryFile Stats
DescriptionFile stats mount path
Additional Notes

file_name

Prometheus IDfile_name
Legacy IDfile.name
OSS KSM ID-
CategoryFile Stats
DescriptionFile stats file name including its path
Additional Notes

fs_device

Prometheus IDfs_device
Legacy IDfs.device
OSS KSM ID-
CategoryFile System
DescriptionFile system device name
Additional Notes

fs_mount_dir

Prometheus IDfs_mount_dir
Legacy IDfs.mountDir
OSS KSM ID-
CategoryFile System
DescriptionFile system mounted dir
Additional Notes

fs_type

Prometheus IDfs_type
Legacy IDfs.type
OSS KSM ID-
CategoryFile System
DescriptionFile system type (e.g. EXT, NTFS)
Additional Notes

host_domain

Prometheus IDhost_domain
Legacy IDhost.domain
OSS KSM ID-
CategoryHost
DescriptionDomain name for external websites.
Additional Notes

host_hostname

Prometheus IDhost_hostname
Legacy IDhost.hostName
OSS KSM ID-
CategoryHost
DescriptionHost name as defined in the /etc/hostname file.
Additional Notes

host_instance_id

Prometheus IDhost_instance_id
Legacy IDhost.instanceId
OSS KSM ID-
CategoryHost
Description
Additional Notes

host_ip_private

Prometheus IDhost_ip_private
Legacy IDhost.ip.private
OSS KSM ID-
CategoryHost
DescriptionPrivate machine IP address.
Additional Notes

host_ip_public

Prometheus IDhost_ip_public
Legacy IDhost.ip.public
OSS KSM ID-
CategoryHost
DescriptionPublic machine IP address.
Additional Notes

host_mac

Prometheus IDhost_mac
Legacy IDhost.mac
OSS KSM ID-
CategoryHost
DescriptionMedia Access Control address of the host.
Additional Notes

kube_cluster_id

Prometheus IDkube_cluster_id
Legacy IDkubernetes.cluster.id
OSS KSM IDid
CategoryKubernetes
DescriptionUniquely identifying ID for a cluster
Additional NotesAs there is no concept of a cluster ID in Kubernetes, this label is populated with the UID of the “default” namespace in the cluster.

kube_cluster_name

Prometheus IDkube_cluster_name
Legacy IDkubernetes.cluster.name
OSS KSM IDcluster
CategoryKubernetes
DescriptionUser-defined name for the cluster
Additional NotesThe cluster name is set by the user via the “k8s_cluster_name” configuration parameter in the Agent or by adding an Agent tag with a key called “cluster”. If the user doesn’t set it, this label will not exist.
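
As an illustrative PromQL sketch, this label is commonly used to scope a query to a single cluster; "prod" below is a placeholder cluster name, not a value defined by Sysdig:
    sysdig_container_memory_used_percent{kube_cluster_name="prod"}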

concurrency_policy

Prometheus IDconcurrency_policy
Legacy IDkubernetes.cronjob.concurrencyPolicy
OSS KSM ID-
CategoryKubernetes
DescriptionSpecifies how to treat concurrent executions created by this Cron Job. Value can be “Allow”, “Forbid”, or “Replace”
Additional Notes

kube_cronjob_name

Prometheus IDkube_cronjob_name
Legacy IDkubernetes.cronjob.name
OSS KSM IDcronjob
CategoryKubernetes
DescriptionName of the Cron Job as retrieved from the API server.
Additional Notes

schedule

Prometheus IDschedule
Legacy IDkubernetes.cronjob.schedule
OSS KSM ID-
CategoryKubernetes
DescriptionThe scheduled time in which the Cron Job will run. Will be a Cron format string.
Additional Notes

kube_daemonset_name

Prometheus IDkube_daemonset_name
Legacy IDkubernetes.daemonSet.name
OSS KSM IDdaemonset
CategoryKubernetes
DescriptionName of the DaemonSet as retrieved from the API server.
Additional Notes

kube_daemonset_uid

Prometheus IDkube_daemonset_uid
Legacy IDkubernetes.daemonSet.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the DaemonSet as retrieved from the API server.
Additional Notes

kube_deployment_name

Prometheus IDkube_deployment_name
Legacy IDkubernetes.deployment.name
OSS KSM IDdeployment
CategoryKubernetes
DescriptionName of the Deployment as retrieved from the API server.
Additional Notes

kube_deployment_uid

Prometheus IDkube_deployment_uid
Legacy IDkubernetes.deployment.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Deployment as retrieved from the API server.
Additional Notes

kube_hpa_name

Prometheus IDkube_hpa_name
Legacy IDkubernetes.hpa.name
OSS KSM IDhpa
CategoryKubernetes
DescriptionName of the HPA as retrieved from the API server.
Additional Notes

kube_hpa_uid

Prometheus IDkube_hpa_uid
Legacy IDkubernetes.hpa.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the HPA as retrieved from the API server.
Additional Notes

kube_job_name

Prometheus IDkube_job_name
Legacy IDkubernetes.job.name
OSS KSM IDjob_name
CategoryKubernetes
DescriptionName of the Job as retrieved from the API server.
Additional Notes

kube_job_owner_is_controller

Prometheus IDkube_job_owner_is_controller
Legacy IDkubernetes.job.owner.isController
OSS KSM IDowner_is_controller
CategoryKubernetes
DescriptionDesignates whether the Job is created by a higher-level controller object
Additional Notes

kube_job_owner_kind

Prometheus IDkube_job_owner_kind
Legacy IDkubernetes.job.owner.kind
OSS KSM IDowner_kind
CategoryKubernetes
DescriptionThe workload resource type of the object that created the Job if owned by a higher-level controller object
Additional Notes

kube_job_owner_name

Prometheus IDkube_job_owner_name
Legacy IDkubernetes.job.owner.name
OSS KSM IDowner_name
CategoryKubernetes
DescriptionThe name of the object that created the Job if owned by a higher-level controller object
Additional Notes

kube_job_uid

Prometheus IDkube_job_uid
Legacy IDkubernetes.job.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Job as retrieved from the API server.
Additional Notes

kube_namespace_name

Prometheus IDkube_namespace_name
Legacy IDkubernetes.namespace.name
OSS KSM IDnamespace
CategoryKubernetes
DescriptionName of the Namespace as retrieved from the API server.
Additional Notes

kube_namespace_uid

Prometheus IDkube_namespace_uid
Legacy IDkubernetes.namespace.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Namespace as retrieved from the API server.
Additional Notes

kube_node_condition

Prometheus IDkube_node_condition
Legacy IDkubernetes.node.condition
OSS KSM IDcondition
CategoryKubernetes
DescriptionDescribes the status of the Node. Can be Ready, DiskPressure, OutOfDisk, MemoryPressure, or Unschedulable.
Additional Notes

kube_node_name

Prometheus IDkube_node_name
Legacy IDkubernetes.node.name
OSS KSM IDnode
CategoryKubernetes
DescriptionName of the Node as retrieved from the API server.
Additional Notes

kube_node_resource

Prometheus IDkube_node_resource
Legacy IDkubernetes.node.resource
OSS KSM IDresource
CategoryKubernetes
DescriptionIndicates the capacity or allocatable limit for the different resources of a node
Additional Notes

kube_node_status

Prometheus IDkube_node_status
Legacy IDkubernetes.node.status
OSS KSM IDstatus
CategoryKubernetes
DescriptionUsed in combination with the kube_node_condition label to indicate the boolean value of that label
Additional Notes

kube_node_uid

Prometheus IDkube_node_uid
Legacy IDkubernetes.node.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Node as retrieved from the API server.
Additional Notes

kube_node_unit

Prometheus IDkube_node_unit
Legacy IDkubernetes.node.unit
OSS KSM IDunit
CategoryKubernetes
DescriptionUsed in combination with the kube_node_resource label to indicate the unit of that label
Additional Notes

name

Prometheus IDname
Legacy IDkubernetes.persistentvolume.claim.ref.name
OSS KSM ID-
CategoryKubernetes
DescriptionName of the Persistent Volume’s claimRef as retrieved from the API server.
Additional Notes

claim_namespace

Prometheus IDclaim_namespace
Legacy IDkubernetes.persistentvolume.claim.ref.namespace
OSS KSM ID-
CategoryKubernetes
DescriptionNamespace of the Persistent Volume’s claimRef as retrieved from the API server.
Additional Notes

kube_persistentvolume_name

Prometheus IDkube_persistentvolume_name
Legacy IDkubernetes.persistentvolume.name
OSS KSM IDpersistentvolume
CategoryKubernetes
DescriptionName of the Persistent Volume as retrieved from the API server.
Additional Notes

kube_persistentvolume_uid

Prometheus IDkube_persistentvolume_uid
Legacy IDkubernetes.persistentvolume.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Persistent Volume as retrieved from the API server.
Additional Notes

access_mode

Prometheus IDaccess_mode
Legacy IDkubernetes.persistentvolumeclaim.accessMode
OSS KSM ID-
CategoryKubernetes
DescriptionAccess mode of the PVC as retrieved from the API server.
Additional Notes

status

Prometheus IDstatus
Legacy IDkubernetes.persistentvolumeclaim.condition.status
OSS KSM ID-
CategoryKubernetes
DescriptionUsed in combination with the type label to indicate the boolean value of that label
Additional Notes

type

Prometheus IDtype
Legacy IDkubernetes.persistentvolumeclaim.condition.type
OSS KSM ID-
CategoryKubernetes
DescriptionThe type of the condition that the PVC is in
Additional Notes

kube_persistentvolumeclaim_name

Prometheus IDkube_persistentvolumeclaim_name
Legacy IDkubernetes.persistentvolumeclaim.name
OSS KSM IDpersistentvolumeclaim
CategoryKubernetes
DescriptionName of the PVC as retrieved from the API server.
Additional Notes

phase

Prometheus IDphase
Legacy IDkubernetes.persistentvolumeclaim.phase
OSS KSM ID-
CategoryKubernetes
DescriptionThe phase that the PVC is in. Will be Available, Bound, Released, or Failed.
Additional Notes

kube_persistentvolumeclaim_uid

Prometheus IDkube_persistentvolumeclaim_uid
Legacy IDkubernetes.persistentvolumeclaim.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the PVC as retrieved from the API server.
Additional Notes

kube_pod_condition

Prometheus IDkube_pod_condition
Legacy IDkubernetes.pod.condition
OSS KSM IDcondition
CategoryKubernetes
DescriptionThe condition that the Pod is in. Will be PodScheduled, ContainersReady, Initialized, or Ready
Additional Notes

kube_pod_container_full_id

Prometheus IDkube_pod_container_full_id
Legacy IDkubernetes.pod.container.full.id
OSS KSM IDcontainer_full_id
CategoryKubernetes
DescriptionThe full UID of the container in the Pod
Additional Notes

kube_pod_container_id

Prometheus IDkube_pod_container_id
Legacy IDkubernetes.pod.container.id
OSS KSM IDcontainer_id
CategoryKubernetes
DescriptionA short ID from truncating the full UID of the container in the Pod
Additional Notes

kube_pod_container_name

Prometheus IDkube_pod_container_name
Legacy IDkubernetes.pod.container.name
OSS KSM IDcontainer
CategoryKubernetes
DescriptionThe name of the container in the Pod
Additional Notes

kube_pod_container_reason

Prometheus IDkube_pod_container_reason
Legacy IDkubernetes.pod.container.reason
OSS KSM IDreason
CategoryKubernetes
DescriptionThe reason that the container is in the state that it is in.
Additional Notes

kube_pod_internal_ip

Prometheus IDkube_pod_internal_ip
Legacy IDkubernetes.pod.internalIp
OSS KSM IDinternal_ip
CategoryKubernetes
DescriptionThe IP address associated with the Pod
Additional Notes

kube_pod_name

Prometheus IDkube_pod_name
Legacy IDkubernetes.pod.name
OSS KSM IDpod
CategoryKubernetes
DescriptionName of the Pod as retrieved from the API server.
Additional Notes

kube_pod_node

Prometheus IDkube_pod_node
Legacy IDkubernetes.pod.node
OSS KSM IDnode
CategoryKubernetes
DescriptionThe Node on which the Pod is running.
Additional Notes

kube_pod_owner_is_controller

Prometheus IDkube_pod_owner_is_controller
Legacy IDkubernetes.pod.owner.isController
OSS KSM IDowner_is_controller
CategoryKubernetes
DescriptionDesignates whether the Pod is created by a higher-level controller object
Additional Notes

kube_pod_owner_kind

Prometheus IDkube_pod_owner_kind
Legacy IDkubernetes.pod.owner.kind
OSS KSM IDowner_kind
CategoryKubernetes
DescriptionThe workload resource type of the object that created the Pod if owned by a higher-level controller object
Additional Notes

kube_pod_owner_name

Prometheus IDkube_pod_owner_name
Legacy IDkubernetes.pod.owner.name
OSS KSM IDowner_name
CategoryKubernetes
DescriptionThe name of the object that created the Pod if owned by a higher-level controller object
Additional Notes

kube_pod_persistentvolumeclaim

Prometheus IDkube_pod_persistentvolumeclaim
Legacy IDkubernetes.pod.persistentvolumeclaim
OSS KSM IDpersistentvolumeclaim
CategoryKubernetes
DescriptionThe name of the PVC associated with the Pod
Additional Notes

kube_pod_phase

Prometheus IDkube_pod_phase
Legacy IDkubernetes.pod.phase
OSS KSM IDphase
CategoryKubernetes
DescriptionThe phase that the Pod is in. Can be Pending, Running, Succeeded, Failed, or Unknown.
Additional Notes

kube_pod_pod_ip

Prometheus IDkube_pod_pod_ip
Legacy IDkubernetes.pod.pod.ip
OSS KSM IDpod_ip
CategoryKubernetes
DescriptionThe IP address associated with the Pod
Additional Notes

kube_pod_reason

Prometheus IDkube_pod_reason
Legacy IDkubernetes.pod.reason
OSS KSM IDreason
CategoryKubernetes
DescriptionThe reason the Pod is in the phase that it is in.
Additional Notes

kube_pod_resource

Prometheus IDkube_pod_resource
Legacy IDkubernetes.pod.resource
OSS KSM IDresource
CategoryKubernetes
DescriptionThe Pod’s resource limits and requests. Individual labels are created for memory limits, memory requests, CPU limits, and CPU requests
Additional Notes

kube_pod_uid

Prometheus IDkube_pod_uid
Legacy IDkubernetes.pod.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Pod as retrieved from the API server.
Additional Notes

kube_pod_unit

Prometheus IDkube_pod_unit
Legacy IDkubernetes.pod.unit
OSS KSM IDunit
CategoryKubernetes
DescriptionUsed in combination with the kube_pod_resource label to indicate the unit of the resource limit or request
Additional Notes

kube_pod_volume

Prometheus IDkube_pod_volume
Legacy IDkubernetes.pod.volume
OSS KSM IDvolume
CategoryKubernetes
DescriptionName of the volume associated with the Pod.
Additional Notes

kube_replicaset_name

Prometheus IDkube_replicaset_name
Legacy IDkubernetes.replicaSet.name
OSS KSM IDreplicaset
CategoryKubernetes
DescriptionName of the ReplicaSet as retrieved from the API server.
Additional Notes

kube_replicaset_owner_is_controller

Prometheus IDkube_replicaset_owner_is_controller
Legacy IDkubernetes.replicaSet.owner.isController
OSS KSM IDowner_is_controller
CategoryKubernetes
DescriptionDesignates whether the ReplicaSet is created by a higher-level controller object
Additional Notes

kube_replicaset_owner_kind

Prometheus IDkube_replicaset_owner_kind
Legacy IDkubernetes.replicaSet.owner.kind
OSS KSM IDowner_kind
CategoryKubernetes
DescriptionThe workload resource type of the object that created the ReplicaSet if owned by a higher-level controller object
Additional Notes

kube_replicaset_owner_name

Prometheus IDkube_replicaset_owner_name
Legacy IDkubernetes.replicaSet.owner.name
OSS KSM IDowner_name
CategoryKubernetes
DescriptionThe name of the object that created the ReplicaSet if owned by a higher-level controller object
Additional Notes

kube_replicaset_uid

Prometheus IDkube_replicaset_uid
Legacy IDkubernetes.replicaSet.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the ReplicaSet as retrieved from the API server.
Additional Notes

kube_replicationcontroller_name

Prometheus IDkube_replicationcontroller_name
Legacy IDkubernetes.replicationController.name
OSS KSM IDreplicationcontroller
CategoryKubernetes
DescriptionName of the Replication Controller as retrieved from the API server.
Additional Notes

kube_replicationcontroller_owner_is_controller

Prometheus IDkube_replicationcontroller_owner_is_controller
Legacy IDkubernetes.replicationController.owner.isController
OSS KSM IDowner_is_controller
CategoryKubernetes
DescriptionDesignates whether the Replication Controller is created by a higher-level controller object
Additional Notes

kube_replicationcontroller_owner_kind

Prometheus IDkube_replicationcontroller_owner_kind
Legacy IDkubernetes.replicationController.owner.kind
OSS KSM IDowner_kind
CategoryKubernetes
DescriptionThe workload resource type of the object that created the Replication Controller if owned by a higher-level controller object
Additional Notes

kube_replicationcontroller_owner_name

Prometheus IDkube_replicationcontroller_owner_name
Legacy IDkubernetes.replicationController.owner.name
OSS KSM IDowner_name
CategoryKubernetes
DescriptionThe name of the object that created the Replication Controller if owned by a higher-level controller object
Additional Notes

kube_replicationcontroller_uid

Prometheus IDkube_replicationcontroller_uid
Legacy IDkubernetes.replicationController.uid
OSS KSM ID_uid
CategoryKubernetes
DescriptionUnique ID of the Replication Controller as retrieved from the API server.
Additional Notes

kube_resourcequota_name

Prometheus IDkube_resourcequota_name
Legacy IDkubernetes.resourcequota.name
OSS KSM IDresourcequota
CategoryKubernetes
DescriptionName of the Resource Quota as retrieved from the API server.
Additional Notes

kube_resourcequota_namespace

Prometheus IDkube_resourcequota_namespace
Legacy IDkubernetes.resourcequota.namespace
OSS KSM IDnamespace
CategoryKubernetes
DescriptionNamespace in which the Resource Quota is being enforced
Additional Notes

kube_resourcequota_resource

Prometheus IDkube_resourcequota_resource
Legacy IDkubernetes.resourcequota.resource
OSS KSM IDresource
CategoryKubernetes
DescriptionThe resource and the amount of it in which the Resource Quota is being enforced
Additional Notes

kube_resourcequota_resourcequota

Prometheus IDkube_resourcequota_resourcequota
Legacy IDkubernetes.resourcequota.resourcequota
OSS KSM IDresourcequota
CategoryKubernetes
DescriptionName of the Resource Quota as retrieved from the API server.
Additional Notes

kube_resourcequota_type

Prometheus IDkube_resourcequota_type
Legacy IDkubernetes.resourcequota.type
OSS KSM IDtype
CategoryKubernetes
DescriptionUsed in combination with kube_resourcequota_resource to designate whether the amount is Used or is the Hard limit
Additional Notes

kube_resourcequota_uid

Prometheus IDkube_resourcequota_uid
Legacy IDkubernetes.resourcequota.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Resource Quota as retrieved from the API server.
Additional Notes

kube_service_cluster_ip

Prometheus IDkube_service_cluster_ip
Legacy IDkubernetes.service.clusterIp
OSS KSM IDcluster_ip
CategoryKubernetes
DescriptionThe IP address associated with the Service
Additional Notes

kube_service_name

Prometheus IDkube_service_name
Legacy IDkubernetes.service.name
OSS KSM IDservice
CategoryKubernetes
DescriptionName of the Service as retrieved from the API server.
Additional Notes

kube_service_service_ip

Prometheus IDkube_service_service_ip
Legacy IDkubernetes.service.service.ip
OSS KSM IDservice_ip
CategoryKubernetes
DescriptionThe IP address associated with the Service
Additional Notes

kube_service_uid

Prometheus IDkube_service_uid
Legacy IDkubernetes.service.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Service as retrieved from the API server.
Additional Notes

kube_statefulset_name

Prometheus IDkube_statefulset_name
Legacy IDkubernetes.statefulSet.name
OSS KSM IDstatefulset
CategoryKubernetes
DescriptionName of the StatefulSet as retrieved from the API server.
Additional Notes

kube_statefulset_uid

Prometheus IDkube_statefulset_uid
Legacy IDkubernetes.statefulSet.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the StatefulSet as retrieved from the API server.
Additional Notes

kube_storageclass_name

Prometheus IDkube_storageclass_name
Legacy IDkubernetes.storageclass.name
OSS KSM IDstorageclass
CategoryKubernetes
DescriptionName of the Storage Class as retrieved from the API server.
Additional Notes

provisioner

Prometheus IDprovisioner
Legacy IDkubernetes.storageclass.provisioner
OSS KSM ID-
CategoryKubernetes
DescriptionThe Provisioner of the Storage Class as retrieved from the API server.
Additional Notes

reclaim_policy

Prometheus IDreclaim_policy
Legacy IDkubernetes.storageclass.reclaimPolicy
OSS KSM ID-
CategoryKubernetes
DescriptionThe reclaim policy for the Storage Class as retrieved from the API server.
Additional Notes

kube_storageclass_uid

Prometheus IDkube_storageclass_uid
Legacy IDkubernetes.storageclass.uid
OSS KSM IDuid
CategoryKubernetes
DescriptionUnique ID of the Storage Class as retrieved from the API server.
Additional Notes

volume_binding_mode

Prometheus IDvolume_binding_mode
Legacy IDkubernetes.storageclass.volumeBindingMode
OSS KSM ID-
CategoryKubernetes
DescriptionThe volume binding mode for the Storage Class as retrieved from the API server.
Additional Notes

kube_workload_name

Prometheus IDkube_workload_name
Legacy IDkubernetes.workload.name
OSS KSM IDworkload_name
CategoryKubernetes
DescriptionThe name of the Kubernetes workload resource object
Additional Notes

kube_workload_type

Prometheus IDkube_workload_type
Legacy IDkubernetes.workload.type
OSS KSM IDworkload_type
CategoryKubernetes
DescriptionThe type of the Kubernetes workload resource, for example Deployment, DaemonSet, or Job.
Additional Notes
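
As an illustrative PromQL sketch, the workload labels can be used to roll container metrics up to the owning workload; this assumes the labels are present on the container metric in your environment:
    sum by (kube_workload_type, kube_workload_name) (sysdig_container_memory_used_bytes)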

marathon_app_id

Prometheus IDmarathon_app_id
Legacy IDmarathon.app.id
OSS KSM ID-
CategoryMarathon
Description
Additional Notes

marathon_app_name

Prometheus IDmarathon_app_name
Legacy IDmarathon.app.name
OSS KSM ID-
CategoryMarathon
Description
Additional Notes

marathon_group_id

Prometheus IDmarathon_group_id
Legacy IDmarathon.group.id
OSS KSM ID-
CategoryMarathon
Description
Additional Notes

marathon_group_name

Prometheus IDmarathon_group_name
Legacy IDmarathon.group.name
OSS KSM ID-
CategoryMarathon
Description
Additional Notes

mesos_cluster_id

Prometheus IDmesos_cluster_id
Legacy IDmesos.cluster.id
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_cluster_name

Prometheus IDmesos_cluster_name
Legacy IDmesos.cluster.name
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_framework_id

Prometheus IDmesos_framework_id
Legacy IDmesos.framework.id
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_framework_name

Prometheus IDmesos_framework_name
Legacy IDmesos.framework.name
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_slave_id

Prometheus IDmesos_slave_id
Legacy IDmesos.slave.id
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_slave_name

Prometheus IDmesos_slave_name
Legacy IDmesos.slave.name
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_task_id

Prometheus IDmesos_task_id
Legacy IDmesos.task.id
OSS KSM ID-
CategoryMesos
Description
Additional Notes

mesos_task_name

Prometheus IDmesos_task_name
Legacy IDmesos.task.name
OSS KSM ID-
CategoryMesos
Description
Additional Notes

net_client_ip

Prometheus IDnet_client_ip
Legacy IDnet.client.ip
OSS KSM ID-
CategoryNetwork
DescriptionClient IP address.
Additional Notes

net_http_method

Prometheus IDnet_http_method
Legacy IDnet.http.method
OSS KSM ID-
CategoryNetwork
DescriptionHTTP request method.
Additional Notes

net_http_statuscode

Prometheus IDnet_http_statuscode
Legacy IDnet.http.statusCode
OSS KSM ID-
CategoryNetwork
DescriptionHTTP response status code.
Additional Notes

net_http_url

Prometheus IDnet_http_url
Legacy IDnet.http.url
OSS KSM ID-
CategoryNetwork
DescriptionURL from an HTTP request.
Additional Notes

net_local_endpoint

Prometheus IDnet_local_endpoint
Legacy IDnet.local.endpoint
OSS KSM ID-
CategoryNetwork
DescriptionIP address of a local node.
Additional Notes

net_local_service

Prometheus IDnet_local_service
Legacy IDnet.local.service
OSS KSM ID-
CategoryNetwork
DescriptionService (port number) of a local node.
Additional Notes

net_mongodb_collection

Prometheus IDnet_mongodb_collection
Legacy IDnet.mongodb.collection
OSS KSM ID-
CategoryNetwork
DescriptionMongoDB collection.
Additional Notes

net_mongodb_operation

Prometheus IDnet_mongodb_operation
Legacy IDnet.mongodb.operation
OSS KSM ID-
CategoryNetwork
DescriptionMongoDB operation.
Additional Notes

net_protocol

Prometheus IDnet_protocol
Legacy IDnet.protocol
OSS KSM ID-
CategoryNetwork
DescriptionThe network protocol of a request (e.g. HTTP, MySQL).
Additional Notes

net_remote_endpoint

Prometheus IDnet_remote_endpoint
Legacy IDnet.remote.endpoint
OSS KSM ID-
CategoryNetwork
DescriptionIP address of a remote node.
Additional Notes

net_remote_service

Prometheus IDnet_remote_service
Legacy IDnet.remote.service
OSS KSM ID-
CategoryNetwork
DescriptionService (port number) of a remote node.
Additional Notes

net_server_ip

Prometheus IDnet_server_ip
Legacy IDnet.server.ip
OSS KSM ID-
CategoryNetwork
DescriptionServer IP address.
Additional Notes

net_server_port

Prometheus IDnet_server_port
Legacy IDnet.server.port
OSS KSM ID-
CategoryNetwork
DescriptionTCP/UDP Server Port number.
Additional Notes

net_sql_query

Prometheus IDnet_sql_query
Legacy IDnet.sql.query
OSS KSM ID-
CategoryNetwork
DescriptionThe full SQL query.
Additional Notes

net_sql_querytype

Prometheus IDnet_sql_querytype
Legacy IDnet.sql.query.type
OSS KSM ID-
CategoryNetwork
DescriptionSQL query type (SELECT, INSERT, DELETE, etc.).
Additional Notes

net_sql_table

Prometheus IDnet_sql_table
Legacy IDnet.sql.table
OSS KSM ID-
CategoryNetwork
DescriptionSQL query table name.
Additional Notes

program_name

Prometheus IDprogram_name
Legacy IDproc.client.name
OSS KSM ID-
CategoryProgram
DescriptionName of the Client process.
Additional Notes

program_cmd_line

Prometheus IDprogram_cmd_line
Legacy IDproc.commandLine
OSS KSM ID-
CategoryProgram
DescriptionCommand line used to start the process.
Additional Notes

program_name

Prometheus IDprogram_name
Legacy IDproc.name
OSS KSM ID-
CategoryProgram
DescriptionName of the process.
Additional Notes

program_name

Prometheus IDprogram_name
Legacy IDproc.server.name
OSS KSM ID-
CategoryProgram
DescriptionName of the server process.
Additional Notes

swarm_cluster_id

Prometheus IDswarm_cluster_id
Legacy IDswarm.cluster.id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_cluster_name

Prometheus IDswarm_cluster_name
Legacy IDswarm.cluster.name
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_manager_reachability

Prometheus IDswarm_manager_reachability
Legacy IDswarm.manager.reachability
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_availability

Prometheus IDswarm_node_availability
Legacy IDswarm.node.availability
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_id

Prometheus IDswarm_node_id
Legacy IDswarm.node.id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_ip_address

Prometheus IDswarm_node_ip_address
Legacy IDswarm.node.ip_address
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_name

Prometheus IDswarm_node_name
Legacy IDswarm.node.name
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_role

Prometheus IDswarm_node_role
Legacy IDswarm.node.role
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_state

Prometheus IDswarm_node_state
Legacy IDswarm.node.state
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_node_version

Prometheus IDswarm_node_version
Legacy IDswarm.node.version
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_service_id

Prometheus IDswarm_service_id
Legacy IDswarm.service.id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_service_name

Prometheus IDswarm_service_name
Legacy IDswarm.service.name
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_container_id

Prometheus IDswarm_task_container_id
Legacy IDswarm.task.container_id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_id

Prometheus IDswarm_task_id
Legacy IDswarm.task.id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_name

Prometheus IDswarm_task_name
Legacy IDswarm.task.name
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_node_id

Prometheus IDswarm_task_node_id
Legacy IDswarm.task.node_id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_service_id

Prometheus IDswarm_task_service_id
Legacy IDswarm.task.service_id
OSS KSM ID-
CategorySwarm
Description
Additional Notes

swarm_task_state

Prometheus IDswarm_task_state
Legacy IDswarm.task.state
OSS KSM ID-
CategorySwarm
Description
Additional Notes

4.10.2.4 - File

sysdig_filestats_host_file_error_total_count

Prometheus IDsysdig_filestats_host_file_error_total_count
Legacy IDfile.error.total.count
Metric Typecounter
Unitnumber
DescriptionThe number of errors caused by file access.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_filestats_host_file_in_bytes

Prometheus IDsysdig_filestats_host_file_in_bytes
Legacy IDfile.bytes.in
Metric Typecounter
Unitdata
DescriptionAmount of bytes read from file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_filestats_host_file_open_count

Prometheus IDsysdig_filestats_host_file_open_count
Legacy IDfile.open.count
Metric Typecounter
Unitnumber
DescriptionNumber of times the file has been opened.
Additional Notes

sysdig_filestats_host_file_out_bytes

Prometheus IDsysdig_filestats_host_file_out_bytes
Legacy IDfile.bytes.out
Metric Typecounter
Unitdata
DescriptionAmount of bytes written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_filestats_host_file_total_bytes

Prometheus IDsysdig_filestats_host_file_total_bytes
Legacy IDfile.bytes.total
Metric Typecounter
Unitdata
DescriptionAmount of bytes read from and written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_filestats_host_file_total_time

Prometheus IDsysdig_filestats_host_file_total_time
Legacy IDfile.time.total
Metric Typecounter
Unittime
DescriptionTime spent in file I/O.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_fs_free_bytes

Prometheus IDsysdig_fs_free_bytes
Legacy IDfs.bytes.free
Metric Typegauge
Unitdata
DescriptionFilesystem available space.
Additional Notes

sysdig_fs_free_percent

Prometheus IDsysdig_fs_free_percent
Legacy IDfs.free.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of filesystem free space.
Additional Notes

sysdig_fs_inodes_total_count

Prometheus IDsysdig_fs_inodes_total_count
Legacy IDfs.inodes.total.count
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_fs_inodes_used_count

Prometheus IDsysdig_fs_inodes_used_count
Legacy IDfs.inodes.used.count
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_fs_inodes_used_percent

Prometheus IDsysdig_fs_inodes_used_percent
Legacy IDfs.inodes.used.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_fs_total_bytes

Prometheus IDsysdig_fs_total_bytes
Legacy IDfs.bytes.total
Metric Typegauge
Unitdata
DescriptionFilesystem size.
Additional Notes

sysdig_fs_used_bytes

Prometheus IDsysdig_fs_used_bytes
Legacy IDfs.bytes.used
Metric Typegauge
Unitdata
DescriptionFilesystem used space.
Additional Notes

sysdig_fs_used_percent

Prometheus IDsysdig_fs_used_percent
Legacy IDfs.used.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of the sum of all filesystems in use.
Additional Notes

4.10.2.5 - Host

sysdig_host_container_count

Prometheus IDsysdig_host_container_count
Legacy IDcontainer.count
Metric Typegauge
Unitnumber
DescriptionThe number of containers.
Additional NotesThis metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) containers of a certain type in a certain group or node - try segmenting by container.image, .id or .name. See also: host.count.

sysdig_host_container_start_count

Prometheus IDsysdig_host_container_start_count
Legacy IDhost.container.start.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_count

Prometheus IDsysdig_host_count
Legacy IDhost.count
Metric Typegauge
Unitnumber
DescriptionThe number of hosts.
Additional NotesThis metric is perfect for dashboards and alerts. In particular, you can create alerts that notify you when you have too many (or too few) machines of a certain type in a certain group - try segment by tag or hostname. See also: container.count.

sysdig_host_cpu_cores_used

Prometheus IDsysdig_host_cpu_cores_used
Legacy IDcpu.cores.used
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_cpu_cores_used_percent

Prometheus IDsysdig_host_cpu_cores_used_percent
Legacy IDcpu.cores.used.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpu_idle_percent

Prometheus IDsysdig_host_cpu_idle_percent
Legacy IDcpu.idle.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_cpu_iowait_percent

Prometheus IDsysdig_host_cpu_iowait_percent
Legacy IDcpu.iowait.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_cpu_nice_percent

Prometheus IDsysdig_host_cpu_nice_percent
Legacy IDcpu.nice.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of CPU utilization that occurred while executing at the user level with nice priority.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_cpu_stolen_percent

Prometheus IDsysdig_host_cpu_stolen_percent
Legacy IDcpu.stolen.percent
Metric Typegauge
Unitpercent
DescriptionCPU steal time is a measure of the percent of time that a virtual machine’s CPU is in a state of involuntary wait due to the fact that the physical CPU is shared among virtual machines. In calculating steal time, the operating system kernel detects when it has work available but does not have access to the physical CPU to perform that work.
Additional NotesIf the percent of steal time is consistently high, you may want to stop and restart the instance (since it will most likely start on different physical hardware) or upgrade to a virtual machine with more CPU power. Also see the metric ‘capacity total percent’ to see how steal time directly impacts the number of server requests that could not be handled. On AWS EC2, steal time does not depend on the activity of other virtual machine neighbours. EC2 is simply making sure your instance is not using more CPU cycles than paid for.

sysdig_host_cpu_system_percent

Prometheus IDsysdig_host_cpu_system_percent
Legacy IDcpu.system.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of CPU utilization that occurred while executing at the system level (kernel).
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_cpu_used_percent

Prometheus IDsysdig_host_cpu_used_percent
Legacy IDcpu.used.percent
Metric Typegauge
Unitpercent
DescriptionThe CPU usage for each container is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the container or processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and processes.
Additional Notes
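
To make the normalization described above concrete, here is a minimal Python sketch of the arithmetic only (the core counts are illustrative placeholders; in practice the agent derives the real figures from cgroups):

```python
# Minimal sketch of the normalization described for sysdig_host_cpu_used_percent.
# The inputs (cores used by the container vs. cores on the host) are illustrative;
# the agent collects the real values from cgroups.

def normalized_cpu_percent(cores_used: float, host_cores: int) -> float:
    """Express container CPU usage as a percentage of the whole host."""
    return cores_used / host_cores * 100.0

# Example from the description: 2 cores busy on a 6-core host.
print(f"{normalized_cpu_percent(2, 6):.2f}%")  # prints 33.33%
```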

sysdig_host_cpu_user_percent

Prometheus IDsysdig_host_cpu_user_percent
Legacy IDcpu.user.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of CPU utilization that occurred while executing at the user level (application).
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_cpucore_idle_percent

Prometheus IDsysdig_host_cpucore_idle_percent
Legacy IDcpucore.idle.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_iowait_percent

Prometheus IDsysdig_host_cpucore_iowait_percent
Legacy IDcpucore.iowait.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_nice_percent

Prometheus IDsysdig_host_cpucore_nice_percent
Legacy IDcpucore.nice.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_stolen_percent

Prometheus IDsysdig_host_cpucore_stolen_percent
Legacy IDcpucore.stolen.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_system_percent

Prometheus IDsysdig_host_cpucore_system_percent
Legacy IDcpucore.system.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_used_percent

Prometheus IDsysdig_host_cpucore_used_percent
Legacy IDcpucore.used.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_cpucore_user_percent

Prometheus IDsysdig_host_cpucore_user_percent
Legacy IDcpucore.user.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_fd_used_percent

Prometheus IDsysdig_host_fd_used_percent
Legacy IDfd.used.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of used file descriptors out of the maximum available.
Additional NotesUsually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better use for alerts.
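
As a rough illustration of using this metric for alerting, the following Python sketch queries it through a Prometheus-compatible HTTP API and prints any series above a threshold. The endpoint URL is an assumption (a plain Prometheus server is shown); the actual URL, authentication, and identifying labels depend on your environment:

```python
# Minimal sketch: flag series of sysdig_host_fd_used_percent above a threshold
# via the standard Prometheus HTTP query API. Endpoint and threshold are assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus-compatible endpoint
THRESHOLD = 90.0                                 # percent of the FD limit

resp = requests.get(PROM_URL, params={"query": "sysdig_host_fd_used_percent"})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]          # identifying labels depend on your setup
    value = float(series["value"][1])  # instant vector sample: [timestamp, "value"]
    if value > THRESHOLD:
        print(f"High FD usage ({value:.1f}%): {labels}")
```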

sysdig_host_file_error_open_count

Prometheus IDsysdig_host_file_error_open_count
Legacy IDfile.error.open.count
Metric Typecounter
Unitnumber
DescriptionNumber of errors in opening files.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_error_total_count

Prometheus IDsysdig_host_file_error_total_count
Legacy IDfile.error.total.count
Metric Typecounter
Unitnumber
DescriptionNumber of errors caused by file access.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_in_bytes

Prometheus IDsysdig_host_file_in_bytes
Legacy IDfile.bytes.in
Metric Typecounter
Unitdata
DescriptionAmount of bytes read from file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_in_iops

Prometheus IDsysdig_host_file_in_iops
Legacy IDfile.iops.in
Metric Typecounter
Unitnumber
DescriptionNumber of file read operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_host_file_in_time

Prometheus IDsysdig_host_file_in_time
Legacy IDfile.time.in
Metric Typecounter
Unittime
DescriptionTime spent in file reading.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_open_count

Prometheus IDsysdig_host_file_open_count
Legacy IDfile.open.count
Metric Typecounter
Unitnumber
DescriptionNumber of times the file has been opened.
Additional Notes

sysdig_host_file_out_bytes

Prometheus IDsysdig_host_file_out_bytes
Legacy IDfile.bytes.out
Metric Typecounter
Unitdata
DescriptionAmount of bytes written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_out_iops

Prometheus IDsysdig_host_file_out_iops
Legacy IDfile.iops.out
Metric Typecounter
Unitnumber
DescriptionNumber of file write operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_host_file_out_time

Prometheus IDsysdig_host_file_out_time
Legacy IDfile.time.out
Metric Typecounter
Unittime
DescriptionTime spent in file writing.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_total_bytes

Prometheus IDsysdig_host_file_total_bytes
Legacy IDfile.bytes.total
Metric Typecounter
Unitdata
DescriptionAmount of bytes read from and written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_file_total_iops

Prometheus IDsysdig_host_file_total_iops
Legacy IDfile.iops.total
Metric Typecounter
Unitnumber
DescriptionNumber of read and write file operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_host_file_total_time

Prometheus IDsysdig_host_file_total_time
Legacy IDfile.time.total
Metric Typecounter
Unittime
DescriptionTime spent in file I/O.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_fs_free_bytes

Prometheus IDsysdig_host_fs_free_bytes
Legacy IDfs.bytes.free
Metric Typegauge
Unitdata
DescriptionFilesystem available space.
Additional Notes

sysdig_host_fs_free_percent

Prometheus IDsysdig_host_fs_free_percent
Legacy IDfs.free.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of filesystem free space.
Additional Notes

sysdig_host_fs_inodes_total_count

Prometheus IDsysdig_host_fs_inodes_total_count
Legacy IDfs.inodes.total.count
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_fs_inodes_used_count

Prometheus IDsysdig_host_fs_inodes_used_count
Legacy IDfs.inodes.used.count
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_fs_inodes_used_percent

Prometheus IDsysdig_host_fs_inodes_used_percent
Legacy IDfs.inodes.used.percent
Metric Typegauge
Unitpercent
Description
Additional Notes

sysdig_host_fs_largest_used_percent

Prometheus IDsysdig_host_fs_largest_used_percent
Legacy IDfs.largest.used.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of the largest filesystem in use.
Additional Notes

sysdig_host_fs_root_used_percent

Prometheus IDsysdig_host_fs_root_used_percent
Legacy IDfs.root.used.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of the root filesystem in use.
Additional Notes

sysdig_host_fs_total_bytes

Prometheus IDsysdig_host_fs_total_bytes
Legacy IDfs.bytes.total
Metric Typegauge
Unitdata
DescriptionFilesystem size.
Additional Notes

sysdig_host_fs_used_bytes

Prometheus IDsysdig_host_fs_used_bytes
Legacy IDfs.bytes.used
Metric Typegauge
Unitdata
DescriptionFilesystem used space.
Additional Notes

sysdig_host_fs_used_percent

Prometheus IDsysdig_host_fs_used_percent
Legacy IDfs.used.percent
Metric Typegauge
Unitpercent
DescriptionPercentage of the sum of all filesystems in use.
Additional Notes

sysdig_host_info

Prometheus IDsysdig_host_info
Legacy IDinfo
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_load_average_15m

Prometheus IDsysdig_host_load_average_15m
Legacy IDload.average.15m
Metric Typegauge
Unitnumber
DescriptionThe 15 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 15 minutes across all cores. The value should correspond to the third (and last) load average value displayed by the ‘uptime’ command.
Additional Notes

sysdig_host_load_average_1m

Prometheus IDsysdig_host_load_average_1m
Legacy IDload.average.1m
Metric Typegauge
Unitnumber
DescriptionThe 1 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 1 minute across all cores. The value should correspond to the first (of three) load average values displayed by the ‘uptime’ command.
Additional Notes

sysdig_host_load_average_5m

Prometheus IDsysdig_host_load_average_5m
Legacy IDload.average.5m
Metric Typegauge
Unitnumber
DescriptionThe 5 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 5 minutes across all cores. The value should correspond to the second (of three) load average values displayed by the ‘uptime’ command.
Additional Notes

sysdig_host_load_average_percpu_15m

Prometheus IDsysdig_host_load_average_percpu_15m
Legacy IDload.average.percpu.15m
Metric Typegauge
Unitnumber
DescriptionThe 15 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 15 minutes, divided by the number of system CPUs.
Additional Notes

sysdig_host_load_average_percpu_1m

Prometheus IDsysdig_host_load_average_percpu_1m
Legacy IDload.average.percpu.1m
Metric Typegauge
Unitnumber
DescriptionThe 1 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 1 minute, divided by the number of system CPUs.
Additional Notes

sysdig_host_load_average_percpu_5m

Prometheus IDsysdig_host_load_average_percpu_5m
Legacy IDload.average.percpu.5m
Metric Typegauge
Unitnumber
DescriptionThe 5 minute system load average represents the average number of jobs in (1) the CPU run queue or (2) waiting for disk I/O, averaged over 5 minutes, divided by the number of system CPUs.
Additional Notes
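
The per-CPU load averages above are plain divisions of the system load averages by the CPU count. A minimal, Unix-only Python sketch of that arithmetic (purely illustrative; the agent reports these values itself):

```python
# Minimal sketch of the per-CPU load average calculation (Unix-only:
# os.getloadavg is not available on Windows).
import os

load_1m, load_5m, load_15m = os.getloadavg()  # same values the 'uptime' command prints
cpus = os.cpu_count() or 1

for label, value in (("1m", load_1m), ("5m", load_5m), ("15m", load_15m)):
    print(f"load average ({label}): {value:.2f}  per CPU: {value / cpus:.2f}")
```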

sysdig_host_memory_available_bytes

Prometheus IDsysdig_host_memory_available_bytes
Legacy IDmemory.bytes.available
Metric Typegauge
Unitdata
DescriptionThe available memory for a host is obtained from /proc/meminfo. For environments using Linux kernel version 3.12 and later, the available memory is obtained from the MemAvailable field in /proc/meminfo. For environments using earlier kernel versions, the formula is MemFree + Cached + Buffers.
Additional Notes

sysdig_host_memory_swap_available_bytes

Prometheus IDsysdig_host_memory_swap_available_bytes
Legacy IDmemory.swap.bytes.available
Metric Typegauge
Unitdata
DescriptionAvailable amount of swap memory.
Additional NotesSum of free and cached swap memory. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_memory_swap_total_bytes

Prometheus IDsysdig_host_memory_swap_total_bytes
Legacy IDmemory.swap.bytes.total
Metric Typegauge
Unitdata
DescriptionTotal amount of swap memory.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_memory_swap_used_bytes

Prometheus IDsysdig_host_memory_swap_used_bytes
Legacy IDmemory.swap.bytes.used
Metric Typegauge
Unitdata
DescriptionUsed amount of swap memory.
Additional NotesThe amount of used swap memory is calculated by subtracting available from total swap memory. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_memory_swap_used_percent

Prometheus IDsysdig_host_memory_swap_used_percent
Legacy IDmemory.swap.used.percent
Metric Typegauge
Unitpercent
DescriptionUsed percent of swap memory.
Additional NotesThe percentage of used swap memory is calculated as the ratio of used to total swap memory, expressed as a percentage. By default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.
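
The formulas described in the memory and swap entries above (available memory from /proc/meminfo, swap available as free plus cached, swap used as total minus available) can be reproduced with a short Linux-only Python sketch. The agent computes these values itself, so this is purely illustrative:

```python
# Minimal sketch of the memory/swap formulas described above, read from /proc/meminfo.
# Linux only; the fields of interest are reported in kB.

def read_meminfo(path: str = "/proc/meminfo") -> dict:
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0]) * 1024  # kB -> bytes
    return info

m = read_meminfo()

# Available memory: MemAvailable on recent kernels, otherwise the fallback formula.
available = m.get("MemAvailable", m["MemFree"] + m["Cached"] + m["Buffers"])

# Swap: available = free + cached, used = total - available.
swap_available = m["SwapFree"] + m["SwapCached"]
swap_used = m["SwapTotal"] - swap_available
swap_used_percent = 100.0 * swap_used / m["SwapTotal"] if m["SwapTotal"] else 0.0

print(f"memory available: {available} B, swap used: {swap_used_percent:.1f}%")
```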

sysdig_host_memory_total_bytes

Prometheus IDsysdig_host_memory_total_bytes
Legacy IDmemory.bytes.total
Metric Typegauge
Unitdata
DescriptionThe total memory of a host, in bytes. This value is obtained from /proc.
Additional Notes

sysdig_host_memory_used_bytes

Prometheus IDsysdig_host_memory_used_bytes
Legacy IDmemory.bytes.used
Metric Typegauge
Unitdata
DescriptionThe amount of physical memory currently in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_host_memory_used_percent

Prometheus IDsysdig_host_memory_used_percent
Legacy IDmemory.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of physical memory in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_memory_virtual_bytes

Prometheus IDsysdig_host_memory_virtual_bytes
Legacy IDmemory.bytes.virtual
Metric Typegauge
Unitdata
DescriptionThe virtual memory size of the process, in bytes. This value is obtained from Sysdig events.
Additional Notes

sysdig_host_net_connection_in_count

Prometheus IDsysdig_host_net_connection_in_count
Legacy IDnet.connection.count.in
Metric Typecounter
Unitnumber
DescriptionNumber of currently established client (inbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_host_net_connection_out_count

Prometheus IDsysdig_host_net_connection_out_count
Legacy IDnet.connection.count.out
Metric Typecounter
Unitnumber
DescriptionNumber of currently established server (outbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_host_net_connection_total_count

Prometheus IDsysdig_host_net_connection_total_count
Legacy IDnet.connection.count.total
Metric Typecounter
Unitnumber
DescriptionNumber of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_host_net_error_count

Prometheus IDsysdig_host_net_error_count
Legacy IDnet.error.count
Metric Typecounter
Unitnumber
DescriptionNumber of network errors.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_net_http_error_count

Prometheus IDsysdig_host_net_http_error_count
Legacy IDnet.http.error.count
Metric Typecounter
Unitnumber
DescriptionNumber of failed HTTP requests as counted from 4xx/5xx status codes.
Additional Notes

sysdig_host_net_http_request_count

Prometheus IDsysdig_host_net_http_request_count
Legacy IDnet.http.request.count
Metric Typecounter
Unitnumber
DescriptionCount of HTTP requests.
Additional Notes

sysdig_host_net_http_request_time

Prometheus IDsysdig_host_net_http_request_time
Legacy IDnet.http.request.time
Metric Typecounter
Unittime
DescriptionAverage time for HTTP requests.
Additional Notes

sysdig_host_net_http_statuscode_error_count

Prometheus IDsysdig_host_net_http_statuscode_error_count
Legacy IDnet.http.statuscode.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_http_statuscode_request_count

Prometheus IDsysdig_host_net_http_statuscode_request_count
Legacy IDnet.http.statuscode.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_http_url_error_count

Prometheus IDsysdig_host_net_http_url_error_count
Legacy IDnet.http.url.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_http_url_request_count

Prometheus IDsysdig_host_net_http_url_request_count
Legacy IDnet.http.url.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_http_url_request_time

Prometheus IDsysdig_host_net_http_url_request_time
Legacy IDnet.http.url.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_mongodb_collection_error_count

Prometheus IDsysdig_host_net_mongodb_collection_error_count
Legacy IDnet.mongodb.collection.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_collection_request_count

Prometheus IDsysdig_host_net_mongodb_collection_request_count
Legacy IDnet.mongodb.collection.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_collection_request_time

Prometheus IDsysdig_host_net_mongodb_collection_request_time
Legacy IDnet.mongodb.collection.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_mongodb_error_count

Prometheus IDsysdig_host_net_mongodb_error_count
Legacy IDnet.mongodb.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_operation_error_count

Prometheus IDsysdig_host_net_mongodb_operation_error_count
Legacy IDnet.mongodb.operation.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_operation_request_count

Prometheus IDsysdig_host_net_mongodb_operation_request_count
Legacy IDnet.mongodb.operation.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_operation_request_time

Prometheus IDsysdig_host_net_mongodb_operation_request_time
Legacy IDnet.mongodb.operation.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_mongodb_request_count

Prometheus IDsysdig_host_net_mongodb_request_count
Legacy IDnet.mongodb.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_mongodb_request_time

Prometheus IDsysdig_host_net_mongodb_request_time
Legacy IDnet.mongodb.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_in_bytes

Prometheus IDsysdig_host_net_in_bytes
Legacy IDnet.bytes.in
Metric Typecounter
Unitdata
DescriptionInbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_net_out_bytes

Prometheus IDsysdig_host_net_out_bytes
Legacy IDnet.bytes.out
Metric Typecounter
Unitdata
DescriptionOutbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_net_request_count

Prometheus IDsysdig_host_net_request_count
Legacy IDnet.request.count
Metric Typecounter
Unitnumber
DescriptionTotal number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.
Additional Notes

sysdig_host_net_request_in_count

Prometheus IDsysdig_host_net_request_in_count
Legacy IDnet.request.count.in
Metric Typecounter
Unitnumber
DescriptionNumber of inbound network requests.
Additional Notes

sysdig_host_net_request_in_time

Prometheus IDsysdig_host_net_request_in_time
Legacy IDnet.request.time.in
Metric Typecounter
Unittime
DescriptionAverage time to serve an inbound request.
Additional Notes

sysdig_host_net_request_out_count

Prometheus IDsysdig_host_net_request_out_count
Legacy IDnet.request.count.out
Metric Typecounter
Unitnumber
DescriptionNumber of outbound network requests.
Additional Notes

sysdig_host_net_request_out_time

Prometheus IDsysdig_host_net_request_out_time
Legacy IDnet.request.time.out
Metric Typecounter
Unittime
DescriptionAverage time spent waiting for an outbound request.
Additional Notes

sysdig_host_net_request_time

Prometheus IDsysdig_host_net_request_time
Legacy IDnet.request.time
Metric Typecounter
Unittime
DescriptionAverage time to serve a network request.
Additional Notes

sysdig_host_net_server_connection_in_count

Prometheus IDsysdig_host_net_server_connection_in_count
Legacy IDnet.server.connection.count.in
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_server_in_bytes

Prometheus IDsysdig_host_net_server_in_bytes
Legacy IDnet.server.bytes.in
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_host_net_server_out_bytes

Prometheus IDsysdig_host_net_server_out_bytes
Legacy IDnet.server.bytes.out
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_host_net_server_total_bytes

Prometheus IDsysdig_host_net_server_total_bytes
Legacy IDnet.server.bytes.total
Metric Typecounter
Unitdata
Description
Additional Notes

sysdig_host_net_sql_error_count

Prometheus IDsysdig_host_net_sql_error_count
Legacy IDnet.sql.error.count
Metric Typecounter
Unitnumber
DescriptionNumber of failed SQL requests.
Additional Notes

sysdig_host_net_sql_query_error_count

Prometheus IDsysdig_host_net_sql_query_error_count
Legacy IDnet.sql.query.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_query_request_count

Prometheus IDsysdig_host_net_sql_query_request_count
Legacy IDnet.sql.query.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_query_request_time

Prometheus IDsysdig_host_net_sql_query_request_time
Legacy IDnet.sql.query.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_sql_querytype_error_count

Prometheus IDsysdig_host_net_sql_querytype_error_count
Legacy IDnet.sql.querytype.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_querytype_request_count

Prometheus IDsysdig_host_net_sql_querytype_request_count
Legacy IDnet.sql.querytype.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_querytype_request_time

Prometheus IDsysdig_host_net_sql_querytype_request_time
Legacy IDnet.sql.querytype.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_sql_request_count

Prometheus IDsysdig_host_net_sql_request_count
Legacy IDnet.sql.request.count
Metric Typecounter
Unitnumber
DescriptionNumber of SQL requests.
Additional Notes

sysdig_host_net_sql_request_time

Prometheus IDsysdig_host_net_sql_request_time
Legacy IDnet.sql.request.time
Metric Typecounter
Unittime
DescriptionAverage time to complete a SQL request.
Additional Notes

sysdig_host_net_sql_table_error_count

Prometheus IDsysdig_host_net_sql_table_error_count
Legacy IDnet.sql.table.error.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_table_request_count

Prometheus IDsysdig_host_net_sql_table_request_count
Legacy IDnet.sql.table.request.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_net_sql_table_request_time

Prometheus IDsysdig_host_net_sql_table_request_time
Legacy IDnet.sql.table.request.time
Metric Typecounter
Unittime
Description
Additional Notes

sysdig_host_net_tcp_queue_len

Prometheus IDsysdig_host_net_tcp_queue_len
Legacy IDnet.tcp.queue.len
Metric Typecounter
Unitnumber
DescriptionLength of the TCP request queue.
Additional Notes

sysdig_host_net_total_bytes

Prometheus IDsysdig_host_net_total_bytes
Legacy IDnet.bytes.total
Metric Typecounter
Unitdata
DescriptionTotal network bytes, inbound and outbound.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_proc_count

Prometheus IDsysdig_host_proc_count
Legacy IDproc.count
Metric Typecounter
Unitnumber
DescriptionNumber of processes on host or container.
Additional Notes

sysdig_host_syscall_count

Prometheus IDsysdig_host_syscall_count
Legacy IDsyscall.count
Metric Typegauge
Unitnumber
DescriptionTotal number of syscalls seen.
Additional NotesSyscalls are resource intensive. This metric tracks how many have been made by a given process or container.

sysdig_host_syscall_error_count

Prometheus IDsysdig_host_syscall_error_count
Legacy IDhost.error.count
Metric Typecounter
Unitnumber
DescriptionNumber of system call errors.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_host_system_uptime

Prometheus IDsysdig_host_system_uptime
Legacy IDsystem.uptime
Metric Typegauge
Unittime
DescriptionThis metric is sent by the agent and represents the number of seconds since host boot time. It is not available with container granularity.
Additional Notes

sysdig_host_thread_count

Prometheus IDsysdig_host_thread_count
Legacy IDthread.count
Metric Typecounter
Unitnumber
Description
Additional Notes

sysdig_host_timeseries_count_appcheck

Prometheus IDsysdig_host_timeseries_count_appcheck
Legacy IDmetricCount.appCheck
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_timeseries_count_jmx

Prometheus IDsysdig_host_timeseries_count_jmx
Legacy IDmetricCount.jmx
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_timeseries_count_prometheus

Prometheus IDsysdig_host_timeseries_count_prometheus
Legacy IDmetricCount.prometheus
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_timeseries_count_statsd

Prometheus IDsysdig_host_timeseries_count_statsd
Legacy IDmetricCount.statsd
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_host_up

Prometheus IDsysdig_host_up
Legacy IDuptime
Metric Typegauge
Unitnumber
DescriptionThe percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.
Additional Notes

4.10.2.6 - JMX

jmx_jvm_class_loaded

Prometheus IDjmx_jvm_class_loaded
Legacy IDjvm.class.loaded
Metric Typegauge
Unitnumber
DescriptionThe number of classes that are currently loaded in the JVM.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_class_unloaded

Prometheus IDjmx_jvm_class_unloaded
Legacy IDjvm.class.unloaded
Metric Typegauge
Unitnumber
Description
Additional Notes

jmx_jvm_gc_ConcurrentMarkSweep_count

Prometheus IDjmx_jvm_gc_ConcurrentMarkSweep_count
Legacy IDjvm.gc.ConcurrentMarkSweep.count
Metric Typecounter
Unitnumber
DescriptionThe number of times the Concurrent Mark-Sweep garbage collector has run.
Additional Notes

jmx_jvm_gc_ConcurrentMarkSweep_time

Prometheus IDjmx_jvm_gc_ConcurrentMarkSweep_time
Legacy IDjvm.gc.ConcurrentMarkSweep.time
Metric Typecounter
Unittime
DescriptionThe amount of time the Concurrent Mark-Sweep garbage collector has run.
Additional Notes

jmx_jvm_gc_Copy_count

Prometheus IDjmx_jvm_gc_Copy_count
Legacy IDjvm.gc.Copy.count
Metric Typecounter
Unitnumber
Description
Additional Notes

jmx_jvm_gc_Copy_time

Prometheus IDjmx_jvm_gc_Copy_time
Legacy IDjvm.gc.Copy.time
Metric Typecounter
Unittime
Description
Additional Notes

jmx_jvm_gc_G1_Old_Generation_count

Prometheus IDjmx_jvm_gc_G1_Old_Generation_count
Legacy IDjvm.gc.G1_Old_Generation.count
Metric Typecounter
Unitnumber
Description
Additional Notes

jmx_jvm_gc_G1_Old_Generation_time

Prometheus IDjmx_jvm_gc_G1_Old_Generation_time
Legacy IDjvm.gc.G1_Old_Generation.time
Metric Typecounter
Unittime
Description
Additional Notes

jmx_jvm_gc_G1_Young_Generation_count

Prometheus IDjmx_jvm_gc_G1_Young_Generation_count
Legacy IDjvm.gc.G1_Young_Generation.count
Metric Typecounter
Unitnumber
Description
Additional Notes

jmx_jvm_gc_G1_Young_Generation_time

Prometheus IDjmx_jvm_gc_G1_Young_Generation_time
Legacy IDjvm.gc.G1_Young_Generation.time
Metric Typecounter
Unittime
Description
Additional Notes

jmx_jvm_gc_MarkSweepCompact_count

Prometheus IDjmx_jvm_gc_MarkSweepCompact_count
Legacy IDjvm.gc.MarkSweepCompact.count
Metric Typecounter
Unitnumber
Description
Additional Notes

jmx_jvm_gc_MarkSweepCompact_time

Prometheus IDjmx_jvm_gc_MarkSweepCompact_time
Legacy IDjvm.gc.MarkSweepCompact.time
Metric Typecounter
Unittime
Description
Additional Notes

jmx_jvm_gc_PS_MarkSweep_count

Prometheus IDjmx_jvm_gc_PS_MarkSweep_count
Legacy IDjvm.gc.PS_MarkSweep.count
Metric Typecounter
Unitnumber
DescriptionThe number of times the parallel scavenge Mark-Sweep old generation garbage collector has run.
Additional Notes

jmx_jvm_gc_PS_MarkSweep_time

Prometheus IDjmx_jvm_gc_PS_MarkSweep_time
Legacy IDjvm.gc.PS_MarkSweep.time
Metric Typecounter
Unittime
DescriptionThe amount of time the parallel scavenge Mark-Sweep old generation garbage collector has run.
Additional Notes

jmx_jvm_gc_PS_Scavenge_count

Prometheus IDjmx_jvm_gc_PS_Scavenge_count
Legacy IDjvm.gc.PS_Scavenge.count
Metric Typecounter
Unitnumber
DescriptionThe number of times the parallel eden/survivor space garbage collector has run.
Additional Notes

jmx_jvm_gc_PS_Scavenge_time

Prometheus IDjmx_jvm_gc_PS_Scavenge_time
Legacy IDjvm.gc.PS_Scavenge.time
Metric Typecounter
Unittime
DescriptionThe amount of time the parallel eden/survivor space garbage collector has run.
Additional Notes

jmx_jvm_gc_ParNew_count

Prometheus IDjmx_jvm_gc_ParNew_count
Legacy IDjvm.gc.ParNew.count
Metric Typecounter
Unitnumber
DescriptionThe number of times the parallel garbage collector has run.
Additional Notes

jmx_jvm_gc_ParNew_time

Prometheus IDjmx_jvm_gc_ParNew_time
Legacy IDjvm.gc.ParNew.time
Metric Typecounter
Unittime
DescriptionThe amount of time the parallel garbage collector has run.
Additional Notes

jmx_jvm_heap_committed

Prometheus IDjmx_jvm_heap_committed
Legacy IDjvm.heap.committed
Metric Typecounter
Unitnumber
DescriptionThe amount of memory that is currently allocated to the JVM for heap memory. Heap memory is the storage area for Java objects. The JVM may release memory to the system and Heap Committed could decrease below Heap Init; but Heap Committed can never increase above Heap Max.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_heap_init

Prometheus IDjmx_jvm_heap_init
Legacy IDjvm.heap.init
Metric Typecounter
Unitnumber
DescriptionThe initial amount of memory that the JVM requests from the operating system for heap memory during startup (defined by the -Xms option). The JVM may request additional memory from the operating system and may also release memory to the system over time. The value of Heap Init may be undefined.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_heap_max

Prometheus IDjmx_jvm_heap_max
Legacy IDjvm.heap.max
Metric Typecounter
Unitnumber
DescriptionThe maximum size allocation of heap memory for the JVM (defined by the -Xmx option). Any memory allocation attempt that would exceed this limit will cause an OutOfMemoryError exception to be thrown.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_heap_used

Prometheus IDjmx_jvm_heap_used
Legacy IDjvm.heap.used
Metric Typecounter
Unitnumber
DescriptionThe amount of allocated heap memory (i.e., Heap Committed) currently in use. Heap memory is the storage area for Java objects. An object in the heap that is referenced by another object is ‘live’, and will remain in the heap as long as it continues to be referenced. Objects that are no longer referenced are garbage and will be cleared out of the heap to reclaim space.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_heap_used_percent

Prometheus IDjmx_jvm_heap_used_percent
Legacy IDjvm.heap.used.percent
Metric Typegauge
Unitpercent
DescriptionThe ratio between Heap Used and Heap Committed.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.
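
Since jmx_jvm_heap_used_percent is simply the ratio of Heap Used to Heap Committed, a tiny Python sketch makes the relationship explicit. The byte values are hypothetical placeholders; in practice they come from the JVM via JMX:

```python
# Minimal sketch of the heap ratio behind jmx_jvm_heap_used_percent.
# Sample values are hypothetical; real ones are read from the JVM over JMX.

def used_percent(used: int, committed: int) -> float:
    """Heap Used as a percentage of Heap Committed."""
    return 100.0 * used / committed if committed else 0.0

heap_used, heap_committed, heap_max = 700 * 2**20, 1024 * 2**20, 2048 * 2**20

print(f"heap used: {used_percent(heap_used, heap_committed):.1f}% of committed")
print(f"headroom to -Xmx: {(heap_max - heap_used) / 2**20:.0f} MiB")
```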

jmx_jvm_nonHeap_committed

Prometheus IDjmx_jvm_nonHeap_committed
Legacy IDjvm.nonHeap.committed
Metric Typecounter
Unitnumber
DescriptionThe amount of memory that is currently allocated to the JVM for non-heap memory. Non-heap memory is used by Java to store loaded classes and other meta-data. The JVM may release memory to the system and Non-Heap Committed could decrease below Non-Heap Init; but Non-Heap Committed can never increase above Non-Heap Max.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_nonHeap_init

Prometheus IDjmx_jvm_nonHeap_init
Legacy IDjvm.nonHeap.init
Metric Typecounter
Unitnumber
DescriptionThe initial amount of memory that the JVM requests from the operating system for non-heap memory during startup. The JVM may request additional memory from the operating system and may also release memory to the system over time. The value of Non-Heap Init may be undefined.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_nonHeap_max

Prometheus IDjmx_jvm_nonHeap_max
Legacy IDjvm.nonHeap.max
Metric Typecounter
Unitnumber
DescriptionThe maximum size allocation of non-heap memory for the JVM. This memory is used by Java to store loaded classes and other meta-data.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_nonHeap_used

Prometheus IDjmx_jvm_nonHeap_used
Legacy IDjvm.nonHeap.used
Metric Typecounter
Unitnumber
DescriptionThe amount of allocated non-heap memory (i.e., Non-Heap Committed) currently in use. Non-heap memory is used by Java to store loaded classes and other meta-data.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_nonHeap_used_percent

Prometheus IDjmx_jvm_nonHeap_used_percent
Legacy IDjvm.nonHeap.used.percent
Metric Typegauge
Unitpercent
DescriptionThe ratio between Non-Heap Used and Non-Heap Committed.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_thread_count

Prometheus IDjmx_jvm_thread_count
Legacy IDjvm.thread.count
Metric Typegauge
Unitnumber
DescriptionThe current number of live daemon and non-daemon threads.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

jmx_jvm_thread_daemon

Prometheus IDjmx_jvm_thread_daemon
Legacy IDjvm.thread.daemon
Metric Typegauge
Unitnumber
DescriptionThe current number of live daemon threads. Daemon threads are used for background supporting tasks and are only needed while normal threads are executing.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

4.10.2.7 - Kubernetes

kube_daemonset_labels

Prometheus IDkube_daemonset_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_daemonset_status_current_number_scheduled

Prometheus IDkube_daemonset_status_current_number_scheduled
Legacy IDkubernetes.daemonSet.pods.scheduled
Metric Typegauge
Unitnumber
DescriptionThe number of nodes that are running at least one daemon Pod and are supposed to run it.
Additional Notes

kube_daemonset_status_desired_number_scheduled

Prometheus IDkube_daemonset_status_desired_number_scheduled
Legacy IDkubernetes.daemonSet.pods.desired
Metric Typegauge
Unitnumber
DescriptionThe number of nodes that should be running the daemon Pod.
Additional Notes

kube_daemonset_status_number_misscheduled

Prometheus IDkube_daemonset_status_number_misscheduled
Legacy IDkubernetes.daemonSet.pods.misscheduled
Metric Typegauge
Unitnumber
DescriptionThe number of nodes running a daemon Pod that are not supposed to run it.
Additional Notes

kube_daemonset_status_number_ready

Prometheus IDkube_daemonset_status_number_ready
Legacy IDkubernetes.daemonSet.pods.ready
Metric Typegauge
Unitnumber
DescriptionThe number of nodes that should be running the daemon Pod and have one or more of the daemon Pod running and ready.
Additional Notes
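
Taken together, the DaemonSet gauges above support a simple health check: a DaemonSet is unhealthy when fewer nodes are ready than desired, or when daemon Pods run on nodes that should not have them. A minimal Python sketch with hypothetical sample values:

```python
# Minimal sketch of a DaemonSet health check built from the gauges above
# (desired vs. ready scheduled nodes, plus misscheduled Pods).
# Sample values are hypothetical; real ones come from your metrics store.

def daemonset_unhealthy(desired: int, ready: int, misscheduled: int) -> bool:
    """True if the DaemonSet has missing or misplaced daemon Pods."""
    return ready < desired or misscheduled > 0

samples = {
    "kube-system/node-exporter": (5, 5, 0),
    "kube-system/log-agent":     (5, 3, 1),
}
for name, (desired, ready, misscheduled) in samples.items():
    if daemonset_unhealthy(desired, ready, misscheduled):
        print(f"{name}: {ready}/{desired} nodes ready, {misscheduled} misscheduled")
```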

kube_deployment_labels

Prometheus IDkube_deployment_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_deployment_spec_paused

Prometheus IDkube_deployment_spec_paused
Legacy IDkubernetes.deployment.replicas.paused
Metric Typegauge
Unitnumber
DescriptionThe number of paused Pods per deployment. These Pods will not be processed by the deployment controller.
Additional Notes

kube_deployment_spec_replicas

Prometheus IDkube_deployment_spec_replicas
Legacy IDkubernetes.deployment.replicas.desired
Metric Typegauge
Unitnumber
DescriptionThe number of desired Pods per deployment.
Additional Notes

kube_deployment_status_replicas

Prometheus IDkube_deployment_status_replicas
Legacy IDkubernetes.deployment.replicas.running
Metric Typegauge
Unitnumber
DescriptionThe number of running Pods per deployment.
Additional Notes

kube_deployment_status_replicas_available

Prometheus IDkube_deployment_status_replicas_available
Legacy IDkubernetes.deployment.replicas.available
Metric Typegauge
Unitnumber
DescriptionThe number of available Pods per deployment.
Additional Notes

kube_deployment_status_replicas_unavailable

Prometheus IDkube_deployment_status_replicas_unavailable
Legacy IDkubernetes.deployment.replicas.unavailable
Metric Typegauge
Unitnumber
DescriptionThe number of unavailable Pods per deployment.
Additional Notes

kube_deployment_status_replicas_updated

Prometheus IDkube_deployment_status_replicas_updated
Legacy IDkubernetes.deployment.replicas.updated
Metric Typegauge
Unitnumber
DescriptionThe number of updated Pods per deployment.
Additional Notes

kube_hpa_labels

Prometheus IDkube_hpa_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_hpa_spec_max_replicas

Prometheus IDkube_hpa_spec_max_replicas
Legacy IDkubernetes.hpa.replicas.max
Metric Typegauge
Unitnumber
DescriptionUpper limit for the number of Pods that can be set by the autoscaler.
Additional Notes

kube_hpa_spec_min_replicas

Prometheus IDkube_hpa_spec_min_replicas
Legacy IDkubernetes.hpa.replicas.min
Metric Typegauge
Unitnumber
DescriptionLower limit for the number of Pods that can be set by the autoscaler.
Additional Notes

kube_hpa_status_current_replicas

Prometheus IDkube_hpa_status_current_replicas
Legacy IDkubernetes.hpa.replicas.current
Metric Typegauge
Unitnumber
DescriptionCurrent number of replicas of Pods managed by this autoscaler.
Additional Notes

kube_hpa_status_desired_replicas

Prometheus IDkube_hpa_status_desired_replicas
Legacy IDkubernetes.hpa.replicas.desired
Metric Typegauge
Unitnumber
DescriptionDesired number of replicas of Pods managed by this autoscaler.
Additional Notes
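
One practical use of the HPA gauges above is spotting autoscalers that have hit their ceiling: when the desired replica count reaches spec_max_replicas, the HPA cannot scale out any further even if load keeps rising. A minimal Python sketch with hypothetical sample values:

```python
# Minimal sketch of an HPA saturation check using the gauges above.
# Sample values are hypothetical; real ones come from your metrics store.

def hpa_saturated(desired: int, max_replicas: int) -> bool:
    """True if the autoscaler wants as many (or more) replicas than it is allowed."""
    return desired >= max_replicas

hpas = {
    "web-frontend": {"max": 10, "current": 10, "desired": 10},
    "worker-queue": {"max": 20, "current": 6,  "desired": 7},
}
for name, h in hpas.items():
    if hpa_saturated(h["desired"], h["max"]):
        print(f"{name}: at max capacity ({h['current']}/{h['max']} replicas)")
```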

kube_job_complete

Prometheus IDkube_job_complete
Legacy IDkubernetes.job.numSucceeded
Metric Typegauge
Unitnumber
DescriptionThe number of Pods which reached Phase Succeeded.
Additional Notes

kube_job_failed

Prometheus IDkube_job_failed
Legacy IDkubernetes.job.numFailed
Metric Typegauge
Unitnumber
DescriptionThe number of Pods which reached Phase Failed.
Additional Notes

kube_job_info

Prometheus IDkube_job_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_job_labels

Prometheus IDkube_job_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_job_owner

Prometheus IDkube_job_owner
Legacy ID
Metric Typegauge
Unitnumber
DescriptionInformation about the owner of the job is stored as labels on the metric.
Additional NotesThe value of the metric will always be 1.

kube_job_spec_completions

Prometheus IDkube_job_spec_completions
Legacy IDkubernetes.job.completions
Metric Typegauge
Unitnumber
DescriptionThe desired number of successfully finished Pods that the job should be run with.
Additional Notes

kube_job_spec_parallelism

Prometheus IDkube_job_spec_parallelism
Legacy IDkubernetes.job.parallelism
Metric Typegauge
Unitnumber
DescriptionThe maximum desired number of Pods that the job should run at any given time.
Additional Notes

kube_job_status_active

Prometheus IDkube_job_status_active
Legacy IDkubernetes.job.status.active
Metric Typegauge
Unitnumber
DescriptionThe number of actively running Pods.
Additional Notes

kube_namespace_labels

Prometheus IDkube_namespace_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_namespace_sysdig_count

Prometheus IDkube_namespace_sysdig_count
Legacy IDkubernetes.namespace.count
Metric Typegauge
Unitnumber
DescriptionThe number of namespaces.
Additional Notes

kube_namespace_sysdig_deployment_count

Prometheus IDkube_namespace_sysdig_deployment_count
Legacy IDkubernetes.namespace.deployment.count
Metric Typegauge
Unitnumber
DescriptionThe number of deployments per namespace.
Additional Notes

kube_namespace_sysdig_hpa_count

Prometheus IDkube_namespace_sysdig_hpa_count
Legacy IDkubernetes.namespace.hpa.count
Metric Typegauge
Unitnumber
DescriptionThe number of HPAs per namespace.
Additional Notes

kube_namespace_sysdig_job_count

Prometheus IDkube_namespace_sysdig_job_count
Legacy IDkubernetes.namespace.job.count
Metric Typegauge
Unitnumber
DescriptionThe number of jobs per namespace.
Additional Notes

kube_namespace_sysdig_persistentvolumeclaim_count

Prometheus IDkube_namespace_sysdig_persistentvolumeclaim_count
Legacy IDkubernetes.namespace.persistentvolumeclaim.count
Metric Typegauge
Unitnumber
DescriptionThe number of persistent volume claims per namespace.
Additional Notes

kube_namespace_sysdig_pod_available_count

Prometheus IDkube_namespace_sysdig_pod_available_count
Legacy IDkubernetes.namespace.pod.available.count
Metric Typegauge
Unitnumber
DescriptionThe number of available Pods per namespace.
Additional Notes

kube_namespace_sysdig_pod_desired_count

Prometheus IDkube_namespace_sysdig_pod_desired_count
Legacy IDkubernetes.namespace.pod.desired.count
Metric Typegauge
Unitnumber
DescriptionThe number of desired Pods per namespace.
Additional Notes

kube_namespace_sysdig_pod_running_count

Prometheus IDkube_namespace_sysdig_pod_running_count
Legacy IDkubernetes.namespace.pod.running.count
Metric Typegauge
Unitnumber
DescriptionThe number of Pods running per namespace.
Additional Notes

kube_namespace_sysdig_replicaset_count

Prometheus IDkube_namespace_sysdig_replicaset_count
Legacy IDkubernetes.namespace.replicaSet.count
Metric Typegauge
Unitnumber
DescriptionThe number of replicaSets per namespace.
Additional Notes

kube_namespace_sysdig_resourcequota_count

Prometheus IDkube_namespace_sysdig_resourcequota_count
Legacy IDkubernetes.namespace.resourcequota.count
Metric Typegauge
Unitnumber
DescriptionThe number of resource quota per namespace.
Additional Notes

kube_namespace_sysdig_service_count

Prometheus IDkube_namespace_sysdig_service_count
Legacy IDkubernetes.namespace.service.count
Metric Typegauge
Unitnumber
DescriptionThe number of services per namespace.
Additional Notes

kube_namespace_sysdig_statefulset_count

Prometheus IDkube_namespace_sysdig_statefulset_count
Legacy IDkubernetes.namespace.statefulSet.count
Metric Typegauge
Unitnumber
DescriptionThe number of statefulset per namespace.
Additional Notes

kube_node_info

Prometheus IDkube_node_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_node_labels

Prometheus IDkube_node_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_node_spec_unschedulable

Prometheus IDkube_node_spec_unschedulable
Legacy IDkubernetes.node.unschedulable
Metric Typegauge
Unitnumber
DescriptionThe number of nodes unavailable to schedule new Pods.
Additional Notes

kube_node_status_allocatable

Prometheus IDkube_node_status_allocatable
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of a resource on a node that is freely available.
Additional NotesThe type and unit of the resource are stored as labels on the metric.
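
Because the resource type and unit are carried as labels rather than encoded in the metric name, consumers typically filter on those labels to isolate a single resource. The following minimal Python sketch illustrates that pattern on hand-written samples; the "resource" and "unit" label names are assumptions made for the example, not guaranteed by the product.

# Minimal sketch: filtering a label-bearing gauge such as
# kube_node_status_allocatable by its resource label.
# The "resource" and "unit" label names are assumptions for illustration.

samples = [
    {"node": "node-a", "resource": "cpu",    "unit": "core", "value": 4.0},
    {"node": "node-a", "resource": "memory", "unit": "byte", "value": 16_000_000_000.0},
    {"node": "node-b", "resource": "cpu",    "unit": "core", "value": 8.0},
]

def allocatable(samples, resource):
    """Return {node: value} for a single resource type."""
    return {s["node"]: s["value"] for s in samples if s["resource"] == resource}

print(allocatable(samples, "cpu"))  # {'node-a': 4.0, 'node-b': 8.0}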

kube_node_status_allocatable_cpu_cores

Prometheus IDkube_node_status_allocatable_cpu_cores
Legacy IDkubernetes.node.allocatable.cpuCores
Metric Typegauge
Unitnumber
DescriptionThe CPU resources of a node that are available for scheduling.
Additional Notes

kube_node_status_allocatable_memory_bytes

Prometheus IDkube_node_status_allocatable_memory_bytes
Legacy IDkubernetes.node.allocatable.memBytes
Metric Typegauge
Unitdata
DescriptionThe memory resources of a node that are available for scheduling.
Additional Notes

kube_node_status_allocatable_pods

Prometheus IDkube_node_status_allocatable_pods
Legacy IDkubernetes.node.allocatable.pods
Metric Typegauge
Unitnumber
DescriptionThe Pod resources of a node that are available for scheduling.
Additional Notes

kube_node_status_capacity

Prometheus IDkube_node_status_capacity
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe total amount of a resource on a node.
Additional NotesThe type and unit of the resource are stored as labels on the metric.

kube_node_status_capacity_cpu_cores

Prometheus IDkube_node_status_capacity_cpu_cores
Legacy IDkubernetes.node.capacity.cpuCores
Metric Typegauge
Unitnumber
DescriptionThe maximum CPU resources of the node.
Additional Notes

kube_node_status_capacity_memory_bytes

Prometheus IDkube_node_status_capacity_memory_bytes
Legacy IDkubernetes.node.capacity.memBytes
Metric Typegauge
Unitdata
DescriptionThe maximum memory resources of the node.
Additional Notes

kube_node_status_capacity_pods

Prometheus IDkube_node_status_capacity_pods
Legacy IDkubernetes.node.capacity.pods
Metric Typegauge
Unitnumber
DescriptionThe maximum number of Pods of the node.
Additional Notes

kube_node_status_condition

Prometheus IDkube_node_status_condition
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the condition of the node as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_node_sysdig_disk_pressure

Prometheus IDkube_node_sysdig_disk_pressure
Legacy IDkubernetes.node.diskPressure
Metric Typegauge
Unitnumber
DescriptionThe number of nodes with disk pressure.
Additional Notes

kube_node_sysdig_host

Prometheus IDkube_node_sysdig_host
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the hostname of the node as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_node_sysdig_memory_pressure

Prometheus IDkube_node_sysdig_memory_pressure
Legacy IDkubernetes.node.memoryPressure
Metric Typegauge
Unitnumber
DescriptionThe number of nodes with memory pressure.
Additional Notes

kube_node_sysdig_network_unavailable

Prometheus IDkube_node_sysdig_network_unavailable
Legacy IDkubernetes.node.networkUnavailable
Metric Typegauge
Unitnumber
DescriptionThe number of nodes with network unavailable.
Additional Notes

kube_node_sysdig_ready

Prometheus IDkube_node_sysdig_ready
Legacy IDkubernetes.node.ready
Metric Typegauge
Unitnumber
DescriptionThe number of nodes that are ready.
Additional Notes

kube_persistentvolume_capacity_bytes

Prometheus IDkube_persistentvolume_capacity_bytes
Legacy IDkubernetes.persistentvolume.storage
Metric Typegauge
Unitnumber
DescriptionThe persistent volume’s capacity.
Additional Notes

kube_persistentvolume_claim_ref

Prometheus IDkube_persistentvolume_claim_ref
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the claim’s name and namespace as labels on the metric.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolume_info

Prometheus IDkube_persistentvolume_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolume_labels

Prometheus IDkube_persistentvolume_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_persistentvolume_status_phase

Prometheus IDkube_persistentvolume_status_phase
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the phase of the PV as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolumeclaim_access_mode

Prometheus IDkube_persistentvolumeclaim_access_mode
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the access mode of the PVC as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolumeclaim_info

Prometheus IDkube_persistentvolumeclaim_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolumeclaim_labels

Prometheus IDkube_persistentvolumeclaim_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_persistentvolumeclaim_resource_requests_storage_bytes

Prometheus IDkube_persistentvolumeclaim_resource_requests_storage_bytes
Legacy IDkubernetes.persistentvolumeclaim.requests.storage
Metric Typegauge
Unitnumber
DescriptionThe number of bytes that the PVC has requested.
Additional Notes

kube_persistentvolumeclaim_status_phase

Prometheus IDkube_persistentvolumeclaim_status_phase
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the phase of the PVC as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_persistentvolumeclaim_sysdig_storage

Prometheus IDkube_persistentvolumeclaim_sysdig_storage
Legacy IDkubernetes.persistentvolumeclaim.storage
Metric Typegauge
Unitnumber
DescriptionThe actual resources of the underlying volume.
Additional Notes

kube_pod_container_info

Prometheus IDkube_pod_container_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_pod_container_resource_limits

Prometheus IDkube_pod_container_resource_limits
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of the resource limit for a container in a pod.
Additional Notes

kube_pod_container_resource_requests

Prometheus IDkube_pod_container_resource_requests
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of the resource request for a container in a pod.
Additional Notes

kube_pod_container_status_last_terminated_reason

Prometheus IDkube_pod_container_status_last_terminated_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason for the last terminated state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_container_status_ready

Prometheus IDkube_pod_container_status_ready
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of containers in the Pod in the ready state.
Additional Notes

kube_pod_container_status_restarts_total

Prometheus IDkube_pod_container_status_restarts_total
Legacy ID
Metric Typecounter
Unitnumber
DescriptionThe number of times that containers in the Pod have restarted.
Additional Notes
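
Because this is a cumulative counter, the useful quantity is usually its increase over a time window rather than its raw value. The Python sketch below derives that increase from two scrapes of the counter and flags Pods above an arbitrary threshold; it assumes no counter reset occurred between the samples and is only an illustration, not the product's alerting logic.

# Minimal sketch: turning a cumulative restart counter (such as
# kube_pod_container_status_restarts_total) into "restarts in the
# last window". Assumes no counter reset between the two samples.

earlier = {"web-1": 3, "web-2": 0, "worker-1": 7}   # counter values at t0
latest  = {"web-1": 3, "web-2": 5, "worker-1": 8}   # counter values at t1

THRESHOLD = 3  # example value: restarts per window considered unhealthy

for pod, value in latest.items():
    increase = value - earlier.get(pod, 0)
    if increase >= THRESHOLD:
        print(f"{pod}: {increase} restarts in the window")  # web-2: 5 restarts in the window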

kube_pod_container_status_running

Prometheus IDkube_pod_container_status_running
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of containers in the Pod in the running state.
Additional Notes

kube_pod_container_status_terminated

Prometheus IDkube_pod_container_status_terminated
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of containers in the Pod in the terminated state.
Additional Notes

kube_pod_container_status_terminated_reason

Prometheus IDkube_pod_container_status_terminated_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason that the container is in the terminated state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_container_status_waiting

Prometheus IDkube_pod_container_status_waiting
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of containers in the Pod in the waiting state.
Additional Notes

kube_pod_container_status_waiting_reason

Prometheus IDkube_pod_container_status_waiting_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason that the container is in the waiting state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_info

Prometheus IDkube_pod_info
Legacy IDkubernetes.pod.info
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_pod_init_container_resource_limits

Prometheus IDkube_pod_init_container_resource_limits
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of the resource limit for an init container in a pod.
Additional Notes

kube_pod_init_container_resource_requests

Prometheus IDkube_pod_init_container_resource_requests
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of the resource request for an init container in a pod.
Additional Notes

kube_pod_init_container_status_last_terminated_reason

Prometheus IDkube_pod_init_container_status_last_terminated_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason for the last terminated state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_init_container_status_ready

Prometheus IDkube_pod_init_container_status_ready
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of init containers in the Pod in the ready state.
Additional Notes

kube_pod_init_container_status_restarts_total

Prometheus IDkube_pod_init_container_status_restarts_total
Legacy ID
Metric Typecounter
Unitnumber
DescriptionThe number of times that init containers in the Pod have restarted.
Additional Notes

kube_pod_init_container_status_running

Prometheus IDkube_pod_init_container_status_running
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of init containers in the Pod in the running state.
Additional Notes

kube_pod_init_container_status_terminated

Prometheus IDkube_pod_init_container_status_terminated
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of init containers in the Pod in the terminated state.
Additional Notes

kube_pod_init_container_status_terminated_reason

Prometheus IDkube_pod_init_container_status_terminated_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason that the init container is in the terminated state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_init_container_status_waiting

Prometheus IDkube_pod_init_container_status_waiting
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe number of init containers in the Pod in the waiting state.
Additional Notes

kube_pod_init_container_status_waiting_reason

Prometheus IDkube_pod_init_container_status_waiting_reason
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the reason that the init container is in the waiting state as a label on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_labels

Prometheus IDkube_pod_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.
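
Metrics of this kind carry no numeric signal of their own (the value is always 1); their purpose is to let Kubernetes labels be joined onto other series that share identifying labels such as namespace and pod. The Python sketch below shows that join on hand-written samples; "label_app" is a hypothetical example of a prefixed label.

# Minimal sketch: enriching a numeric metric with Kubernetes labels
# exposed by a kube_pod_labels-style info metric (value always 1).
# "label_app" is a hypothetical prefixed label used for illustration.

pod_labels = {
    ("prod", "web-1"):    {"label_app": "web"},
    ("prod", "worker-1"): {"label_app": "worker"},
}

restarts = [
    {"namespace": "prod", "pod": "web-1",    "value": 5},
    {"namespace": "prod", "pod": "worker-1", "value": 0},
]

for sample in restarts:
    key = (sample["namespace"], sample["pod"])
    app = pod_labels.get(key, {}).get("label_app", "unknown")
    print(f"app={app} pod={sample['pod']} restarts={sample['value']}")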

kube_pod_owner

Prometheus IDkube_pod_owner
Legacy ID
Metric Typegauge
Unitnumber
DescriptionInformation about the owner of the pod is stored as labels on the metric.
Additional NotesThe value of the metric will always be 1.

kube_pod_spec_volumes_persistentvolumeclaims_info

Prometheus IDkube_pod_spec_volumes_persistentvolumeclaims_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores information about the PVC specified in a Pod’s spec.
Additional NotesThe value of the metric will always be 1.

kube_pod_spec_volumes_persistentvolumeclaims_readonly

Prometheus IDkube_pod_spec_volumes_persistentvolumeclaims_readonly
Legacy ID
Metric Typegauge
Unitnumber
DescriptionDescribes whether a PVC is mounted read-only.
Additional NotesThe value of the metric will be 1 if the PVC is read-only and 0 if not.

kube_pod_sysdig_containers_waiting

Prometheus IDkube_pod_sysdig_containers_waiting
Legacy IDkubernetes.pod.containers.waiting
Metric Typegauge
Unitnumber
DescriptionThe number of containers in the waiting state for the Pod.
Additional Notes

kube_pod_sysdig_resource_limits_cpu_cores

Prometheus IDkube_pod_sysdig_resource_limits_cpu_cores
Legacy IDkubernetes.pod.resourceLimits.cpuCores
Metric Typegauge
Unitnumber
DescriptionThe limit on CPU cores to be used by a container.
Additional Notes

kube_pod_sysdig_resource_limits_memory_bytes

Prometheus IDkube_pod_sysdig_resource_limits_memory_bytes
Legacy IDkubernetes.pod.resourceLimits.memBytes
Metric Typegauge
Unitdata
DescriptionThe limit on memory to be used by a container in bytes.
Additional Notes

kube_pod_sysdig_resource_requests_cpu_cores

Prometheus IDkube_pod_sysdig_resource_requests_cpu_cores
Legacy IDkubernetes.pod.resourceRequests.cpuCores
Metric Typegauge
Unitnumber
DescriptionThe number of CPU cores requested by containers in the Pod.
Additional Notes

kube_pod_sysdig_resource_requests_memory_bytes

Prometheus IDkube_pod_sysdig_resource_requests_memory_bytes
Legacy IDkubernetes.pod.resourceRequests.memBytes
Metric Typegauge
Unitdata
DescriptionThe number of memory bytes requested by containers in the Pod.
Additional Notes

kube_pod_sysdig_restart_count

Prometheus IDkube_pod_sysdig_restart_count
Legacy IDkubernetes.pod.restart.count
Metric Typegauge
Unitnumber
DescriptionThe number of container restarts for the Pod.
Additional Notes

kube_pod_sysdig_restart_rate

Prometheus IDkube_pod_sysdig_restart_rate
Legacy IDkubernetes.pod.restart.rate
Metric Typegauge
Unitnumber
DescriptionThe number of times the Pod has been restarted per second.
Additional Notes

kube_pod_sysdig_status_ready

Prometheus IDkube_pod_sysdig_status_ready
Legacy IDkubernetes.pod.status.ready
Metric Typegauge
Unitnumber
DescriptionThe number of pods ready to serve requests.
Additional Notes

kube_replicaset_labels

Prometheus IDkube_replicaset_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_replicaset_owner

Prometheus IDkube_replicaset_owner
Legacy ID
Metric Typegauge
Unitnumber
DescriptionInformation about the owner of the pod is stored as labels on the metric.
Additional NotesThe value of the metric will always be 1.

kube_replicaset_spec_replicas

Prometheus IDkube_replicaset_spec_replicas
Legacy IDkubernetes.replicaSet.replicas.desired
Metric Typegauge
Unitnumber
DescriptionThe number of desired Pods per replicaSet.
Additional Notes

kube_replicaset_status_fully_labeled_replicas

Prometheus IDkube_replicaset_status_fully_labeled_replicas
Legacy IDkubernetes.replicaSet.replicas.fullyLabeled
Metric Typegauge
Unitnumber
DescriptionThe number of fully labeled Pods per replicaSet.
Additional Notes

kube_replicaset_status_ready_replicas

Prometheus IDkube_replicaset_status_ready_replicas
Legacy IDkubernetes.replicaSet.replicas.ready
Metric Typegauge
Unitnumber
DescriptionThe number of ready Pods per replicaSet.
Additional Notes

kube_replicaset_status_replicas

Prometheus IDkube_replicaset_status_replicas
Legacy IDkubernetes.replicaSet.replicas.running
Metric Typegauge
Unitnumber
DescriptionThe number of running Pods per replicaSet.
Additional Notes

kube_resourcequota

Prometheus IDkube_resourcequota
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe amount of a resource that the resource quota is configured for.
Additional NotesThe resource type and whether the quota is hard or soft is stored as labels on the metric.

kube_resourcequota_sysdig_limits_cpu_hard

Prometheus IDkube_resourcequota_sysdig_limits_cpu_hard
Legacy IDkubernetes.resourcequota.limits.cpu.hard
Metric Typegauge
Unitnumber
DescriptionEnforced CPU Limit quota per namespace.
Additional Notes

kube_resourcequota_sysdig_limits_cpu_used

Prometheus IDkube_resourcequota_sysdig_limits_cpu_used
Legacy IDkubernetes.resourcequota.limits.cpu.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed CPU limit usage per namespace.
Additional Notes

kube_resourcequota_sysdig_limits_memory_hard

Prometheus IDkube_resourcequota_sysdig_limits_memory_hard
Legacy IDkubernetes.resourcequota.limits.memory.hard
Metric Typegauge
Unitnumber
DescriptionEnforced memory limit quota per namespace.
Additional Notes

kube_resourcequota_sysdig_limits_memory_used

Prometheus IDkube_resourcequota_sysdig_limits_memory_used
Legacy IDkubernetes.resourcequota.limits.memory.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed memory limit usage per namespace.
Additional Notes

kube_resourcequota_sysdig_persistentvolumeclaims_hard

Prometheus IDkube_resourcequota_sysdig_persistentvolumeclaims_hard
Legacy IDkubernetes.resourcequota.persistentvolumeclaims.hard
Metric Typegauge
Unitnumber
DescriptionEnforced Persistentvolumeclaim quota per namespace.
Additional Notes

kube_resourcequota_sysdig_persistentvolumeclaims_used

Prometheus IDkube_resourcequota_sysdig_persistentvolumeclaims_used
Legacy IDkubernetes.resourcequota.persistentvolumeclaims.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed Persistentvolumeclaim usage per namespace.
Additional Notes

kube_resourcequota_sysdig_pods_hard

Prometheus IDkube_resourcequota_sysdig_pods_hard
Legacy IDkubernetes.resourcequota.pods.hard
Metric Typegauge
Unitnumber
DescriptionEnforced Pod quota per namespace.
Additional Notes

kube_resourcequota_sysdig_pods_used

Prometheus IDkube_resourcequota_sysdig_pods_used
Legacy IDkubernetes.resourcequota.pods.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed Pod usage per namespace.
Additional Notes

kube_resourcequota_sysdig_requests_cpu_hard

Prometheus IDkube_resourcequota_sysdig_requests_cpu_hard
Legacy IDkubernetes.resourcequota.requests.cpu.hard
Metric Typegauge
Unitnumber
DescriptionEnforced CPU request quota per namespace.
Additional Notes

kube_resourcequota_sysdig_requests_cpu_used

Prometheus IDkube_resourcequota_sysdig_requests_cpu_used
Legacy IDkubernetes.resourcequota.requests.cpu.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed CPU request usage per namespace.
Additional Notes
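
The hard/used pairs in this group are most informative as a ratio, which shows how close each namespace is to exhausting its quota. A minimal Python sketch, using made-up per-namespace values:

# Minimal sketch: CPU request quota utilization per namespace,
# computed from the *_requests_cpu_hard and *_requests_cpu_used pair.

hard = {"prod": 8.0, "staging": 4.0}   # enforced CPU request quota (cores)
used = {"prod": 6.5, "staging": 0.5}   # observed CPU request usage (cores)

for namespace, limit in hard.items():
    utilization = used.get(namespace, 0.0) / limit * 100
    print(f"{namespace}: {utilization:.1f}% of CPU request quota used")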

kube_resourcequota_sysdig_requests_memory_hard

Prometheus IDkube_resourcequota_sysdig_requests_memory_hard
Legacy IDkubernetes.resourcequota.requests.memory.hard
Metric Typegauge
Unitnumber
DescriptionEnforced memory request quota per namespace.
Additional Notes

kube_resourcequota_sysdig_requests_memory_used

Prometheus IDkube_resourcequota_sysdig_requests_memory_used
Legacy IDkubernetes.resourcequota.requests.memory.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed memory request usage per namespace.
Additional Notes

kube_resourcequota_sysdig_services_hard

Prometheus IDkube_resourcequota_sysdig_services_hard
Legacy IDkubernetes.resourcequota.services.hard
Metric Typegauge
Unitnumber
DescriptionEnforced service quota per namespace.
Additional Notes

kube_resourcequota_sysdig_services_used

Prometheus IDkube_resourcequota_sysdig_services_used
Legacy IDkubernetes.resourcequota.services.used
Metric Typegauge
Unitnumber
DescriptionCurrent observed service usage per namespace.
Additional Notes

kube_service_info

Prometheus IDkube_service_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_service_labels

Prometheus IDkube_service_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_statefulset_labels

Prometheus IDkube_statefulset_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_statefulset_replicas

Prometheus IDkube_statefulset_replicas
Legacy IDkubernetes.statefulSet.replicas
Metric Typegauge
Unitnumber
DescriptionDesired number of replicas of the given Template.
Additional Notes

kube_statefulset_status_replicas

Prometheus IDkube_statefulset_status_replicas
Legacy IDkubernetes.statefulSet.status.replicas
Metric Typegauge
Unitnumber
DescriptionNumber of Pods created by the StatefulSet controller.
Additional Notes

kube_statefulset_status_replicas_current

Prometheus IDkube_statefulset_status_replicas_current
Legacy IDkubernetes.statefulSet.status.replicas.current
Metric Typegauge
Unitnumber
DescriptionThe number of Pods created by the StatefulSet controller from the StatefulSet version indicated by currentRevision.
Additional Notes

kube_statefulset_status_replicas_ready

Prometheus IDkube_statefulset_status_replicas_ready
Legacy IDkubernetes.statefulSet.status.replicas.ready
Metric Typegauge
Unitnumber
DescriptionNumber of Pods created by the StatefulSet controller that have a Ready Condition.
Additional Notes

kube_statefulset_status_replicas_updated

Prometheus IDkube_statefulset_status_replicas_updated
Legacy IDkubernetes.statefulSet.status.replicas.updated
Metric Typegauge
Unitnumber
DescriptionNumber of Pods created by the StatefulSet controller from the StatefulSet version indicated by updateRevision.
Additional Notes

kube_storageclass_created

Prometheus IDkube_storageclass_created
Legacy ID
Metric Typegauge
Unitnumber
DescriptionUnix epoch time when the storageclass was created.
Additional Notes

kube_storageclass_info

Prometheus IDkube_storageclass_info
Legacy ID
Metric Typegauge
Unitnumber
DescriptionThe labels on the metric store information about the object.
Additional NotesThe value of the metric will always be 1.

kube_storageclass_labels

Prometheus IDkube_storageclass_labels
Legacy ID
Metric Typegauge
Unitnumber
DescriptionStores the labels associated with the object as labels on the metric.
Additional NotesThe value of the metric will always be 1. The labels will be prepended with ‘label_’.

kube_workload_pods_status_phase

Prometheus IDkube_workload_pods_status_phase
Legacy IDkubernetes.workload.pods.status.phase
Metric Typegauge
Unitnumber
DescriptionThe number of Pods in a particular phase for the workload.
Additional NotesStores the phase as a label on the metric.

kube_workload_status_replicas_misscheduled

Prometheus IDkube_workload_status_replicas_misscheduled
Legacy IDkubernetes.workload.status.replicas.misscheduled
Metric Typegauge
Unitnumber
DescriptionThe number of running Pods for a workload that are not supposed to be running.
Additional Notes

kube_workload_status_replicas_scheduled

Prometheus IDkube_workload_status_replicas_scheduled
Legacy IDkubernetes.workload.status.replicas.scheduled
Metric Typegauge
Unitnumber
DescriptionThe number of Pods scheduled to be run for a workload.
Additional Notes

kube_workload_status_replicas_updated

Prometheus IDkube_workload_status_replicas_updated
Legacy IDkubernetes.workload.status.replicas.updated
Metric Typegauge
Unitnumber
DescriptionThe number of updated Pods per workload.
Additional Notes

kube_workload_status_running

Prometheus IDkube_workload_status_running
Legacy IDkubernetes.workload.status.running
Metric Typegauge
Unitnumber
DescriptionThe number of running Pods for a workload.
Additional Notes

kube_workload_status_unavailable

Prometheus IDkube_workload_status_unavailable
Legacy IDkubernetes.workload.status.unavailable
Metric Typegauge
Unitnumber
DescriptionThe number of unavailable Pods per workload.
Additional Notes

4.10.2.8 - Network

sysdig_connection_net_connection_in_count

Prometheus IDsysdig_connection_net_connection_in_count
Legacy IDnet.connection.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of currently established client (inbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_connection_net_connection_out_count

Prometheus IDsysdig_connection_net_connection_out_count
Legacy IDnet.connection.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of currently established server (outbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_connection_net_connection_total_count

Prometheus IDsysdig_connection_net_connection_total_count
Legacy IDnet.connection.count.total
Metric Typecounter
Unitnumber
DescriptionThe number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_connection_net_in_bytes

Prometheus IDsysdig_connection_net_in_bytes
Legacy IDnet.bytes.in
Metric Typecounter
Unitdata
DescriptionThe number of inbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_connection_net_out_bytes

Prometheus IDsysdig_connection_net_out_bytes
Legacy IDnet.bytes.out
Metric Typecounter
Unitdata
DescriptionThe number of outbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_connection_net_request_count

Prometheus IDsysdig_connection_net_request_count
Legacy IDnet.request.count
Metric Typecounter
Unitnumber
DescriptionThe total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.
Additional Notes

sysdig_connection_net_request_in_count

Prometheus IDsysdig_connection_net_request_in_count
Legacy IDnet.request.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of inbound network requests.
Additional Notes

sysdig_connection_net_request_in_time

Prometheus IDsysdig_connection_net_request_in_time
Legacy IDnet.request.time.in
Metric Typecounter
Unittime
DescriptionThe average time to serve an inbound request.
Additional Notes

sysdig_connection_net_request_out_count

Prometheus IDsysdig_connection_net_request_out_count
Legacy IDnet.request.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of outbound network requests.
Additional Notes

sysdig_connection_net_request_out_time

Prometheus IDsysdig_connection_net_request_out_time
Legacy IDnet.request.time.out
Metric Typecounter
Unittime
DescriptionThe average time spent waiting for an outbound request.
Additional Notes

sysdig_connection_net_request_time

Prometheus IDsysdig_connection_net_request_time
Legacy IDnet.request.time
Metric Typecounter
Unittime
DescriptionThe average time to serve a network request.
Additional Notes

sysdig_connection_net_total_bytes

Prometheus IDsysdig_connection_net_total_bytes
Legacy IDnet.bytes.total
Metric Typecounter
Unitdata
DescriptionThe total network bytes, including both inbound and outbound connections.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.
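
Because the byte metrics in this group are cumulative counters, throughput is obtained by dividing the growth between two samples by the elapsed time. A minimal Python sketch of that calculation, using made-up sample values:

# Minimal sketch: deriving bytes/second from two samples of a
# cumulative counter such as sysdig_connection_net_total_bytes.

t0, bytes_t0 = 0.0, 1_200_000.0      # time (s) and cumulative bytes at t0
t1, bytes_t1 = 60.0, 4_800_000.0     # time (s) and cumulative bytes at t1

rate_bps = (bytes_t1 - bytes_t0) / (t1 - t0)
print(f"average throughput: {rate_bps:.0f} bytes/second")  # 60000 bytes/second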

4.10.2.9 - Program

sysdig_program_cpu_cores_used

Prometheus IDsysdig_program_cpu_cores_used
Legacy IDcpu.cores.used
Metric Typegauge
Unitnumber
DescriptionThe CPU core usage of each program is obtained from cgroups, and is equal to the number of cores used by the program. For example, if a program uses two of an available four cores, the value of sysdig_program_cpu_cores_used will be two.
Additional Notes

sysdig_program_cpu_cores_used_percent

Prometheus IDsysdig_program_cpu_cores_used_percent
Legacy IDcpu.cores.used.percent
Metric Typegauge
Unitpercent
DescriptionThe CPU core usage percent for each program is obtained from cgroups, and is equal to the number of cores used multiplied by 100. For example, if a program uses three cores, the value of sysdig_program_cpu_cores_used_percent would be 300%.
Additional Notes

sysdig_program_cpu_used_percent

Prometheus IDsysdig_program_cpu_used_percent
Legacy IDcpu.used.percent
Metric Typegauge
Unitpercent
DescriptionThe CPU usage for each program is obtained from cgroups, and normalized by dividing by the number of cores to determine an overall percentage. For example, if the environment contains six cores on a host, and the processes are assigned two cores, Sysdig will report CPU usage of 2/6 * 100% = 33.33%. This metric is calculated differently for hosts and containers.
Additional Notes
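
The normalization described above is straightforward to reproduce: divide the cores the program used by the cores available on the host and multiply by 100. The short Python sketch below mirrors the two-of-six-cores example from the description:

# Minimal sketch: host-normalized CPU percentage as described for
# sysdig_program_cpu_used_percent (cores used / host cores * 100).

def cpu_used_percent(cores_used: float, host_cores: int) -> float:
    return cores_used / host_cores * 100

print(round(cpu_used_percent(2, 6), 2))  # 33.33, matching the example above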

sysdig_program_fd_used_percent

Prometheus IDsysdig_program_fd_used_percent
Legacy IDfd.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of used file descriptors out of the maximum available.
Additional NotesUsually, when a process reaches its FD limit it will stop operating properly and possibly crash. As a consequence, this is a metric you want to monitor carefully, or even better use for alerts.

sysdig_program_file_error_open_count

Prometheus IDsysdig_program_file_error_open_count
Legacy IDfile.error.open.count
Metric Typecounter
Unitnumber
DescriptionThe number of errors caused by opening files.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_error_total_count

Prometheus IDsysdig_program_file_error_total_count
Legacy IDfile.error.total.count
Metric Typecounter
Unitnumber
DescriptionThe number of errors caused by file access.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_in_bytes

Prometheus IDsysdig_program_file_in_bytes
Legacy IDfile.bytes.in
Metric Typecounter
Unitdata
DescriptionThe number of bytes read from file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_in_iops

Prometheus IDsysdig_program_file_in_iops
Legacy IDfile.iops.in
Metric Typecounter
Unitnumber
DescriptionThe number of file read operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_program_file_in_time

Prometheus IDsysdig_program_file_in_time
Legacy IDfile.time.in
Metric Typecounter
Unittime
DescriptionThe time spent in file reading.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_open_count

Prometheus IDsysdig_program_file_open_count
Legacy IDfile.open.count
Metric Typecounter
Unitnumber
DescriptionThe number of times the file has been opened.
Additional Notes

sysdig_program_file_out_bytes

Prometheus IDsysdig_program_file_out_bytes
Legacy IDfile.bytes.out
Metric Typecounter
Unitdata
DescriptionThe number of bytes written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_out_iops

Prometheus IDsysdig_program_file_out_iops
Legacy IDfile.iops.out
Metric Typecounter
Unitnumber
DescriptionThe number of file write operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_program_file_out_time

Prometheus IDsysdig_program_file_out_time
Legacy IDfile.time.out
Metric Typecounter
Unittime
DescriptionThe time spent in file writing.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_total_bytes

Prometheus IDsysdig_program_file_total_bytes
Legacy IDfile.bytes.total
Metric Typecounter
Unitdata
DescriptionThe number of bytes read from and written to file.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_file_total_iops

Prometheus IDsysdig_program_file_total_iops
Legacy IDfile.iops.total
Metric Typecounter
Unitnumber
DescriptionThe number of read and write file operations per second.
Additional NotesThis is calculated by measuring the actual number of read and write requests made by a process. Therefore, it can differ from what other tools show, which is usually based on interpolating this value from the number of bytes read and written to the file system.

sysdig_program_file_total_time

Prometheus IDsysdig_program_file_total_time
Legacy IDfile.time.total
Metric Typecounter
Unittime
DescriptionThe time spent in file I/O.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_info

Prometheus IDsysdig_program_info
Legacy IDinfo
Metric Typegauge
Unitnumber
Description
Additional Notes

sysdig_program_memory_used_bytes

Prometheus IDsysdig_program_memory_used_bytes
Legacy IDmemory.bytes.used
Metric Typegauge
Unitdata
DescriptionThe amount of physical memory currently in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, the metric can also be segmented by using ‘Segment by’ in the UI.

sysdig_program_memory_used_percent

Prometheus IDsysdig_program_memory_used_percent
Legacy IDmemory.used.percent
Metric Typegauge
Unitpercent
DescriptionThe percentage of physical memory in use.
Additional NotesBy default, this metric shows the average value for the selected scope. For instance, if you apply it to a group of machines, you will see the average value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_net_connection_in_count

Prometheus IDsysdig_program_net_connection_in_count
Legacy IDnet.connection.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of currently established client (inbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_program_net_connection_out_count

Prometheus IDsysdig_program_net_connection_out_count
Legacy IDnet.connection.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of currently established server (outbound) connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_program_net_connection_total_count

Prometheus IDsysdig_program_net_connection_total_count
Legacy IDnet.connection.count.total
Metric Typecounter
Unitnumber
DescriptionThe number of currently established connections. This value may exceed the sum of the inbound and outbound metrics since it represents client and server inter-host connections as well as internal only connections.
Additional NotesThis metric is especially useful when segmented by protocol, port or process.

sysdig_program_net_error_count

Prometheus IDsysdig_program_net_error_count
Legacy IDnet.error.count
Metric Typecounter
Unitnumber
DescriptionThe total number of network errors that occurred in a second.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_net_in_bytes

Prometheus IDsysdig_program_net_in_bytes
Legacy IDnet.bytes.in
Metric Typecounter
Unitdata
DescriptionThe number of inbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_net_out_bytes

Prometheus IDsysdig_program_net_out_bytes
Legacy IDnet.bytes.out
Metric Typecounter
Unitdata
DescriptionThe number of outbound network bytes.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_net_request_count

Prometheus IDsysdig_program_net_request_count
Legacy IDnet.request.count
Metric Typecounter
Unitnumber
DescriptionThe total number of network requests. Note, this value may exceed the sum of inbound and outbound requests, because this count includes requests over internal connections.
Additional Notes

sysdig_program_net_request_in_count

Prometheus IDsysdig_program_net_request_in_count
Legacy IDnet.request.count.in
Metric Typecounter
Unitnumber
DescriptionThe number of inbound network requests.
Additional Notes

sysdig_program_net_request_in_time

Prometheus IDsysdig_program_net_request_in_time
Legacy IDnet.request.time.in
Metric Typecounter
Unittime
DescriptionThe average time to serve an inbound request.
Additional Notes

sysdig_program_net_request_out_count

Prometheus IDsysdig_program_net_request_out_count
Legacy IDnet.request.count.out
Metric Typecounter
Unitnumber
DescriptionThe number of outbound network requests.
Additional Notes

sysdig_program_net_request_out_time

Prometheus IDsysdig_program_net_request_out_time
Legacy IDnet.request.time.out
Metric Typecounter
Unittime
DescriptionThe average time spent waiting for an outbound request.
Additional Notes

sysdig_program_net_request_time

Prometheus IDsysdig_program_net_request_time
Legacy IDnet.request.time
Metric Typecounter
Unittime
DescriptionAverage time to serve a network request.
Additional Notes

sysdig_program_net_tcp_queue_len

Prometheus IDsysdig_program_net_tcp_queue_len
Legacy IDnet.tcp.queue.len
Metric Typecounter
Unitnumber
DescriptionThe length of the TCP request queue.
Additional Notes

sysdig_program_net_total_bytes

Prometheus IDsysdig_program_net_total_bytes
Legacy IDnet.bytes.total
Metric Typecounter
Unitdata
DescriptionThe total network bytes, including inbound and outbound connections, in a program.
Additional NotesBy default, this metric shows the total value for the selected scope. For instance, if you apply it to a group of machines, you will see the total value for the whole group. However, you can easily segment the metric to see it by host, process, container, and so on. Just use ‘Segment by’ in the UI.

sysdig_program_proc_count

Prometheus IDsysdig_program_proc_count
Legacy IDproc.count
Metric Typecounter
Unitnumber
DescriptionThe number of processes on a host or container.
Additional Notes

sysdig_program_syscall_count

Prometheus IDsysdig_program_syscall_count
Legacy IDsyscall.count
Metric Typegauge
Unitnumber
DescriptionThe total number of syscalls seen.
Additional NotesSyscalls are resource intensive. This metric tracks how many have been made by a given process or container.

sysdig_program_thread_count

Prometheus IDsysdig_program_thread_count
Legacy IDthread.count
Metric Typecounter
Unitnumber
DescriptionThe total number of threads running in a program.
Additional Notes

sysdig_program_timeseries_count_appcheck

Prometheus IDsysdig_program_timeseries_count_appcheck
Legacy IDmetricCount.appCheck
Metric Typegauge
Unitnumber
DescriptionThe number of app check custom metrics.
Additional Notes

sysdig_program_timeseries_count_jmx

Prometheus IDsysdig_program_timeseries_count_jmx
Legacy IDmetricCount.jmx
Metric Typegauge
Unitnumber
DescriptionThe number of JMX custom metrics.
Additional Notes

sysdig_program_timeseries_count_prometheus

Prometheus IDsysdig_program_timeseries_count_prometheus
Legacy IDmetricCount.prometheus
Metric Typegauge
Unitnumber
DescriptionThe number of Prometheus custom metrics.
Additional Notes

sysdig_program_up

Prometheus IDsysdig_program_up
Legacy IDuptime
Metric Typegauge
Unitnumber
DescriptionThe percentage of time the selected entity was down during the visualized time sample. This can be used to determine if a machine (or a group of machines) went down.
Additional Notes

4.10.2.10 - Provider

sysdig_cloud_provider_info

Prometheus IDsysdig_cloud_provider_info
Legacy IDinfo
Metric Typegauge
Unitnumber
DescriptionThe metric will always have the value of 1.
Additional Notes

4.10.3 - Metrics in Sysdig Legacy Format

The Sysdig legacy metrics dictionary lists the default legacy metrics supported by the Sysdig product suite, as well as kube state and cloud provider metrics.

The metrics listed in this section follow the statsd-compatible Sysdig naming convention. To see a mapping between Prometheus notation and Sysdig notation, see Metrics and Label Mapping.

Overview

Each metric in the dictionary has several pieces of metadata listed to provide greater context for how the metric can be used within Sysdig products. An example layout is displayed below:

Metric Name

Metric definition. For some metrics, the equation for how the value is determined is provided.

Metadata

Definition

Metric Type

Metric type determines whether the metric is a counter or a gauge. Sysdig Monitor offers two metric types:

Counter: A metric whose value keeps increasing and builds on previous values. It records how many times something has happened, for example, a user login.

Gauge: A single numerical value that can fluctuate arbitrarily over time. Each value is an instantaneous measurement, for example, CPU usage.

Value Type

The type of value the metric can have. The possible values are:

  • Percent (%)

  • Byte

  • Date

  • Double

  • Integer (int)

  • relativeTime

  • String

Segment By

The levels within the infrastructure that the metric can be segmented at:

  • Host

  • Container

  • Process

  • Kubernetes

  • Mesos

  • Swarm

  • CloudProvider

Default Time Aggregation

The default time aggregation format for the metric.

Available Time Aggregation Formats

The time aggregation formats the metric can be aggregated by:

  • Average (Avg)

  • Rate

  • Sum

  • Minimum (Min)

  • Maximum (Max)

Default Group Aggregation

The default group aggregation format for the metric.

Available Group Aggregation Formats

The group aggregation formats the metric can be aggregated by (see the sketch after this list):

  • Average (Avg)

  • Sum

  • Minimum (Min)

  • Maximum (Max)
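
Time aggregation and group aggregation answer different questions: the former rolls each entity's samples up over the chart interval, and the latter then rolls those per-entity values up across the selected scope. The Python sketch below, on hand-written samples, only makes that ordering concrete (average over time per host, then sum across hosts); it is an illustration, not the product's aggregation engine.

# Minimal sketch: time aggregation (average per host over an interval)
# followed by group aggregation (sum across hosts).

samples = {
    "host-a": [40.0, 60.0, 50.0],   # e.g. per-minute CPU% samples
    "host-b": [10.0, 20.0, 30.0],
}

# Time aggregation: one value per host for the interval.
per_host_avg = {host: sum(vals) / len(vals) for host, vals in samples.items()}

# Group aggregation: one value for the whole group.
group_sum = sum(per_host_avg.values())

print(per_host_avg)   # {'host-a': 50.0, 'host-b': 20.0}
print(group_sum)      # 70.0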

4.10.3.1 - Agent

Note: Sysdig follows the Prometheus-compatible naming convention for both metrics and labels as opposed to the previous statsd-compatible, legacy Sysdig naming convention. However, this page still shows metrics in the legacy Sysdig naming convention. Until this page is updated, see Metrics and Label Mapping for the mapping between Sysdig legacy and Prometheus naming conventions.

dragent.analyzer

dragent is the main process in the agent that collects and collates data from multiple sources, including syscall events from the kernel in order to generate metrics. The analyzer module that runs in the dragent process does much of the work involved in generating metrics. These internal metrics are used to troubleshoot the health of the analyzer component.

Sysdig Monitor provides the following analyzer metrics:

Metrics | Type | Minimum Agent Version | Description
dragent.analyzer.processes | gauge | 0.80.0 or above | The number of processes found by the analyzer.
dragent.analyzer.threads | | | The number of threads found by the analyzer.
dragent.analyzer.threads.dropped | counter | | The number of threads not reported due to thread limits.
dragent.analyzer.containers | gauge | | The number of containers found by the analyzer.
dragent.analyzer.javaprocs | | | The number of java processes found by the analyzer.
dragent.analyzer.appchecks | | | The number of application checks reporting to the analyzer.
dragent.analyzer.mesos.autodetect | | | If the agent is configured to autodetect a Mesos environment, the value is 1; otherwise, 0.
dragent.analyzer.mesos.detected | | | If the agent actually found a Mesos environment, the value is 1; otherwise, 0.
dragent.analyzer.fp.pct100 | | | The analyzer flush CPU % (0-100).
dragent.analyzer.fl.ms | | | The analyzer flush duration (milliseconds).
dragent.analyzer.sr | | | The current sampling ratio (1 = all events, 2 = half of events analyzed, 4 = one fourth of events analyzed, and so on).
dragent.analyzer.n_evts | | | The number of events processed.
dragent.analyzer.n_drops | | | The number of events dropped.
dragent.analyzer.n_drops_buffer | | | The number of events dropped due to the buffer being full.
dragent.analyzer.n_preemptions | | | The number of driver preemptions.
dragent.analyzer.n_command_lines | | | The number of command lines collected and sent to the collector.
dragent.analyzer.command_line_cats.n_none | | |
dragent.analyzer.n_container_healthcheck_command_lines | | 0.80.1 or above | The number of command lines identified as container health checks. This metric does not change even if healthcheck command lines are not sent to the collector.

4.10.3.2 - Applications

Note: Sysdig follows the Prometheus-compatible naming convention for both metrics and labels as opposed to the previous statsd-compatible, legacy Sysdig naming convention. However, this page still shows metrics in the legacy Sysdig naming convention. Until this page is updated, see Metrics and Label Mapping for the mapping between Sysdig legacy and Prometheus naming conventions.

The metrics in this section are collected from either default or customized agent configurations for integrated applications. See also: Integrate Applications (Default App Checks).

Contents

4.10.3.2.1 - Apache Metrics

See Application Integrations for more information.

apache.conns_async_closing

The number of asynchronous closing connections.

apache.conns_async_keep_alive

The number of asynchronous keep-alive connections.

apache.conns_async_writing

The number of asynchronous write connections.

apache.conns_total

The total number of connections handled.

apache.net.bytes

The total number of bytes served.

apache.net.bytes_per_s

The number of bytes served per second.

apache.net.hits

The total number of requests performed.

apache.net.request_per_s

The number of requests performed per second.

apache.performance.busy_workers

The number of workers currently serving requests.

apache.performance.cpu_load

The percentage of CPU used.

apache.performance.idle_workers

The number of idle workers in the instance.

apache.performance.uptime

The amount of time the server has been running in seconds.

4.10.3.2.2 - Apache Kafka Metrics

Contents

4.10.3.2.2.1 - Apache Kafka Consumer Metrics

See Application Integrations for more information.

kafka.broker_offset

The current message offset value on the broker.

kafka.consumer_lag

The lag in messages between the consumer and the broker.

kafka.consumer_offset

The current message offset value on the consumer.
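
The broker and consumer offsets relate directly to lag: lag is commonly understood as the distance between the broker's latest offset and the consumer's committed offset. The short Python sketch below shows that relationship on made-up values; it illustrates the concept rather than stating how the agent computes kafka.consumer_lag.

# Minimal sketch: relating kafka.broker_offset and kafka.consumer_offset
# to consumer lag for one topic/partition (illustrative values only).

broker_offset = 10_500      # latest message offset on the broker
consumer_offset = 10_380    # last offset committed by the consumer

lag = broker_offset - consumer_offset
print(f"consumer lag: {lag} messages")  # 120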

4.10.3.2.2.2 - Apache Kafka JMX Metrics

See Application Integrations for more information.

The kafka.consumer.* and kafka.producer.* metrics are only available with JMX customization as documented in Integrate JMX Metrics from Java Virtual Machines.

kafka.consumer.bytes_consumed

The average number of bytes consumed for a specific topic per second.

kafka.consumer.bytes_in

The rate of bytes coming in to the consumer.

kafka.consumer.delayed_requests

The number of delayed consumer requests.

kafka.consumer.expires_per_second

The rate of delayed consumer request expiration.

kafka.consumer.fetch_rate

The minimum rate at which the consumer sends fetch requests to a broker.

kafka.consumer.fetch_size_avg

The average number of bytes fetched for a specific topic per request.

kafka.consumer.fetch_size_max

The maximum number of bytes fetched for a specific topic per request.

kafka.consumer.kafka_commits

The rate of offset commits to Kafka.

kafka.consumer.max_lag

The maximum consumer lag.

kafka.consumer.messages_in

The rate of consumer message consumption.

kafka.consumer.records_consumed

The average number of records consumed per second for a specific topic.

kafka.consumer.records_per_request_avg

The average number of records in each request for a specific topic.

kafka.consumer.zookeeper_commits

The rate of offset commits to ZooKeeper.

kafka.expires_sec

The rate of delayed producer request expiration.

kafka.follower.expires_per_second

The rate of request expiration on followers.

kafka.log.flush_rate

The log flush rate.

kafka.messages_in

The incoming message rate.

kafka.net.bytes_in

The incoming byte rate.

kafka.net.bytes_out

The outgoing byte rate.

kafka.net.bytes_rejected

The rejected byte rate.

kafka.producer.available_buffer_bytes

The total amount of buffer memory that is not being used, either unallocated or in the free list.

kafka.producer.batch_size_avg

The average number of bytes sent per partition per request.

kafka.producer.batch_size_max

The maximum number of bytes sent per partition per request.

kafka.producer.buffer_bytes_total

The maximum amount of buffer memory the client can use.

kafka.producer.bufferpool_wait_time

The fraction of time an appender waits for space allocation.

kafka.producer.bytes_out

The rate of bytes going out for the producer.

kafka.producer.compression_rate

The average compression rate of record batches for a topic.

kafka.producer.compression_rate_avg

The average compression rate of record batches.

kafka.producer.delayed_requests

The number of producer requests delayed.

kafka.producer.expires_per_seconds

The rate of producer request expiration.

kafka.producer.io_wait

The producer I/O wait time.

kafka.producer.message_rate

The producer message rate.

kafka.producer.metadata_age

The age of the current producer metadata being used, in seconds.

kafka.producer.record_error_rate

The average number of retried record sends for a topic per second.

kafka.producer.record_queue_time_avg

The average time that record batches spent in the record accumulator, in milliseconds.

kafka.producer.record_queue_time_max

The maximum amount of time record batches can spend in the record accumulator, in milliseconds.

kafka.producer.record_retry_rate

The average number of retried record sends for a topic per second.

kafka.producer.record_send_rate

The average number of records sent per second for a topic.

kafka.producer.record_size_avg

The average record size.

kafka.producer.record_size_max

The maximum record size.

kafka.producer.records_per_request

The average number of records sent per request.

kafka.producer.request_latency_avg

The average request latency of the producer.

kafka.producer.request_latency_max

The maximum request latency in milliseconds.

kafka.producer.request_rate

The number of producer requests per second.

kafka.producer.requests_in_flight

The current number of in-flight requests awaiting a response.

kafka.producer.response_rate

The number of producer responses per second.

kafka.producer.throttle_time_avg

The average time a request was throttled by a broker, in milliseconds.

kafka.producer.throttle_time_max

The maximum time a request was throttled by a broker, in milliseconds.

kafka.producer.waiting_threads

The number of user threads blocked waiting for buffer memory to enqueue their records.

kafka.replication.isr_expands

The rate of replicas joining the ISR pool.

kafka.replication.isr_shrinks

The rate of replicas leaving the ISR pool.

kafka.replication.leader_elections

The leader election rate.

kafka.replication.unclean_leader_elections

The unclean leader election rate.

kafka.replication.under_replicated_partitions

The number of under-replicated partitions.

kafka.request.fetch.failed

The number of client fetch request failures.

kafka.request.fetch.failed_per_second

The rate of client fetch request failures per second.

kafka.request.fetch.time.99percentile

The time for fetch requests for the 99th percentile.

kafka.request.fetch.time.avg

The average time per fetch request.

kafka.request.handler.avg.idle.pct

The average fraction of time the request handler threads are idle.

kafka.request.metadata.time.99percentile

The time for metadata requests for the 99th percentile.

kafka.request.metadata.time.avg

The average time for a metadata request.

kafka.request.offsets.time.99percentile

The time for offset requests for the 99th percentile.

kafka.request.offsets.time.avg

The average time for an offset request.

kafka.request.produce.failed

The number of failed produce requests.

kafka.request.produce.failed_per_second

The rate of failed produce requests per second.

kafka.request.produce.time.99percentile

The time for produce requests for the 99th percentile.

kafka.request.produce.time.avg

The average time for a produce request.

kafka.request.update_metadata.time.99percentile

The time for update metadata requests for the 99th percentile.

kafka.request.update_metadata.time.avg

The average time for a request to update metadata.

4.10.3.2.3 - Consul Metrics

4.10.3.2.3.1 - Base Consul Metrics

See Application Integrations for more information.

consul.catalog.nodes_critical

Number of nodes with service status `critical` from those registered.

consul.catalog.nodes_passing

Number of nodes with service status `passing` from those registered.

consul.catalog.nodes_up

Number of nodes.

consul.catalog.nodes_warning

Number of nodes with service status `warning` from those registered.

consul.catalog.services_critical

Total critical services on nodes.

consul.catalog.services_passing

Total passing services on nodes.

consul.catalog.services_up

Total services registered on nodes.

consul.catalog.services_warning

Total warning services on nodes.

consul.catalog.total_nodes

Number of nodes registered in the consul cluster.

consul.net.node.latency.max

Maximum latency from this node to all others.

consul.net.node.latency.median

Median latency from this node to all others.

consul.net.node.latency.min

Minimum latency from this node to all others.

consul.net.node.latency.p25

p25 latency from this node to all others.

consul.net.node.latency.p75

p75 latency from this node to all others.

consul.net.node.latency.p90

p90 latency from this node to all others.

consul.net.node.latency.p95

p95 latency from this node to all others.

consul.net.node.latency.p99

p99 latency from this node to all others.

consul.peers

Number of peers in the peer set.
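
The `consul.catalog.*` node counts above can be reproduced from Consul's HTTP health API. Below is a minimal sketch in Python, assuming a local agent on the default port 8500 and using the `/v1/health/state/any` endpoint; it is illustrative only, not the Sysdig agent's implementation. A node is bucketed by its worst check status, so a single `critical` check marks the whole node critical.

```python
import json
import urllib.request
from collections import Counter

CONSUL = "http://127.0.0.1:8500"          # assumption: local agent, default port
RANK = {"passing": 0, "warning": 1, "critical": 2}

# /v1/health/state/any lists every registered health check with its Status.
with urllib.request.urlopen(f"{CONSUL}/v1/health/state/any") as resp:
    checks = json.load(resp)

# A node's overall state is its worst check status, which is how the
# nodes_passing / nodes_warning / nodes_critical buckets are split.
worst = {}
for check in checks:
    node, status = check["Node"], check["Status"]
    if RANK.get(status, 0) >= RANK.get(worst.get(node, "passing"), 0):
        worst[node] = status

counts = Counter(worst.values())
print("nodes_passing :", counts["passing"])
print("nodes_warning :", counts["warning"])
print("nodes_critical:", counts["critical"])
print("total_nodes   :", len(worst))
```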

4.10.3.2.3.2 - Consul StatsD Metrics

See Application Integrations for more information.

consul.memberlist.msg.suspect

Number of times an agent suspects another as failed while probing during gossip protocol.

consul.raft.apply

Number of raft transactions occurring.

consul.raft.commitTime.95percentile

The p95 time it takes to commit a new entry to the raft log on the leader.

consul.raft.commitTime.avg

The average time it takes to commit a new entry to the raft log on the leader.

consul.raft.commitTime.count

The number of samples of raft.commitTime.

consul.raft.commitTime.max

The max time it takes to commit a new entry to the raft log on the leader.

consul.raft.commitTime.median

The median time it takes to commit a new entry to the raft log on the leader.

consul.raft.leader.dispatchLog.95percentile

The p95 time it takes for the leader to write log entries to disk.

consul.raft.leader.dispatchLog.avg

The average time it takes for the leader to write log entries to disk.

consul.raft.leader.dispatchLog.count

The number of samples of raft.leader.dispatchLog.

consul.raft.leader.dispatchLog.max

The max time it takes for the leader to write log entries to disk.

consul.raft.leader.dispatchLog.median

The median time it takes for the leader to write log entries to disk.

consul.raft.leader.lastContact.95percentile

P95 time elapsed since the leader was last able to check its lease with followers.

consul.raft.leader.lastContact.avg

Average time elapsed since the leader was last able to check its lease with followers.

consul.raft.leader.lastContact.count

The number of samples of raft.leader.lastContact.

consul.raft.leader.lastContact.max

Max time elapsed since the leader was last able to check its lease with followers.

consul.raft.leader.lastContact.median

Median time elapsed since the leader was last able to check its lease with followers.

consul.raft.state.candidate

The number of initiated leader elections.

consul.raft.state.leader

Number of completed leader elections.

consul.runtime.alloc_bytes

Current bytes allocated by the Consul process.

consul.runtime.free_count

Cumulative count of heap objects freed.

consul.runtime.heap_objects

Number of objects allocated on the heap.

consul.runtime.malloc_count

Cumulative count of heap objects allocated.

consul.runtime.num_goroutines

Number of running goroutines.

consul.runtime.sys_bytes

Total size of the virtual address space reserved by the Go runtime.

consul.runtime.total_gc_pause_ns

Cumulative nanoseconds in GC stop-the-world pauses since Consul started.

consul.runtime.total_gc_runs

Number of completed GC cycles.

consul.serf.events

Incremented when an agent processes a serf event.

consul.serf.member.flap

Number of times an agent is marked dead and then quickly recovers.

consul.serf.member.join

Incremented when an agent processes a join event.

4.10.3.2.4 - Couchbase Metrics

See Application Integrations for more information.

couchbase.by_bucket.avg_bg_wait_time

The average background wait time.

couchbase.by_bucket.avg_disk_commit_time

The average disk commit time.

couchbase.by_bucket.avg_disk_update_time

The average disk update time.

couchbase.by_bucket.bg_wait_total

The total background wait time.

couchbase.by_bucket.bytes_read

The number of bytes read.

couchbase.by_bucket.bytes_written

The number of bytes written.

couchbase.by_bucket.cas_badval

The number of compare and swap bad values.

couchbase.by_bucket.cas_hits

The number of compare and swap hits.

couchbase.by_bucket.cas_misses

The number of compare and swap misses.

couchbase.by_bucket.cmd_get

The number of get operations.

couchbase.by_bucket.cmd_set

The number of set operations.

couchbase.by_bucket.couch_docs_actual_disk_size

The size of the couchbase docs on disk.

couchbase.by_bucket.couch_docs_data_size

The data size of the couchbase docs.

couchbase.by_bucket.couch_docs_disk_size

Couch docs total size in bytes.

couchbase.by_bucket.couch_docs_fragmentation

The percentage of couchbase docs fragmentation.

couchbase.by_bucket.couch_spatial_data_size

The size of object data for spatial views.

couchbase.by_bucket.couch_spatial_disk_size

The amount of disk space occupied by spatial views.

couchbase.by_bucket.couch_spatial_ops

Spatial operations.

couchbase.by_bucket.couch_total_disk_size

The total disk size for couchbase.

couchbase.by_bucket.couch_views_data_size

The size of object data for views.

couchbase.by_bucket.couch_views_disk_size

The amount of disk space occupied by views.

couchbase.by_bucket.couch_views_fragmentation

The view fragmentation.

couchbase.by_bucket.couch_views_ops

View operations.

couchbase.by_bucket.cpu_idle_ms

CPU idle milliseconds.

couchbase.by_bucket.cpu_utilization_rate

CPU utilization percentage.

couchbase.by_bucket.curr_connections

Current bucket connections.

couchbase.by_bucket.curr_items

Number of active items in memory.

couchbase.by_bucket.curr_items_tot

Total number of items.

couchbase.by_bucket.decr_hits

Decrement hits.

couchbase.by_bucket.decr_misses

Decrement misses.

couchbase.by_bucket.delete_hits

Delete hits.

couchbase.by_bucket.delete_misses

Delete misses.

couchbase.by_bucket.disk_commit_count

Disk commits.

couchbase.by_bucket.disk_update_count

Disk updates.

couchbase.by_bucket.disk_write_queue

Disk write queue depth.

couchbase.by_bucket.ep_bg_fetched

Disk reads per second.

couchbase.by_bucket.ep_cache_miss_rate

Cache miss rate.

couchbase.by_bucket.ep_cache_miss_ratio

Cache miss ratio.

couchbase.by_bucket.ep_dcp_2i_backoff

Number of backoffs for indexes DCP connections.

couchbase.by_bucket.ep_dcp_2i_count

Number of indexes DCP connections.

couchbase.by_bucket.ep_dcp_2i_items_remaining

Number of indexes items remaining to be sent.

couchbase.by_bucket.ep_dcp_2i_items_sent

Number of indexes items sent.

couchbase.by_bucket.ep_dcp_2i_producer_count

Number of indexes producers.

couchbase.by_bucket.ep_dcp_2i_total_bytes

Number of bytes per second being sent for indexes DCP connections.

couchbase.by_bucket.ep_dcp_fts_backoff

Number of backoffs for fts DCP connections.

couchbase.by_bucket.ep_dcp_fts_count

Number of fts DCP connections.

couchbase.by_bucket.ep_dcp_fts_items_remaining

Number of fts items remaining to be sent.

couchbase.by_bucket.ep_dcp_fts_items_sent

Number of fts items sent.

couchbase.by_bucket.ep_dcp_fts_producer_count

Number of fts producers.

couchbase.by_bucket.ep_dcp_fts_total_bytes

Number of bytes per second being sent for fts DCP connections.

couchbase.by_bucket.ep_dcp_other_backoff

Number of backoffs for other DCP connections.

couchbase.by_bucket.ep_dcp_other_count

Number of other DCP connections.

couchbase.by_bucket.ep_dcp_other_items_remaining

Number of other items remaining to be sent.

couchbase.by_bucket.ep_dcp_other_items_sent

Number of other items sent.

couchbase.by_bucket.ep_dcp_other_producer_count

Number of other producers.

couchbase.by_bucket.ep_dcp_other_total_bytes

Number of bytes per second being sent for other DCP connections.

couchbase.by_bucket.ep_dcp_replica_backoff

Number of backoffs for replica DCP connections.

couchbase.by_bucket.ep_dcp_replica_count

Number of replica DCP connections.

couchbase.by_bucket.ep_dcp_replica_items_remaining

Number of replica items remaining to be sent.

couchbase.by_bucket.ep_dcp_replica_items_sent

Number of replica items sent.

couchbase.by_bucket.ep_dcp_replica_producer_count

Number of replica producers.

couchbase.by_bucket.ep_dcp_replica_total_bytes

Number of bytes per second being sent for replica DCP connections.

couchbase.by_bucket.ep_dcp_views_backoff

Number of backoffs for views DCP connections.

couchbase.by_bucket.ep_dcp_views_count

Number of views DCP connections.

couchbase.by_bucket.ep_dcp_views_items_remaining

Number of views items remaining to be sent.

couchbase.by_bucket.ep_dcp_views_items_sent

Number of views items sent.

couchbase.by_bucket.ep_dcp_views_producer_count

Number of views producers.

couchbase.by_bucket.ep_dcp_views_total_bytes

Number of bytes per second being sent for views DCP connections.

couchbase.by_bucket.ep_dcp_xdcr_backoff

Number of backoffs for xdcr DCP connections.

couchbase.by_bucket.ep_dcp_xdcr_count

Number of xdcr DCP connections.

couchbase.by_bucket.ep_dcp_xdcr_items_remaining

Number of xdcr items remaining to be sent.

couchbase.by_bucket.ep_dcp_xdcr_items_sent

Number of xdcr items sent.

couchbase.by_bucket.ep_dcp_xdcr_producer_count

Number of xdcr producers.

couchbase.by_bucket.ep_dcp_xdcr_total_bytes

Number of bytes per second being sent for xdcr DCP connections.

couchbase.by_bucket.ep_diskqueue_drain

Total Drained items on disk queue.

couchbase.by_bucket.ep_diskqueue_fill

Total enqueued items on disk queue.

couchbase.by_bucket.ep_diskqueue_items

Total number of items waiting to be written to disk.

couchbase.by_bucket.ep_flusher_todo

Number of items currently being written.

couchbase.by_bucket.ep_item_commit_failed

Number of times a transaction failed to commit due to storage errors.

couchbase.by_bucket.ep_kv_size

Total amount of user data cached in RAM in this bucket.

couchbase.by_bucket.ep_max_size

The maximum amount of memory this bucket can use.

couchbase.by_bucket.ep_mem_high_wat

Memory usage high water mark for auto-evictions.

couchbase.by_bucket.ep_mem_low_wat

Memory usage low water mark for auto-evictions.

couchbase.by_bucket.ep_meta_data_memory

Total amount of item metadata consuming RAM in this bucket.

couchbase.by_bucket.ep_num_non_resident

Number of non-resident items.

couchbase.by_bucket.ep_num_ops_del_meta

Number of delete operations per second for this bucket as the target for XDCR.

couchbase.by_bucket.ep_num_ops_del_ret_meta

Number of delRetMeta operations per second for this bucket as the target for XDCR.

couchbase.by_bucket.ep_num_ops_get_meta

Number of read operations per second for this bucket as the target for XDCR.

couchbase.by_bucket.ep_num_ops_set_meta

Number of set operations per second for this bucket as the target for XDCR.

couchbase.by_bucket.ep_num_ops_set_ret_meta

Number of setRetMeta operations per second for this bucket as the target for XDCR.

couchbase.by_bucket.ep_num_value_ejects

Number of times item values got ejected from memory to disk.

couchbase.by_bucket.ep_oom_errors

Number of times unrecoverable OOMs happened while processing operations.

couchbase.by_bucket.ep_ops_create

Create operations.

couchbase.by_bucket.ep_ops_update

Update operations.

couchbase.by_bucket.ep_overhead

Extra memory used by transient data like persistence queues or checkpoints.

couchbase.by_bucket.ep_queue_size

Number of items queued for storage.

couchbase.by_bucket.ep_resident_items_rate

Number of resident items.

couchbase.by_bucket.ep_tap_replica_queue_drain

Total drained items in the replica queue.

couchbase.by_bucket.ep_tap_total_queue_drain

Total drained items in the queue.

couchbase.by_bucket.ep_tap_total_queue_fill

Total enqueued items in the queue.

couchbase.by_bucket.ep_tap_total_total_backlog_size

Number of remaining items for replication.

couchbase.by_bucket.ep_tmp_oom_errors

Number of times recoverable OOMs happened while processing operations.

couchbase.by_bucket.ep_vb_total

Total number of vBuckets for this bucket.

couchbase.by_bucket.evictions

Number of evictions.

couchbase.by_bucket.get_hits

Number of get hits.

couchbase.by_bucket.get_misses

Number of get misses.

couchbase.by_bucket.hibernated_requests

Number of streaming requests now idle.

couchbase.by_bucket.hibernated_waked

Rate of streaming request wakeups.

couchbase.by_bucket.hit_ratio

Hit ratio.

couchbase.by_bucket.incr_hits

Number of increment hits.

couchbase.by_bucket.incr_misses

Number of increment misses.

couchbase.by_bucket.mem_actual_free

Free memory.

couchbase.by_bucket.mem_actual_used

Used memory.

couchbase.by_bucket.mem_free

Free memory.

couchbase.by_bucket.mem_total

Total available memory.

couchbase.by_bucket.mem_used (deprecated)

Engine’s total memory usage.

couchbase.by_bucket.mem_used_sys

System memory usage.

couchbase.by_bucket.misses

Total number of misses.

couchbase.by_bucket.ops

Total number of operations.

couchbase.by_bucket.page_faults

Number of page faults.

couchbase.by_bucket.replication_docs_rep_queue

couchbase.by_bucket.replication_meta_latency_aggr

couchbase.by_bucket.rest_requests

Number of HTTP requests.

couchbase.by_bucket.swap_total

Total amount of swap available.

couchbase.by_bucket.swap_used

Amount of swap used.

couchbase.by_bucket.vb_active_eject

Number of items per second being ejected to disk from active vBuckets.

couchbase.by_bucket.vb_active_itm_memory

Amount of active user data cached in RAM in this bucket.

couchbase.by_bucket.vb_active_meta_data_memory

Amount of active item metadata consuming RAM in this bucket.

couchbase.by_bucket.vb_active_num

Number of active items.

couchbase.by_bucket.vb_active_num_non_resident

Number of non resident vBuckets in the active state for this bucket.

couchbase.by_bucket.vb_active_ops_create

New items per second being inserted into active vBuckets in this bucket.

couchbase.by_bucket.vb_active_ops_update

Number of items updated on active vBucket per second for this bucket.

couchbase.by_bucket.vb_active_queue_age

Sum of disk queue item age in milliseconds.

couchbase.by_bucket.vb_active_queue_drain

Total drained items in the queue.

couchbase.by_bucket.vb_active_queue_fill

Number of active items per second being put on the active item disk queue.

couchbase.by_bucket.vb_active_queue_size

Number of active items in the queue.

couchbase.by_bucket.vb_active_resident_items_ratio

Number of resident items.

couchbase.by_bucket.vb_avg_active_queue_age

Average age in seconds of active items in the active item queue.

couchbase.by_bucket.vb_avg_pending_queue_age

Average age in seconds of pending items in the pending item queue.

couchbase.by_bucket.vb_avg_replica_queue_age

Average age in seconds of replica items in the replica item queue.

couchbase.by_bucket.vb_avg_total_queue_age

Average age of items in the queue.

couchbase.by_bucket.vb_pending_curr_items

Number of items in pending vBuckets.

couchbase.by_bucket.vb_pending_eject

Number of items per second being ejected to disk from pending vBuckets.

couchbase.by_bucket.vb_pending_itm_memory

Amount of pending user data cached in RAM in this bucket.

couchbase.by_bucket.vb_pending_meta_data_memory

Amount of pending item metadata consuming RAM in this bucket.

couchbase.by_bucket.vb_pending_num

Number of pending items.

couchbase.by_bucket.vb_pending_num_non_resident

Number of non resident vBuckets in the pending state for this bucket.

couchbase.by_bucket.vb_pending_ops_create

Number of pending create operations.

couchbase.by_bucket.vb_pending_ops_update

Number of items updated on pending vBucket per second for this bucket.

couchbase.by_bucket.vb_pending_queue_age

Sum of disk pending queue item age in milliseconds.

couchbase.by_bucket.vb_pending_queue_drain

Total drained pending items in the queue.

couchbase.by_bucket.vb_pending_queue_fill

Total enqueued pending items on disk queue.

couchbase.by_bucket.vb_pending_queue_size

Number of pending items in the queue.

couchbase.by_bucket.vb_pending_resident_items_ratio

Number of resident pending items.

couchbase.by_bucket.vb_replica_curr_items

Number of in memory items.

couchbase.by_bucket.vb_replica_eject

Number of items per second being ejected to disk from replica vBuckets.

couchbase.by_bucket.vb_replica_itm_memory

Amount of replica user data cached in RAM in this bucket.

couchbase.by_bucket.vb_replica_meta_data_memory

Total metadata memory.

couchbase.by_bucket.vb_replica_num

Number of replica vBuckets.

couchbase.by_bucket.vb_replica_num_non_resident

Number of non resident vBuckets in the replica state for this bucket.

couchbase.by_bucket.vb_replica_ops_create

Number of replica create operations.

couchbase.by_bucket.vb_replica_ops_update

Number of items updated on replica vBucket per second for this bucket.

couchbase.by_bucket.vb_replica_queue_age

Sum of disk replica queue item age in milliseconds.

couchbase.by_bucket.vb_replica_queue_drain

Total drained replica items in the queue.

couchbase.by_bucket.vb_replica_queue_fill

Total enqueued replica items on disk queue.

couchbase.by_bucket.vb_replica_queue_size

Replica items in disk queue.

couchbase.by_bucket.vb_replica_resident_items_ratio

Number of resident replica items.

couchbase.by_bucket.vb_total_queue_age

Sum of disk queue item age in milliseconds.

couchbase.by_bucket.xdc_ops

Number of cross-datacenter replication operations.

couchbase.by_node.couch_docs_actual_disk_size

Couch docs total size on disk in bytes.

couchbase.by_node.couch_docs_data_size

Couch docs data size in bytes.

couchbase.by_node.couch_views_actual_disk_size

Couch views total size on disk in bytes.

couchbase.by_node.couch_views_data_size

Couch views data size on disk in bytes.

couchbase.by_node.curr_items

Number of active items in memory.

couchbase.by_node.curr_items_tot

Total number of items.

couchbase.by_node.vb_replica_curr_items

Number of in memory items.

couchbase.hdd.free

Free hard disk space.

couchbase.hdd.quota_total

Hard disk quota.

couchbase.hdd.total

Total hard disk space.

couchbase.hdd.used

Used hard disk space.

couchbase.hdd.used_by_data

Hard disk used for data.

couchbase.query.cores

couchbase.query.cpu_sys_percent

couchbase.query.cpu_user_percent

couchbase.query.gc_num

couchbase.query.gc_pause_percent

couchbase.query.gc_pause_time

couchbase.query.memory_system

couchbase.query.memory_total

couchbase.query.memory_usage

couchbase.query.request_active_count

couchbase.query.request_completed_count

couchbase.query.request_per_sec_15min

couchbase.query.request_per_sec_1min

couchbase.query.request_per_sec_5min

couchbase.query.request_prepared_percent

couchbase.query.request_time_80percentile

couchbase.query.request_time_95percentile

couchbase.query.request_time_99percentile

couchbase.query.request_time_mean

couchbase.query.request_time_median

couchbase.query.total_threads

couchbase.ram.quota_total

RAM quota.

couchbase.ram.total

The total RAM available.

couchbase.ram.used

The amount of RAM in use.

couchbase.ram.used_by_data

The amount of RAM used for data.
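
The cluster-wide `couchbase.ram.*` and `couchbase.hdd.*` values correspond to the `storageTotals` block returned by the Couchbase REST API. A minimal sketch follows, assuming a node reachable on the default admin port 8091 with placeholder credentials; it is illustrative only, not the agent's implementation.

```python
import base64
import json
import urllib.request

# Assumptions: default Couchbase admin port and placeholder credentials.
URL = "http://127.0.0.1:8091/pools/default"
CREDS = base64.b64encode(b"Administrator:password").decode()

req = urllib.request.Request(URL, headers={"Authorization": f"Basic {CREDS}"})
with urllib.request.urlopen(req) as resp:
    pool = json.load(resp)

ram = pool["storageTotals"]["ram"]
hdd = pool["storageTotals"]["hdd"]

# These keys back the couchbase.ram.* and couchbase.hdd.* metrics.
print("ram.quota_total :", ram["quotaTotal"])
print("ram.total       :", ram["total"])
print("ram.used        :", ram["used"])
print("ram.used_by_data:", ram["usedByData"])
print("hdd.free        :", hdd["free"])
print("hdd.used        :", hdd["used"])
```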

4.10.3.2.5 - Elasticsearch Metrics

See Application Integrations for more information.

All Elasticsearch metrics have the type gauge.

elasticsearch.active_primary_shards

The number of active primary shards in the cluster.

elasticsearch.active_shards

The number of active shards in the cluster.

elasticsearch.breakers.fielddata.estimated_size_in_bytes

The estimated size in bytes of the field data circuit breaker.

elasticsearch.breakers.fielddata.overhead

The constant multiplier for byte estimations of the field data circuit breaker.

elasticsearch.breakers.fielddata.tripped

The number of times the field data circuit breaker has tripped.

elasticsearch.breakers.parent.estimated_size_in_bytes

The estimated size in bytes of the parent circuit breaker.

elasticsearch.breakers.parent.overhead

The constant multiplier for byte estimations of the parent circuit breaker.

elasticsearch.breakers.parent.tripped

The number of times the parent circuit breaker has tripped.

elasticsearch.breakers.request.estimated_size_in_bytes

The estimated size in bytes of the request circuit breaker.

elasticsearch.breakers.request.overhead

The constant multiplier for byte estimations of the request circuit breaker.

elasticsearch.breakers.request.tripped

The number of times the request circuit breaker has tripped.

elasticsearch.breakers.inflight_requests.tripped

The number of times the inflight circuit breaker has tripped.

elasticsearch.breakers.inflight_requests.overhead

The constant multiplier for byte estimations of the inflight circuit breaker.

elasticsearch.breakers.inflight_requests.estimated_size_in_bytes

The estimated size in bytes of the inflight circuit breaker.

elasticsearch.cache.field.evictions

The total number of evictions from the field data cache.

elasticsearch.cache.field.size

The size of the field cache.

elasticsearch.cache.filter.count

The number of items in the filter cache.

elasticsearch.cache.filter.evictions

The total number of evictions from the filter cache.

elasticsearch.cache.filter.size

The size of the filter cache.

elasticsearch.cluster_status

The Elasticsearch cluster health as a number: red = 0, yellow = 1, green = 2.
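
This numeric encoding is a direct mapping of the `status` string returned by the `_cluster/health` API. A minimal sketch, assuming a node on the default port 9200 (illustrative only, not the agent's implementation):

```python
import json
import urllib.request

# Assumption: an Elasticsearch node listening on the default port 9200.
with urllib.request.urlopen("http://127.0.0.1:9200/_cluster/health") as resp:
    health = json.load(resp)

# red = 0, yellow = 1, green = 2, matching elasticsearch.cluster_status.
STATUS_VALUE = {"red": 0, "yellow": 1, "green": 2}
print("cluster_status:", STATUS_VALUE[health["status"]])
```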

elasticsearch.docs.count

The total number of documents in the cluster across all shards.

elasticsearch.docs.deleted

The total number of documents deleted from the cluster across all shards.

elasticsearch.fielddata.evictions

The total number of evictions from the fielddata cache.

elasticsearch.fielddata.size

The size of the fielddata cache.

elasticsearch.flush.total

The total number of index flushes to disk since start.

elasticsearch.flush.total.time

The total time spent flushing the index to disk.

elasticsearch.fs.total.available_in_bytes

The total number of bytes available to this Java virtual machine on this file store.

elasticsearch.fs.total.disk_io_op

The total I/O operations on the file store.

elasticsearch.fs.total.disk_io_size_in_bytes

Total bytes used for all I/O operations on the file store.

elasticsearch.fs.total.disk_read_size_in_bytes

The total bytes read from the file store.

elasticsearch.fs.total.disk_reads

The total number of reads from the file store.

elasticsearch.fs.total.disk_write_size_in_bytes

The total bytes written to the file store.

elasticsearch.fs.total.disk_writes

The total number of writes to the file store.

elasticsearch.fs.total.free_in_bytes

The total number of unallocated bytes in the file store.

elasticsearch.fs.total.total_in_bytes

The total size in bytes of the file store.

elasticsearch.get.current

The number of get requests currently running.

elasticsearch.get.exists.time

The total time spent on get requests where the document existed.

elasticsearch.get.exists.total

The total number of get requests where the document existed.

elasticsearch.get.missing.time

The total time spent on get requests where the document was missing.

elasticsearch.get.missing.total

The total number of get requests where the document was missing.

elasticsearch.get.time

The total time spent on get requests.

elasticsearch.get.total

The total number of get requests.

elasticsearch.http.current_open

The number of current open HTTP connections.

elasticsearch.http.total_opened

The total number of opened HTTP connections.

elasticsearch.id_cache.size

The size of the id cache.

elasticsearch.indexing.delete.current

The number of documents currently being deleted from an index.

elasticsearch.indexing.delete.time

The total time spent deleting documents from an index.

elasticsearch.indexing.delete.total

The total number of documents deleted from an index.

elasticsearch.indexing.index.current

The number of documents currently being indexed to an index.

elasticsearch.indexing.index.time

The total time spent indexing documents to an index.

elasticsearch.indexing.index.total

The total number of documents indexed to an index.

elasticsearch.indices.count

The number of indices in the cluster.

elasticsearch.indices.indexing.index_failed

The number of failed indexing operations.

elasticsearch.indices.indexing.throttle_time

The total time indexing waited due to throttling.

elasticsearch.indices.query_cache.evictions

The number of query cache evictions.

elasticsearch.indices.query_cache.hit_count

The number of query cache hits.

elasticsearch.indices.query_cache.memory_size_in_bytes

The memory used by the query cache.

elasticsearch.indices.query_cache.miss_count

The number of query cache misses.

elasticsearch.indices.recovery.current_as_source

The number of ongoing recoveries for which a shard serves as a source.

elasticsearch.indices.recovery.current_as_target

The number of ongoing recoveries for which a shard serves as a target.

elasticsearch.indices.recovery.throttle_time

The total time recoveries waited due to throttling.

elasticsearch.indices.request_cache.evictions

The number of request cache evictions.

elasticsearch.indices.request_cache.hit_count

The number of request cache hits.

elasticsearch.indices.request_cache.memory_size_in_bytes

The memory used by the request cache.

elasticsearch.indices.request_cache.miss_count

The number of request cache misses.

elasticsearch.indices.segments.count

The number of segments in an index shard.

elasticsearch.indices.segments.doc_values_memory_in_bytes

The memory used by doc values.

elasticsearch.indices.segments.fixed_bit_set_memory_in_bytes

The memory used by fixed bit set.

elasticsearch.indices.segments.index_writer_max_memory_in_bytes

The maximum memory used by the index writer.

elasticsearch.indices.segments.index_writer_memory_in_bytes

The memory used by the index writer.

elasticsearch.indices.segments.memory_in_bytes

The memory used by index segments.

elasticsearch.indices.segments.norms_memory_in_bytes

The memory used by norms.

elasticsearch.indices.segments.stored_fields_memory_in_bytes

The memory used by stored fields.

elasticsearch.indices.segments.term_vectors_memory_in_bytes

The memory used by term vectors.

elasticsearch.indices.segments.terms_memory_in_bytes

The memory used by terms.

elasticsearch.indices.segments.version_map_memory_in_bytes

The memory used by the segment version map.

elasticsearch.indices.translog.operations

The number of operations in the transaction log.

elasticsearch.indices.translog.size_in_bytes

The size of the transaction log.

elasticsearch.initializing_shards

The number of shards that are currently initializing.

elasticsearch.merges.current

The number of currently active segment merges.

elasticsearch.merges.current.docs

The number of documents across segments currently being merged.

elasticsearch.merges.current.size

The size of the segments currently being merged.

elasticsearch.merges.total

The total number of segment merges.

elasticsearch.merges.total.docs

The total number of documents across all merged segments.

elasticsearch.merges.total.size

The total size of all merged segments.

elasticsearch.merges.total.time

The total time spent on segment merging.

elasticsearch.number_of_data_nodes

The number of data nodes in the cluster.

elasticsearch.number_of_nodes

The total number of nodes in the cluster.

elasticsearch.pending_tasks_priority_high

The number of high priority pending tasks.

elasticsearch.pending_tasks_priority_urgent

The number of urgent priority pending tasks.

elasticsearch.pending_tasks_time_in_queue

The average time spent by tasks in the queue.

elasticsearch.pending_tasks_total

The total number of pending tasks.

elasticsearch.process.open_fd

The number of opened file descriptors associated with the current process, or -1 if not supported.

elasticsearch.refresh.total

The total number of index refreshes.

elasticsearch.refresh.total.time

The total time spent on index refreshes.

elasticsearch.relocating_shards

The number of shards that are relocating from one node to another.

elasticsearch.search.fetch.current

The number of search fetches currently running.

elasticsearch.search.fetch.open_contexts

The number of active searches.

elasticsearch.search.fetch.time

The total time spent on the search fetch.

elasticsearch.search.fetch.total

The total number of search fetches.

elasticsearch.search.query.current

The number of currently active queries.

elasticsearch.search.query.time

The total time spent on queries.

elasticsearch.search.query.total

The total number of queries.

elasticsearch.store.size

The total size in bytes of the store.

elasticsearch.thread_pool.bulk.active

The number of active threads in the bulk pool.

elasticsearch.thread_pool.bulk.queue

The number of queued threads in the bulk pool.

elasticsearch.thread_pool.bulk.threads

The total number of threads in the bulk pool.

elasticsearch.thread_pool.bulk.rejected

The number of rejected threads in the bulk pool.

elasticsearch.thread_pool.fetch_shard_started.active

The number of active threads in the fetch shard started pool.

elasticsearch.thread_pool.fetch_shard_started.threads

The total number of threads in the fetch shard started pool.

elasticsearch.thread_pool.fetch_shard_started.queue

The number of queued threads in the fetch shard started pool.

elasticsearch.thread_pool.fetch_shard_started.rejected

The number of rejected threads in the fetch shard started pool.

elasticsearch.thread_pool.fetch_shard_store.active

The number of active threads in the fetch shard store pool.

elasticsearch.thread_pool.fetch_shard_store.threads

The total number of threads in the fetch shard store pool.

elasticsearch.thread_pool.fetch_shard_store.queue

The number of queued threads in the fetch shard store pool.

elasticsearch.thread_pool.fetch_shard_store.rejected

The number of rejected threads in the fetch shard store pool.

elasticsearch.thread_pool.flush.active

The number of active threads in the flush queue.

elasticsearch.thread_pool.flush.queue

The number of queued threads in the flush pool.

elasticsearch.thread_pool.flush.threads

The total number of threads in the flush pool.

elasticsearch.thread_pool.flush.rejected

The number of rejected threads in the flush pool.

elasticsearch.thread_pool.force_merge.active

The number of active threads for force merge operations.

elasticsearch.thread_pool.force_merge.threads

The total number of threads for force merge operations.

elasticsearch.thread_pool.force_merge.queue

The number of queued threads for force merge operations.

elasticsearch.thread_pool.force_merge.rejected

The number of rejected threads for force merge operations.

elasticsearch.thread_pool.generic.active

The number of active threads in the generic pool.

elasticsearch.thread_pool.generic.queue

The number of queued threads in the generic pool.

elasticsearch.thread_pool.generic.threads

The total number of threads in the generic pool.

elasticsearch.thread_pool.generic.rejected

The number of rejected threads in the generic pool.

elasticsearch.thread_pool.get.active

The number of active threads in the get pool.

elasticsearch.thread_pool.get.queue

The number of queued threads in the get pool.

elasticsearch.thread_pool.get.threads

The total number of threads in the get pool.

elasticsearch.thread_pool.get.rejected

The number of rejected threads in the get pool.

elasticsearch.thread_pool.index.active

The number of active threads in the index pool.

elasticsearch.thread_pool.index.queue

The number of queued threads in the index pool.

elasticsearch.thread_pool.index.threads

The total number of threads in the index pool.

elasticsearch.thread_pool.index.rejected

The number of rejected threads in the index pool.

elasticsearch.thread_pool.listener.active

The number of active threads in the listener pool.

elasticsearch.thread_pool.listener.queue

The number of queued threads in the listener pool.

elasticsearch.thread_pool.listener.threads

The total number of threads in the listener pool.

elasticsearch.thread_pool.listener.rejected

The number of rejected threads in the listener pool.

elasticsearch.thread_pool.management.active

The number of active threads in the management pool.

elasticsearch.thread_pool.management.queue

The number of queued threads in the management pool.

elasticsearch.thread_pool.management.threads

The total number of threads in the management pool.

elasticsearch.thread_pool.management.rejected

The number of rejected threads in the management pool.

elasticsearch.thread_pool.merge.active

The number of active threads in the merge pool.

elasticsearch.thread_pool.merge.queue

The number of queued threads in the merge pool.

elasticsearch.thread_pool.merge.threads

The total number of threads in the merge pool.

elasticsearch.thread_pool.merge.rejected

The number of rejected threads in the merge pool.

elasticsearch.thread_pool.percolate.active

The number of active threads in the percolate pool.

elasticsearch.thread_pool.percolate.queue

The number of queued threads in the percolate pool.

elasticsearch.thread_pool.percolate.threads

The total number of threads in the percolate pool.

elasticsearch.thread_pool.percolate.rejected

The number of rejected threads in the percolate pool.

elasticsearch.thread_pool.refresh.active

The number of active threads in the refresh pool.

elasticsearch.thread_pool.refresh.queue

The number of queued threads in the refresh pool.

elasticsearch.thread_pool.refresh.threads

The total number of threads in the refresh pool.

elasticsearch.thread_pool.refresh.rejected

The number of rejected threads in the refresh pool.

elasticsearch.thread_pool.search.active

The number of active threads in the search pool.

elasticsearch.thread_pool.search.queue

The number of queued threads in the search pool.

elasticsearch.thread_pool.search.threads

The total number of threads in the search pool.

elasticsearch.thread_pool.search.rejected

The number of rejected threads in the search pool.

elasticsearch.thread_pool.snapshot.active

The number of active threads in the snapshot pool.

elasticsearch.thread_pool.snapshot.queue

The number of queued threads in the snapshot pool.

elasticsearch.thread_pool.snapshot.threads

The total number of threads in the snapshot pool.

elasticsearch.thread_pool.snapshot.rejected

The number of rejected threads in the snapshot pool.

elasticsearch.thread_pool.write.active

The number of active threads in the write pool.

elasticsearch.thread_pool.write.queue

The number of queued threads in the write pool.

elasticsearch.thread_pool.write.threads

The total number of threads in the write pool.

elasticsearch.thread_pool.write.rejected

The number of rejected threads in the write pool.

elasticsearch.transport.rx_count

The total number of packets received in cluster communication.

elasticsearch.transport.rx_size

The total size of data received in cluster communication.

elasticsearch.transport.server_open

The number of connections opened for cluster communication.

elasticsearch.transport.tx_count

The total number of packets sent in cluster communication.

elasticsearch.transport.tx_size

The total size of data sent in cluster communication.

elasticsearch.unassigned_shards

The number of shards that are unassigned to a node.

elasticsearch.delayed_unassigned_shards

The number of shards whose allocation has been delayed.

jvm.gc.collection_count

The total number of garbage collections run by the JVM.

jvm.gc.collection_time

The total time spent on garbage collection in the JVM.

jvm.gc.collectors.old.collection_time

The total time spent in major GCs in the JVM that collect old generation objects.

jvm.gc.collectors.old.count

The total count of major GCs in the JVM that collect old generation objects.

jvm.gc.collectors.young.collection_time

The total time spent in minor GCs in the JVM that collect young generation objects.

jvm.gc.collectors.young.count

The total count of minor GCs in the JVM that collect young generation objects.

jvm.gc.concurrent_mark_sweep.collection_time

The total time spent on “concurrent mark & sweep” GCs in the JVM.

jvm.gc.concurrent_mark_sweep.count

The total count of “concurrent mark & sweep” GCs in the JVM.

jvm.gc.par_new.collection_time

The total time spent on “parallel new” GCs in the JVM.

jvm.gc.par_new.count

The total count of “parallel new” GCs in the JVM.

jvm.mem.heap_committed

The amount of memory guaranteed to be available to the JVM heap.

jvm.mem.heap_in_use

The proportion of the JVM heap currently in use, expressed as a value between 0 and 1.

jvm.mem.heap_max

The maximum amount of memory that can be used by the JVM heap.

jvm.mem.heap_used

The amount of memory in bytes currently used by the JVM heap.

jvm.mem.non_heap_committed

The amount of memory guaranteed to be available to JVM non-heap.

jvm.mem.non_heap_used

The amount of memory in bytes currently used by the JVM non-heap.

jvm.mem.pools.young.used

The amount of memory in bytes currently used by the Young Generation heap region.

jvm.mem.pools.young.max

The maximum amount of memory that can be used by the Young Generation heap region.

jvm.mem.pools.old.used

The amount of memory in bytes currently used by the Old Generation heap region.

jvm.mem.pools.old.max

The maximum amount of memory that can be used by the Old Generation heap region.

jvm.mem.pools.survivor.used

The amount of memory in bytes currently used by the Survivor Space.

jvm.mem.pools.survivor.max

The maximum amount of memory that can be used by the Survivor Space.

jvm.threads.count

The number of active threads in the JVM.

jvm.threads.peak_count

The peak number of threads used by the JVM.

elasticsearch.index.health

The status of the index.

elasticsearch.index.docs.count

The number of documents in the index.

elasticsearch.index.docs.deleted

The number of deleted documents in the index.

elasticsearch.index.primary_shards

The number of primary shards in the index.

elasticsearch.index.replica_shards

The number of replica shards in the index.

elasticsearch.index.primary_store_size

The store size of primary shards in the index.

elasticsearch.index.store_size

The store size of primary and replica shards in the index.

4.10.3.2.6 - etcd Metrics

See Application Integrations for more information.

etcd.leader.counts.fail

Rate of failed Raft RPC requests.

etcd.leader.counts.success

Rate of successful Raft RPC requests.

etcd.leader.latency.avg

Average latency to each peer in the cluster.

etcd.leader.latency.current

Current latency to each peer in the cluster.

etcd.leader.latency.max

Maximum latency to each peer in the cluster.

etcd.leader.latency.min

Minimum latency to each peer in the cluster.

etcd.leader.latency.stddev

Standard deviation latency to each peer in the cluster.

etcd.self.recv.appendrequest.count

Rate of append requests this node has processed.

etcd.self.recv.bandwidthrate

Rate of bytes received.

etcd.self.recv.pkgrate

Rate of packets received.

etcd.self.send.appendrequest.count

Rate of append requests this node has sent.

etcd.self.send.bandwidthrate

Rate of bytes sent.

etcd.self.send.pkgrate

Rate of packets sent.

etcd.store.compareanddelete.fail

Rate of failed compare-and-delete requests.

etcd.store.compareanddelete.success

Rate of successful compare-and-delete requests.

etcd.store.compareandswap.fail

Rate of failed compare-and-swap requests.

etcd.store.compareandswap.success

Rate of successful compare-and-swap requests.

etcd.store.create.fail

Rate of failed create requests.

etcd.store.create.success

Rate of successful create requests.

etcd.store.delete.fail

Rate of failed delete requests.

etcd.store.delete.success

Rate of successful delete requests.

etcd.store.expire.count

Rate of expired keys.

etcd.store.gets.fail

Rate of failed get requests.

etcd.store.gets.success

Rate of successful get requests.

etcd.store.sets.fail

Rate of failed set requests.

etcd.store.sets.success

Rate of successful set requests.

etcd.store.update.fail

Rate of failed update requests.

etcd.store.update.success

Rate of successful update requests.

etcd.store.watchers

Rate of watchers.
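
These values are derived from the cumulative counters exposed by etcd's v2 statistics endpoints (`/v2/stats/self`, `/v2/stats/leader`, and `/v2/stats/store`), converted into per-second rates. A minimal sketch of that conversion for the store counters, assuming a local etcd v2 API on port 2379 (illustrative only, not the agent's implementation):

```python
import json
import time
import urllib.request

STORE_URL = "http://127.0.0.1:2379/v2/stats/store"   # assumption: local etcd v2 API

def read_store():
    with urllib.request.urlopen(STORE_URL) as resp:
        return json.load(resp)

# Sample twice and convert the cumulative counters into per-second rates,
# which is roughly what metrics such as etcd.store.gets.success report.
a = read_store()
time.sleep(10)
b = read_store()

for key in ("getsSuccess", "getsFail", "setsSuccess", "setsFail", "expireCount"):
    rate = (b[key] - a[key]) / 10.0
    print(f"{key}: {rate:.2f}/s")

print("watchers:", b["watchers"])
```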

4.10.3.2.7 - fluentd Metrics

See Application Integrations for more information.

fluentd.buffer_queue_length

The length of the plugin buffer queue for this plugin.

fluentd.buffer_total_queued_size

The size of the buffer queue for this plugin.

fluentd.retry_count

The number of retries for this plugin.
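
All three metrics are read from fluentd's `monitor_agent` plugin, which reports per-plugin counters over HTTP. A minimal sketch, assuming `monitor_agent` is enabled on its default port 24220 (illustrative only, not the agent's implementation):

```python
import json
import urllib.request

# Assumption: a <source> @type monitor_agent block on the default port 24220.
URL = "http://127.0.0.1:24220/api/plugins.json"

with urllib.request.urlopen(URL) as resp:
    plugins = json.load(resp)["plugins"]

for plugin in plugins:
    # Only buffered output plugins report these fields; others return null.
    if plugin.get("buffer_queue_length") is None:
        continue
    print(plugin["plugin_id"],
          "queue_length:", plugin["buffer_queue_length"],
          "queued_size:", plugin["buffer_total_queued_size"],
          "retries:", plugin["retry_count"])
```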

4.10.3.2.8 - Go Metrics

See Application Integrations for more information.

go_expvar.memstats.alloc

The number of bytes allocated and not yet freed.

go_expvar.memstats.frees

The cumulative count of heap objects freed.

go_expvar.memstats.heap_alloc

The number of bytes of allocated heap objects.

go_expvar.memstats.heap_idle

The number of bytes in idle spans.

go_expvar.memstats.heap_inuse

The number of bytes in non-idle spans.

go_expvar.memstats.heap_objects

The total number of allocated objects.

go_expvar.memstats.heap_released

The number of bytes released to the OS.

go_expvar.memstats.heap_sys

The number of bytes obtained from the system.

go_expvar.memstats.lookups

The number of pointer lookups.

go_expvar.memstats.mallocs

The number of mallocs.

go_expvar.memstats.num_gc

The number of garbage collections.

go_expvar.memstats.pause_ns.avg

The average of recent GC pause durations.

go_expvar.memstats.pause_ns.count

The number of submitted GC pause durations.

go_expvar.memstats.pause_ns.max

The max GC pause duration.

go_expvar.memstats.pause_ns.median

The median GC pause duration.

go_expvar.memstats.pause_total_ns

The total GC pause duration over the lifetime of the process.

go_expvar.memstats.total_alloc

The bytes allocated (even if freed).
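
The `go_expvar.memstats.*` metrics map onto the `memstats` structure that a Go process publishes through the standard `expvar` package, by default at `/debug/vars`. A minimal sketch of reading the raw values, assuming an instrumented service listening on port 8080 (illustrative only, not the agent's implementation):

```python
import json
import urllib.request

# Assumption: a Go service that imports expvar and serves /debug/vars on :8080.
with urllib.request.urlopen("http://127.0.0.1:8080/debug/vars") as resp:
    memstats = json.load(resp)["memstats"]

# Field names follow Go's runtime.MemStats; the go_expvar.memstats.* metrics
# above correspond to lower-cased versions of these keys.
print("Alloc       :", memstats["Alloc"])
print("HeapAlloc   :", memstats["HeapAlloc"])
print("HeapIdle    :", memstats["HeapIdle"])
print("HeapInuse   :", memstats["HeapInuse"])
print("NumGC       :", memstats["NumGC"])
print("PauseTotalNs:", memstats["PauseTotalNs"])
```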

4.10.3.2.9 - HTTP Metrics

See Application Integrations for more information.

http.ssl.days_left

The number of days until the SSL certificate expires.

network.http.response_time

The response time of an HTTP request to a specified URL.
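
Both checks can be approximated with the standard library. A minimal sketch measuring the response time of a URL and the days left on its certificate, using `example.com` as a placeholder target (illustrative only, not the agent's implementation):

```python
import socket
import ssl
import time
import urllib.request

URL = "https://example.com/"              # placeholder target
HOST, PORT = "example.com", 443

# network.http.response_time: wall-clock time of one request.
start = time.monotonic()
urllib.request.urlopen(URL, timeout=10).read()
print("response_time:", round(time.monotonic() - start, 3), "s")

# http.ssl.days_left: days until the server certificate expires.
ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

expires = ssl.cert_time_to_seconds(cert["notAfter"])
print("days_left:", int((expires - time.time()) // 86400))
```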

4.10.3.2.10 - HAProxy Metrics

See Application Integrations for more information.

haproxy.backend_hosts

The number of backend hosts.

haproxy.backend.bytes.in_rate

The rate of bytes in on backend hosts.

haproxy.backend.bytes.out_rate

The rate of bytes out on backend hosts.

haproxy.backend.connect.time

The average connect time over the last 1024 requests.

haproxy.backend.denied.req_rate

The number of requests denied due to security concerns.

haproxy.backend.denied.resp_rate

The number of responses denied due to security concerns.

haproxy.backend.errors.con_rate

The rate of requests that encountered an error trying to connect to a backend server.

haproxy.backend.errors.resp_rate

The rate of responses aborted due to error.

haproxy.backend.queue.current

The number of requests without an assigned backend.

haproxy.backend.queue.time

The average queue time over the last 1024 requests.

haproxy.backend.response.1xx

The backend HTTP responses with 1xx code.

haproxy.backend.response.2xx

The backend HTTP responses with 2xx code.

haproxy.backend.response.3xx

The backend HTTP responses with 3xx code.

haproxy.backend.response.4xx

The backend HTTP responses with 4xx code.

haproxy.backend.response.5xx

The backend HTTP responses with 5xx code.

haproxy.backend.response.other

The backend HTTP responses with another code (protocol error).

haproxy.backend.response.time

The average response time over the last 1024 requests (0 for TCP).

haproxy.backend.session.current

The number of active backend sessions.

haproxy.backend.session.limit

The configured backend session limit.

haproxy.backend.session.pct

The percentage of sessions in use. The formula used for this metric is backend.session.current / backend.session.limit * 100.
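
A quick worked example of that formula (the frontend variant later in this list is computed the same way from its own current/limit pair); values are illustrative:

```python
# Illustrative values only.
session_current = 150      # haproxy.backend.session.current
session_limit = 2000       # haproxy.backend.session.limit

session_pct = session_current / session_limit * 100
print(session_pct)         # 7.5
```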

haproxy.backend.session.rate

The number of backend sessions created per second.

haproxy.backend.session.time

The average total session time over the last 1024 requests.

haproxy.backend.uptime

The number of seconds since the last UP<->DOWN transition.

haproxy.backend.warnings.redis_rate

The number of times a request was redispatched to another server.

haproxy.backend.warnings.retr_rate

The number of times a connection to a server was retried.

haproxy.count_per_status

The number of hosts by status (UP/DOWN/NOLB/MAINT).

haproxy.frontend.bytes.in_rate

The rate of bytes in on frontend hosts.

haproxy.frontend.bytes.out_rate

The rate of bytes out on frontend hosts.

haproxy.frontend.denied.req_rate

The number of requests denied due to security concerns.

haproxy.frontend.denied.resp_rate

The number of responses denied due to security concerns.

haproxy.frontend.errors.req_rate

The rate of request errors.

haproxy.frontend.requests.rate

The number of HTTP requests per second.

haproxy.frontend.response.1xx

The frontend HTTP responses with 1xx code.

haproxy.frontend.response.2xx

The frontend HTTP responses with 2xx code.

haproxy.frontend.response.3xx

The frontend HTTP responses with 3xx code.

haproxy.frontend.response.4xx

The frontend HTTP responses with 4xx code.

haproxy.frontend.response.5xx

The frontend HTTP responses with 5xx code.

haproxy.frontend.response.other

The frontend HTTP responses with another code (protocol error).

haproxy.frontend.session.current

The number of active frontend sessions.

haproxy.frontend.session.limit

The configured frontend session limit.

haproxy.frontend.session.pct

The percentage of sessions in use. The formula used for this metric is frontend.session.current / frontend.session.limit * 100.

haproxy.frontend.session.rate

The number of frontend sessions created per second.

Agent 9.6.0 Additional HAProxy Metrics

  • haproxy.backend.requests.tot_rate

    The rate of the total number of HTTP requests.

  • haproxy.frontend.connections.rate

    The number of connections per second.

  • haproxy.frontend.connections.tot_rate

    The rate of the total number of connections.

  • haproxy.frontend.requests.intercepted

    The number of intercepted requests per second.

  • haproxy.frontend.requests.tot_rate

    The rate of the total number of HTTP requests.

4.10.3.2.11 - Jenkins Metrics

See Application Integrations for more information.

jenkins.job.duration

The duration of a job, measured in seconds.

jenkins.job.success

The status of a successful job.

jenkins.job.failure

The status of a failed job.

4.10.3.2.12 - Lighttpd Metrics

See Application Integrations for more information.

lighttpd.net.bytes

The total number of bytes sent and received.

lighttpd.net.bytes_per_s

The number of bytes sent and received per second.

lighttpd.net.hits

The total number of hits since the start.

lighttpd.net.request_per_s

The number of requests per second.

lighttpd.performance.busy_servers

The number of active connections.

lighttpd.performance.idle_server

The number of idle connections.

lighttpd.performance.uptime

The amount of time the server has been up and running.

4.10.3.2.13 - Memcached Metrics

See Application Integrations for more information.

memcache.avg_item_size

The average size of an item.

memcache.bytes

The current number of bytes used by this server to store items.

memcache.bytes_read_rate

The rate of bytes read from the network by this server.

memcache.bytes_written_rate

The rate of bytes written to the network by this server.

memcache.cas_badval_rate

The rate at which keys are compared and swapped where the comparison (original) value did not match the supplied value.

memcache.cas_hits_rate

The rate at which keys are compared and swapped and found present.

memcache.cas_misses_rate

The rate at which keys are compared and swapped and not found present.

memcache.cmd_flush_rate

The rate of flush_all commands.

memcache.cmd_get_rate

The rate of get commands.

memcache.cmd_set_rate

The rate of set commands.

memcache.connection_structures

The number of connection structures allocated by the server.

memcache.curr_connections

The number of open connections to this server.

memcache.curr_items

The current number of items stored by the server.

memcache.delete_hits_rate

The rate at which delete commands result in items being removed.

memcache.delete_misses_rate

The rate at which delete commands result in no items being removed.

memcache.evictions_rate

The rate at which valid items are removed from cache to free memory for new items.

memcache.fill_percent

The amount of memory being used by the server for storing items as a percentage of the max allowed.
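
This percentage is derived from the raw `bytes` and `limit_maxbytes` counters returned by the Memcached `stats` command. A minimal sketch over the plain-text protocol, assuming a local server on the default port 11211 (illustrative only, not the agent's implementation):

```python
import socket

# Assumption: a local Memcached instance on the default port.
with socket.create_connection(("127.0.0.1", 11211), timeout=5) as sock:
    sock.sendall(b"stats\r\n")
    data = b""
    while not data.endswith(b"END\r\n"):
        data += sock.recv(4096)

# Responses are lines of the form "STAT <name> <value>".
stats = {}
for line in data.decode().splitlines():
    if line.startswith("STAT "):
        _, name, value = line.split(" ", 2)
        stats[name] = value

fill_percent = int(stats["bytes"]) / int(stats["limit_maxbytes"]) * 100
print("fill_percent:", round(fill_percent, 2))
```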

memcache.get_hit_percent

The percentage of requested keys that are found present since the start of the Memcached server.

memcache.get_hits_rate

The rate at which keys are requested and found present.

memcache.get_misses_rate

The rate at which keys are requested and not found.

memcache.items.age

The age of the oldest item in the LRU.

memcache.items.crawler_reclaimed_rate

The rate at which items were freed by the LRU Crawler.

memcache.items.direct_reclaims_rate

The rate at which worker threads had to directly pull LRU tails to find memory for a new item.

memcache.items.evicted_nonzero_rate

The rate at which nonzero items which had an explicit expire time set had to be evicted from the LRU before expiring.

memcache.items.evicted_rate

The rate at which items had to be evicted from the LRU before expiring.

memcache.items.evicted_time

The number of seconds since the last access for the most recent item evicted from this class.

memcache.items.evicted_unfetched_rate

The rate at which valid items were evicted from the LRU without ever being touched after being set.

memcache.items.expired_unfetched_rate

The rate at which expired items were reclaimed from the LRU without ever being touched after being set.

memcache.items.lrutail_reflocked_rate

The rate at which items were found to be refcount-locked in the LRU tail.

memcache.items.moves_to_cold_rate

The rate at which items were moved from HOT or WARM into COLD.

memcache.items.moves_to_warm_rate

The rate at which items were moved from COLD to WARM.

memcache.items.moves_within_lru_rate

The rate at which active items were bumped within HOT or WARM.

memcache.items.number

The number of items presently stored in this slab class.

memcache.items.number_cold

The number of items presently stored in the COLD LRU.

memcache.items.number_hot

The number of items presently stored in the HOT LRU.

memcache.items.number_noexp

The number of items presently stored in the NOEXP class.

memcache.items.number_warm

The number of items presently stored in the WARM LRU.

memcache.items.outofmemory_rate

The rate at which the underlying slab class was unable to store a new item.

memcache.items.reclaimed_rate

The rate at which entries were stored using memory from an expired entry.

memcache.items.tailrepairs_rate

The rate at which Memcached self-healed a slab with a refcount leak.

memcache.limit_maxbytes

The number of bytes this server is allowed to use for storage.

memcache.listen_disabled_num_rate

The rate at which the server has reached the max connection limit.

memcache.pointer_size

The default size of pointers on the host OS (generally 32 or 64).

memcache.rusage_system_rate

The fraction of time the CPU spent executing kernel code on behalf of this server process.

memcache.rusage_user_rate

The fraction of user time the CPU spent executing this server process.

memcache.slabs.active_slabs

The total number of slab classes allocated.

memcache.slabs.cas_badval_rate

The rate at which CAS commands failed to modify a value due to a bad CAS ID.

memcache.slabs.cas_hits_rate

The rate at which CAS commands modified this slab class.

memcache.slabs.chunk_size

The amount of space each chunk uses.

memcache.slabs.chunks_per_page

The number of chunks that exist within one page.

memcache.slabs.cmd_set_rate

The rate at which set requests stored data in this slab class.

memcache.slabs.decr_hits_rate

The rate at which decrs commands modified this slab class.

memcache.slabs.delete_hits_rate

The rate at which delete commands succeeded in this slab class.

memcache.slabs.free_chunks

The number of chunks not yet allocated to items or freed via delete.

memcache.slabs.free_chunks_end

The number of free chunks at the end of the last allocated page.

memcache.slabs.get_hits_rate

The rate at which get requests were serviced by this slab class.

memcache.slabs.incr_hits_rate

The rate at which incrs commands modified this slab class.

memcache.slabs.mem_requested

The number of bytes requested to be stored in this slab.

memcache.slabs.total_chunks

The total number of chunks allocated to the slab class.

memcache.slabs.total_malloced

The total amount of memory allocated to slab pages.

memcache.slabs.total_pages

The total number of pages allocated to the slab class.

memcache.slabs.touch_hits_rate

The rate of touches serviced by this slab class.

memcache.slabs.used_chunks

The number of chunks that have been allocated to items.

memcache.slabs.used_chunks_rate

The rate at which chunks have been allocated to items.

memcache.threads

The number of threads used by the current Memcached server process.

memcache.total_connections_rate

The rate at which connections to this server are opened.

memcache.total_items

The total number of items stored by this server since it started.

memcache.uptime

The number of seconds this server has been running.

4.10.3.2.14 - Mesos/Marathon Metrics

4.10.3.2.14.1 - Mesos Agent Metrics

See Application Integrations for more information.

mesos.slave.cpus_percent

The percentage of CPUs allocated to the slave.

mesos.slave.cpus_total

The total number of CPUs.

mesos.slave.cpus_used

The number of CPUs allocated to the slave.

mesos.slave.disk_percent

The percentage of disk space allocated to the slave.

mesos.slave.disk_total

The total disk space available.

mesos.slave.disk_used

The amount of disk space allocated to the slave.

mesos.slave.executors_registering

The number of executors registering.

mesos.slave.executors_running

The number of executors currently running.

mesos.slave.executors_terminated

The number of terminated executors.

mesos.slave.executors_terminating

The number of terminating executors.

mesos.slave.frameworks_active

The number of active frameworks.

mesos.slave.invalid_framework_messages

The number of invalid framework messages.

mesos.slave.invalid_status_updates

The number of invalid status updates.

mesos.slave.mem_percent

The percentage of memory allocated to the slave.

mesos.slave.mem_total

The total memory available.

mesos.slave.mem_used

The amount of memory allocated to the slave.

mesos.slave.recovery_errors

The number of errors encountered during slave recovery.

mesos.slave.tasks_failed

The number of failed tasks.

mesos.slave.tasks_finished

The number of finished tasks.

mesos.slave.tasks_killed

The number of killed tasks.

mesos.slave.tasks_lost

The number of lost tasks.

mesos.slave.tasks_running

The number of running tasks.

mesos.slave.tasks_staging

The number of staging tasks.

mesos.slave.tasks_starting

The number of starting tasks.

mesos.slave.valid_framework_messages

The number of valid framework messages.

mesos.slave.valid_status_updates

The number of valid status updates.

mesos.state.task.cpu

The task CPU.

mesos.state.task.disk

The disk space available for the task.

mesos.state.task.mem

The amount of memory used by the task.

mesos.stats.registered

Defines whether this slave is registered with a master.

mesos.stats.system.cpus_total

The total number of CPUs available.

mesos.stats.system.load_1min

The average load for the last minute.

mesos.stats.system.load_5min

The average load for the last five minutes.

mesos.stats.system.load_15min

The average load for the last 15 minutes.

mesos.stats.system.mem_free_bytes

The amount of free memory.

mesos.stats.system.mem_total_bytes

The total amount of memory.

mesos.stats.uptime_secs

The current uptime for the slave.

4.10.3.2.14.2 - Mesos Master Metrics

See Application Integrations for more information.
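
Several of the cluster gauges below come in used/total/percent triplets; the percent value is the used amount expressed as a fraction of the total. An illustrative calculation with made-up sample numbers (not an official formula from this dictionary):

```python
# Illustrative only: how the *_percent gauges relate to the *_used / *_total pairs.
cpus_used, cpus_total = 6.5, 16.0      # sample values
mem_used, mem_total = 48_000, 64_000   # sample values, MB

cpus_percent = cpus_used / cpus_total  # ~0.41, cf. mesos.cluster.cpus_percent
mem_percent = mem_used / mem_total     # 0.75,  cf. mesos.cluster.mem_percent
print(round(cpus_percent, 2), round(mem_percent, 2))
```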

mesos.cluster.cpus_percent

The percentage of CPUs allocated to the cluster.

mesos.cluster.cpus_total

The total number of CPUs.

mesos.cluster.cpus_used

The number of CPUs used by the cluster.

mesos.cluster.disk_percent

The percentage of disk space allocated to the cluster.

mesos.cluster.disk_total

The total amount of disk space.

mesos.cluster.disk_used

The amount of disk space used by the cluster.

mesos.cluster.dropped_messages

The number of dropped messages.

mesos.cluster.event_queue_dispatches

The number of dispatches in the event queue.

mesos.cluster.event_queue_http_requests

The number of HTTP requests in the event queue.

mesos.cluster.event_queue_messages

The number of messages in the event queue.

mesos.cluster.frameworks_active

The number of active frameworks.

mesos.cluster.frameworks_connected

The number of connected frameworks.

mesos.cluster.frameworks_disconnected

The number of disconnected frameworks.

mesos.cluster.frameworks_inactive

The number of inactive frameworks.

mesos.cluster.gpus_total

The total number of GPUs.

mesos.cluster.invalid_framework_to_executor_messages

The number of invalid messages between the framework and the executor.

mesos.cluster.invalid_status_update_acknowledgements

The number of invalid status update acknowledgements.

mesos.cluster.invalid_status_updates

The number of invalid status updates.

mesos.cluster.mem_percent

The percentage of memory allocated to the cluster.

mesos.cluster.mem_total

The total amount of memory available.

mesos.cluster.mem_used

The amount of memory the cluster is using.

mesos.cluster.outstanding_offers

The number of outstanding resource offers.

mesos.cluster.slave_registrations

The number of slaves able to rejoin the cluster after a disconnect.

mesos.cluster.slave_removals

The number of slaves that have been removed for any reason, including maintenance.

mesos.cluster.slave_reregistrations

The number of slaves that have re-registered.

mesos.cluster.slave_shutdowns_canceled

The number of slave shutdown processes that have been cancelled.

mesos.cluster.slave_shutdowns_scheduled

The number of slaves that have failed health checks and are scheduled for removal.

mesos.cluster.slaves_active

The number of active slaves.

mesos.cluster.slaves_connected

The number of connected slaves.

mesos.cluster.slaves_disconnected

The number of disconnected slaves.

mesos.cluster.slaves_inactive

The number of inactive slaves.

mesos.cluster.tasks_error

The number of cluster tasks that resulted in an error.

mesos.cluster.tasks_failed

The number of failed cluster tasks.

mesos.cluster.tasks_finished

The number of completed cluster tasks.

mesos.cluster.tasks_killed

The number of killed cluster tasks.

mesos.cluster.tasks_lost

The number of lost cluster tasks.

mesos.cluster.tasks_running

The number of cluster tasks currently running.

mesos.cluster.tasks_staging

The number of cluster tasks currently staging.

mesos.cluster.tasks_starting

The number of cluster tasks starting.

mesos.cluster.valid_framework_to_executor_messages

The number of valid framework-to-executor messages.

mesos.cluster.valid_status_update_acknowledgements

The number of valid status update acknowledgements.

mesos.cluster.valid_status_updates

The number of valid status updates.

mesos.framework.cpu

The CPU of the Mesos framework.

mesos.framework.disk

The total disk space of the Mesos framework, measured in mebibytes.

mesos.framework.mem

The total memory of the Mesos framework, measured in mebibytes.

mesos.registrar.queued_operations

The number of queued operations.

mesos.registrar.registry_size_bytes

The size of the Mesos registry in bytes.

mesos.registrar.state_fetch_ms

The Mesos registry’s read latency, in milliseconds.

mesos.registrar.state_store_ms

The Mesos registry’s write latency, in milliseconds.

mesos.registrar.state_store_ms.count

The Mesos registry’s write count.

mesos.registrar.state_store_ms.max

The maximum write latency for the registry, in milliseconds.

mesos.registrar.state_store_ms.min

The minimum write latency for the registry, in milliseconds.

mesos.registrar.state_store_ms.p50

The median registry write latency, in milliseconds.

mesos.registrar.state_store_ms.p90

The 90th percentile registry write latency, in milliseconds.

mesos.registrar.state_store_ms.p95

The 95th percentile registry write latency, in milliseconds.

mesos.registrar.state_store_ms.p99

The 99th percentile registry write latency, in milliseconds.

mesos.registrar.state_store_ms.p999

The 99.9th percentile registry write latency, in milliseconds.

mesos.registrar.state_store_ms.p9999

The 99.99th percentile registry write latency, in milliseconds.

mesos.role.cpu

The CPU capacity of the configured role.

mesos.role.disk

The total disk space available to the Mesos role, in mebibytes.

mesos.role.mem

The total memory available to the Mesos role, in mebibytes.

mesos.stats.elected

Defines whether this is the elected master or not.

mesos.stats.system.cpus_total

The total number of CPUs in the system.

mesos.stats.system.load_1min

The average load for the last minute.

mesos.stats.system.load_5min

The average load for the last five minutes.

mesos.stats.system.load_15min

The average load for the last fifteen minutes.

mesos.stats.system.mem_free_bytes

The total amount of free system memory, in bytes.

mesos.stats.system.mem_total_bytes

The total cluster memory in bytes.

mesos.stats.uptime_secs

The current uptime of the cluster.

4.10.3.2.14.3 - Marathon Metrics

See Application Integrations for more information.

marathon.apps

The total number of applications.

marathon.backoffFactor

The multiplication factor for the delay between each consecutive failed task. This value is multiplied by the value of marathon.backoffSeconds each time the task fails until the maximum delay is reached, or the task succeeds.

marathon.backoffSeconds

The period of time between attempts to run a failed task. This value is multiplied by marathon.backoffFactor for each consecutive task failure, until either the task succeeds or the maximum delay is reached.
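
Taken together, backoffSeconds and backoffFactor define an exponentially growing retry delay. A minimal sketch of that progression; max_launch_delay is a hypothetical name for the configured maximum delay:

```python
def launch_delays(backoff_seconds, backoff_factor, max_launch_delay, failures=6):
    """Illustrative delay before each consecutive retry of a failing task."""
    return [min(backoff_seconds * backoff_factor ** n, max_launch_delay)
            for n in range(failures)]

# e.g. backoffSeconds=1, backoffFactor=1.15 -> 1.0, 1.15, 1.32, 1.52, 1.75, 2.01
print(launch_delays(1, 1.15, 3600))
```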

marathon.cpus

The number of CPUs configured for each application instance.

marathon.disk

The amount of disk space configured for each application instance.

marathon.instances

The number of instances of a specific application.

marathon.mem

The total amount of configured memory for each instance of a specific application.

marathon.tasksRunning

The number of tasks running for a specific application.

marathon.tasksStaged

The number of tasks staged for a specific application.

4.10.3.2.15 - MongoDB Metrics

See Application Integrations for more information.
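
Most of the metrics in this section mirror fields returned by MongoDB's serverStatus command (asserts, connections, opcounters, mem, wiredTiger, and so on), with collection-level values coming from collStats. A minimal sketch using the pymongo driver against a locally reachable mongod; the connection string and the fields printed are illustrative choices:

```python
from pymongo import MongoClient  # third-party driver, assumed installed

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# A few of the raw fields behind the metrics listed below; the agent turns
# monotonically increasing counters into per-second rates (the *ps metrics).
print(status["connections"]["current"])  # mongodb.connections.current
print(status["opcounters"]["insert"])    # basis for mongodb.opcounters.insertps
print(status["mem"]["resident"])         # mongodb.mem.resident
```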

Metrics Introduced with Agent v9.7.0

The following metrics are supported by Sysdig Agent v9.7.0 and above.

Metric Name | Description
mongodb.tcmalloc.generic.current_allocated_bytes | The number of bytes used by the application.
mongodb.tcmalloc.generic.heap_size | Bytes of system memory reserved by TCMalloc.
mongodb.tcmalloc.tcmalloc.aggressive_memory_decommit | Status of aggressive memory de-commit mode.
mongodb.tcmalloc.tcmalloc.central_cache_free_bytes | The number of free bytes in the central cache.
mongodb.tcmalloc.tcmalloc.current_total_thread_cache_bytes | The number of bytes used across all thread caches.
mongodb.tcmalloc.tcmalloc.max_total_thread_cache_bytes | The upper limit on the total number of bytes stored across all per-thread caches.
mongodb.tcmalloc.tcmalloc.pageheap_free_bytes | The number of bytes in free mapped pages in page heap.
mongodb.tcmalloc.tcmalloc.pageheap_unmapped_bytes | The number of bytes in free unmapped pages in page heap.
mongodb.tcmalloc.tcmalloc.spinlock_total_delay_ns | Gives the spinlock delay time.
mongodb.tcmalloc.tcmalloc.thread_cache_free_bytes | The number of free bytes in thread caches.
mongodb.tcmalloc.tcmalloc.transfer_cache_free_bytes | The number of free bytes that are waiting to be transferred between the central cache and a thread cache.

mongodb.asserts.msgps

Number of message assertions raised per second.

mongodb.asserts.regularps

Number of regular assertions raised per second.

mongodb.asserts.rolloversps

Number of times that the rollover counters roll over per second. The counters roll over to zero every 2^30 assertions.

mongodb.asserts.userps

Number of user assertions raised per second.

mongodb.asserts.warningps

Number of warnings raised per second.

mongodb.backgroundflushing.average_ms

Average time for each flush to disk.

mongodb.backgroundflushing.flushesps

Number of times the database has flushed all writes to disk.

mongodb.backgroundflushing.last_ms

Amount of time that the last flush operation took to complete.

mongodb.backgroundflushing.total_ms

Total amount of time that the `mongod` processes have spent writing (i.e. flushing) data to disk.

mongodb.connections.available

Number of unused available incoming connections the database can provide.

mongodb.connections.current

Number of connections to the database server from clients.

mongodb.connections.totalcreated

Total number of connections created.

mongodb.cursors.timedout

Total number of cursors that have timed out since the server process started.

mongodb.cursors.totalopen

Number of cursors that MongoDB is maintaining for clients

mongodb.dbs

Total number of existing databases

mongodb.dur.commits

Number of transactions written to the journal during the last journal group commit interval.

mongodb.dur.commitsinwritelock

Count of the commits that occurred while a write lock was held.

mongodb.dur.compression

Compression ratio of the data written to the journal.

mongodb.dur.earlycommits

Number of times MongoDB requested a commit before the scheduled journal group commit interval.

mongodb.dur.journaledmb

Amount of data written to journal during the last journal group commit interval.

mongodb.dur.timems.commits

Amount of time spent for commits.

mongodb.dur.timems.commitsinwritelock

Amount of time spent for commits that occurred while a write lock was held.

mongodb.dur.timems.dt

Amount of time over which MongoDB collected the `dur.timeMS` data.

mongodb.dur.timems.preplogbuffer

Amount of time spent preparing to write to the journal.

mongodb.dur.timems.remapprivateview

Amount of time spent remapping copy-on-write memory mapped views.

mongodb.dur.timems.writetodatafiles

Amount of time spent writing to data files after journaling.

mongodb.dur.timems.writetojournal

Amount of time spent writing to the journal

mongodb.dur.writetodatafilesmb

Amount of data written from journal to the data files during the last journal group commit interval.

mongodb.extra_info.page_faultsps

Number of page faults per second that require disk operations.

mongodb.fsynclocked

Number of fsynclocked performed on a mongo instance.

mongodb.globallock.activeclients.readers

Count of the active client connections performing read operations.

mongodb.globallock.activeclients.total

Total number of active client connections to the database.

mongodb.globallock.activeclients.writers

Count of active client connections performing write operations.

mongodb.globallock.currentqueue.readers

Number of operations that are currently queued and waiting for the read lock.

mongodb.globallock.currentqueue.total

Total number of operations queued waiting for the lock.

mongodb.globallock.currentqueue.writers

Number of operations that are currently queued and waiting for the write lock.

mongodb.globallock.locktime

Time since the database last started that the globalLock has been held.

mongodb.globallock.ratio

Ratio of the time that the globalLock has been held to the total time since it was created.

mongodb.globallock.totaltime

Time since the database last started and created the global lock.

mongodb.indexcounters.accessesps

Number of times that operations have accessed indexes per second.

mongodb.indexcounters.hitsps

Number of times per second that an index has been accessed and mongod is able to return the index from memory.

mongodb.indexcounters.missesps

Number of times per second that an operation attempted to access an index that was not in memory.

mongodb.indexcounters.missratio

Ratio of index hits to misses.

mongodb.indexcounters.resetsps

Number of times per second the index counters have been reset.

mongodb.locks.collection.acquirecount.exclusiveps

Number of times the collection lock type was acquired in the Exclusive (X) mode.

mongodb.locks.collection.acquirecount.intent_exclusiveps

Number of times the collection lock type was acquired in the Intent Exclusive (IX) mode.

mongodb.locks.collection.acquirecount.intent_sharedps

Number of times the collection lock type was acquired in the Intent Shared (IS) mode.

mongodb.locks.collection.acquirecount.sharedps

Number of times the collection lock type was acquired in the Shared (S) mode.

mongodb.locks.collection.acquirewaitcount.exclusiveps

Number of times the collection lock type acquisition in the Exclusive (X) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.collection.acquirewaitcount.sharedps

Number of times the collection lock type acquisition in the Shared (S) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.collection.timeacquiringmicros.exclusiveps

Wait time for the collection lock type acquisitions in the Exclusive (X) mode.

mongodb.locks.collection.timeacquiringmicros.sharedps

Wait time for the collection lock type acquisitions in the Shared (S) mode.

mongodb.locks.database.acquirecount.exclusiveps

Number of times the database lock type was acquired in the Exclusive (X) mode.

mongodb.locks.database.acquirecount.intent_exclusiveps

Number of times the database lock type was acquired in the Intent Exclusive (IX) mode.

mongodb.locks.database.acquirecount.intent_sharedps

Number of times the database lock type was acquired in the Intent Shared (IS) mode.

mongodb.locks.database.acquirecount.sharedps

Number of times the database lock type was acquired in the Shared (S) mode.

mongodb.locks.database.acquirewaitcount.exclusiveps

Number of times the database lock type acquisition in the Exclusive (X) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.database.acquirewaitcount.intent_exclusiveps

Number of times the database lock type acquisition in the Intent Exclusive (IX) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.database.acquirewaitcount.intent_sharedps

Number of times the database lock type acquisition in the Intent Shared (IS) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.database.acquirewaitcount.sharedps

Number of times the database lock type acquisition in the Shared (S) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.database.timeacquiringmicros.exclusiveps

Wait time for the database lock type acquisitions in the Exclusive (X) mode.

mongodb.locks.database.timeacquiringmicros.intent_exclusiveps

Wait time for the database lock type acquisitions in the Intent Exclusive (IX) mode.

mongodb.locks.database.timeacquiringmicros.intent_sharedps

Wait time for the database lock type acquisitions in the Intent Shared (IS) mode.

mongodb.locks.database.timeacquiringmicros.sharedps

Wait time for the database lock type acquisitions in the Shared (S) mode.

mongodb.locks.global.acquirecount.exclusiveps

Number of times the global lock type was acquired in the Exclusive (X) mode.

mongodb.locks.global.acquirecount.intent_exclusiveps

Number of times the global lock type was acquired in the Intent Exclusive (IX) mode.

mongodb.locks.global.acquirecount.intent_sharedps

Number of times the global lock type was acquired in the Intent Shared (IS) mode.

mongodb.locks.global.acquirecount.sharedps

Number of times the global lock type was acquired in the Shared (S) mode.

mongodb.locks.global.acquirewaitcount.exclusiveps

Number of times the global lock type acquisition in the Exclusive (X) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.global.acquirewaitcount.intent_exclusiveps

Number of times the global lock type acquisition in the Intent Exclusive (IX) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.global.acquirewaitcount.intent_sharedps

Number of times the global lock type acquisition in the Intent Shared (IS) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.global.acquirewaitcount.sharedps

Number of times the global lock type acquisition in the Shared (S) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.global.timeacquiringmicros.exclusiveps

Wait time for the global lock type acquisitions in the Exclusive (X) mode.

mongodb.locks.global.timeacquiringmicros.intent_exclusiveps

Wait time for the global lock type acquisitions in the Intent Exclusive (IX) mode.

mongodb.locks.global.timeacquiringmicros.intent_sharedps

Wait time for the global lock type acquisitions in the Intent Shared (IS) mode.

mongodb.locks.global.timeacquiringmicros.sharedps

Wait time for the global lock type acquisitions in the Shared (S) mode.

mongodb.locks.metadata.acquirecount.exclusiveps

Number of times the metadata lock type was acquired in the Exclusive (X) mode.

mongodb.locks.metadata.acquirecount.sharedps

Number of times the metadata lock type was acquired in the Shared (S) mode.

mongodb.locks.mmapv1journal.acquirecount.intent_exclusiveps

Number of times the MMAPv1 storage engine lock type was acquired in the Intent Exclusive (IX) mode.

mongodb.locks.mmapv1journal.acquirecount.intent_sharedps

Number of times the MMAPv1 storage engine lock type was acquired in the Intent Shared (IS) mode.

mongodb.locks.mmapv1journal.acquirewaitcount.intent_exclusiveps

Number of times the MMAPv1 storage engine lock type acquisition in the Intent Exclusive (IX) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.mmapv1journal.acquirewaitcount.intent_sharedps

Number of times the MMAPv1 storage engine lock type acquisition in the Intent Shared (IS) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.mmapv1journal.timeacquiringmicros.intent_exclusiveps

Wait time for the MMAPv1 storage engine lock type acquisitions in the Intent Exclusive (IX) mode.

mongodb.locks.mmapv1journal.timeacquiringmicros.intent_sharedps

Wait time for the MMAPv1 storage engine lock type acquisitions in the Intent Shared (IS) mode.

mongodb.locks.oplog.acquirecount.intent_exclusiveps

Number of times the oplog lock type was acquired in the Intent Exclusive (IX) mode.

mongodb.locks.oplog.acquirecount.sharedps

Number of times the oplog lock type was acquired in the Shared (S) mode.

mongodb.locks.oplog.acquirewaitcount.intent_exclusiveps

Number of times the oplog lock type acquisition in the Intent Exclusive (IX) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.oplog.acquirewaitcount.sharedps

Number of times the oplog lock type acquisition in the Shared (S) mode encountered waits because the locks were held in a conflicting mode.

mongodb.locks.oplog.timeacquiringmicros.intent_exclusiveps

Wait time for the oplog lock type acquisitions in the Intent Exclusive (IX) mode.

mongodb.locks.oplog.timeacquiringmicros.sharedps

Wait time for the oplog lock type acquisitions in the Shared (S) mode.

mongodb.mem.bits

Size of the in-memory storage engine.

mongodb.mem.mapped

Amount of mapped memory by the database.

mongodb.mem.mappedwithjournal

The amount of mapped memory, including the memory used for journaling.

mongodb.mem.resident

Amount of memory currently used by the database process.

mongodb.mem.virtual

Amount of virtual memory used by the mongod process.

mongodb.metrics.commands.count.failed

Number of times count failed

mongodb.metrics.commands.count.total

Number of times count executed

mongodb.metrics.commands.createIndexes.failed

Number of times createIndexes failed

mongodb.metrics.commands.createIndexes.total

Number of times createIndexes executed

mongodb.metrics.commands.delete.failed

Number of times delete failed

mongodb.metrics.commands.delete.total

Number of times delete executed

mongodb.metrics.commands.eval.failed

Number of times eval failed

mongodb.metrics.commands.eval.total

Number of times eval executed

mongodb.metrics.commands.findAndModify.failed

Number of times findAndModify failed

mongodb.metrics.commands.findAndModify.total

Number of times findAndModify executed

mongodb.metrics.commands.insert.failed

Number of times insert failed

mongodb.metrics.commands.insert.total

Number of times insert executed

mongodb.metrics.commands.update.failed

Number of times update failed

mongodb.metrics.commands.update.total

Number of times update executed

mongodb.metrics.cursor.open.notimeout

Number of open cursors with the option `DBQuery.Option.noTimeout` set to prevent timeout after a period of inactivity.

mongodb.metrics.cursor.open.pinned

Number of pinned open cursors.

mongodb.metrics.cursor.open.total

Number of cursors that MongoDB is maintaining for clients.

mongodb.metrics.cursor.timedoutps

Number of cursors that time out, per second.

mongodb.metrics.document.deletedps

Number of documents deleted per second.

mongodb.metrics.document.insertedps

Number of documents inserted per second.

mongodb.metrics.document.returnedps

Number of documents returned by queries per second.

mongodb.metrics.document.updatedps

Number of documents updated per second.

mongodb.metrics.getlasterror.wtime.numps

Number of getLastError operations per second with a specified write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.

mongodb.metrics.getlasterror.wtime.totalmillisps

Fraction of time (ms/s) that the mongod has spent performing getLastError operations with write concern (i.e. w) that wait for one or more members of a replica set to acknowledge the write operation.

mongodb.metrics.getlasterror.wtimeoutsps

Number of times per second that write concern operations have timed out as a result of the wtimeout threshold to getLastError

mongodb.metrics.operation.fastmodps

Number of update operations per second that neither cause documents to grow nor require updates to the index.

mongodb.metrics.operation.idhackps

Number of queries per second that contain the _id field.

mongodb.metrics.operation.writeconflictsps

Number of times per second that write concern operations have encountered a conflict.

mongodb.metrics.operation.scanandorderps

Number of queries per second that return sorted results that cannot perform the sort operation using an index.

mongodb.metrics.queryexecutor.scannedps

Number of index items scanned per second during queries and query-plan evaluation.

mongodb.metrics.record.movesps

Number of times per second documents move within the on-disk representation of the MongoDB data set.

mongodb.metrics.repl.apply.batches.numps

Number of batches applied across all databases per second.

mongodb.metrics.repl.apply.batches.totalmillisps

Fraction of time (ms/s) the mongod has spent applying operations from the oplog.

mongodb.metrics.repl.apply.opsps

Number of oplog operations applied per second.

mongodb.metrics.repl.buffer.count

Number of operations in the oplog buffer.

mongodb.metrics.repl.buffer.maxsizebytes

Maximum size of the buffer.

mongodb.metrics.repl.buffer.sizebytes

Current size of the contents of the oplog buffer.

mongodb.metrics.repl.network.bytesps

Amount of data read from the replication sync source per second.

mongodb.metrics.repl.network.getmores.numps

Number of getmore operations per second.

mongodb.metrics.repl.network.getmores.totalmillisps

Fraction of time (ms/s) required to collect data from getmore operations.

mongodb.metrics.repl.network.opsps

Number of operations read from the replication source per second.

mongodb.metrics.repl.network.readerscreatedps

Number of oplog query processes created per second.

mongodb.metrics.repl.preload.docs.numps

Number of documents loaded during the pre-fetch stage of replication.

mongodb.metrics.repl.preload.docs.totalmillisps

Amount of time spent loading documents as part of the pre-fetch stage of replication.

mongodb.metrics.repl.preload.indexes.numps

Number of index entries loaded by members before updating documents as part of the pre-fetch stage of replication.

mongodb.metrics.repl.preload.indexes.totalmillisps

Amount of time spent loading documents as part of the pre-fetch stage of replication.

mongodb.metrics.ttl.deleteddocumentsps

Number of documents deleted from collections with a ttl index per second.

mongodb.metrics.ttl.passesps

Number of times per second the background process removes documents from collections with a ttl index.

mongodb.network.bytesinps

The number of bytes that reflects the amount of network traffic received by this database.

mongodb.network.bytesoutps

The number of bytes that reflects the amount of network traffic sent from this database.

mongodb.network.numrequestsps

Number of distinct requests that the server has received.

mongodb.opcounters.commandps

Total number of commands per second issued to the database.

mongodb.opcounters.deleteps

Number of delete operations per second.

mongodb.opcounters.getmoreps

Number of getmore operations per second.

mongodb.opcounters.insertps

Number of insert operations per second.

mongodb.opcounters.queryps

Total number of queries per second.

mongodb.opcounters.updateps

Number of update operations per second.

mongodb.opcountersrepl.commandps

Total number of replicated commands issued to the database per second.

mongodb.opcountersrepl.deleteps

Number of replicated delete operations per second.

mongodb.opcountersrepl.getmoreps

Number of replicated getmore operations per second.

mongodb.opcountersrepl.insertps

Number of replicated insert operations per second.

mongodb.opcountersrepl.queryps

Total number of replicated queries per second.

mongodb.opcountersrepl.updateps

Number of replicated update operations per second.

mongodb.oplog.logsizemb

Total size of the oplog.

mongodb.oplog.timediff

Oplog window: difference between the first and last operation in the oplog.

mongodb.oplog.usedsizemb

Total amount of space used by the oplog.

mongodb.replset.health

Member health value of the replica set: conveys if the member is up (i.e. 1) or down (i.e. 0).

mongodb.replset.replicationlag

Delay between a write operation on the primary and its copy to a secondary.

mongodb.replset.state

State of a replica that reflects its disposition within the set.

mongodb.replset.votefraction

Fraction of votes a server will cast in a replica set election.

mongodb.replset.votes

The number of votes a server will cast in a replica set election.

mongodb.stats.datasize

Total size of the data held in this database including the padding factor.

mongodb.stats.indexes

Total number of indexes across all collections in the database.

mongodb.stats.indexsize

Total size of all indexes created on this database.

mongodb.stats.objects

Number of objects (documents) in the database across all collections.

mongodb.stats.storagesize

Total amount of space allocated to collections in this database for document storage.

mongodb.uptime

Number of seconds that the mongos or mongod process has been active.

mongodb.wiredtiger.cache.bytes_currently_in_cache

Size of the data currently in cache.

mongodb.wiredtiger.cache.failed_eviction_of_pages_exceeding_the_in_memory_maximumps

Number of failed eviction of pages that exceeded the in-memory maximum, per second.

mongodb.wiredtiger.cache.in_memory_page_splits

In-memory page splits.

mongodb.wiredtiger.cache.maximum_bytes_configured

Maximum cache size.

mongodb.wiredtiger.cache.maximum_page_size_at_eviction

Maximum page size at eviction.

mongodb.wiredtiger.cache.modified_pages_evicted

Number of modified pages evicted from the cache.

mongodb.wiredtiger.cache.pages_currently_held_in_cache

Number of pages currently held in the cache.

mongodb.wiredtiger.cache.pages_evicted_by_application_threadsps

Number of pages evicted by application threads per second.

mongodb.wiredtiger.cache.pages_evicted_exceeding_the_in_memory_maximumps

Number of pages evicted because they exceeded the cache in-memory maximum, per second.

mongodb.wiredtiger.cache.tracked_dirty_bytes_in_cache

Size of the dirty data in the cache.

mongodb.wiredtiger.cache.unmodified_pages_evicted

Number of unmodified pages evicted from the cache.

mongodb.wiredtiger.concurrenttransactions.read.available

Number of available read tickets (concurrent transactions) remaining.

mongodb.wiredtiger.concurrenttransactions.read.out

Number of read tickets (concurrent transactions) in use.

mongodb.wiredtiger.concurrenttransactions.read.totaltickets

Total number of read tickets (concurrent transactions) available.

mongodb.wiredtiger.concurrenttransactions.write.available

Number of available write tickets (concurrent transactions) remaining.

mongodb.wiredtiger.concurrenttransactions.write.out

Number of write tickets (concurrent transactions) in use.

mongodb.wiredtiger.concurrenttransactions.write.totaltickets

Total number of write tickets (concurrent transactions) available.

mongodb.collection.size

The total size in bytes of the data in the collection plus the size of every index on the collection.

mongodb.collection.avgObjSize

The size of the average object in the collection in bytes.

mongodb.collection.count

Total number of objects in the collection.

mongodb.collection.capped

Whether or not the collection is capped.

mongodb.collection.max

Maximum number of documents in a capped collection.

mongodb.collection.maxSize

Maximum size of a capped collection in bytes.

mongodb.collection.storageSize

Total storage space allocated to this collection for document storage.

mongodb.collection.nindexes

Total number of indices on the collection.

mongodb.collection.indexSizes

Size of index in bytes.

mongodb.collection.indexes.accesses.ops

Number of times the index was used.

mongodb.usage.commands.countps

Number of commands per second

mongodb.usage.commands.count

Number of commands since server start (deprecated)

mongodb.usage.commands.time

Total time spent performing commands in microseconds

mongodb.usage.getmore.countps

Number of getmore per second

mongodb.usage.getmore.count

Number of getmore since server start (deprecated)

mongodb.usage.getmore.time

Total time spent performing getmore in microseconds

mongodb.usage.insert.countps

Number of inserts per second

mongodb.usage.insert.count

Number of inserts since server start (deprecated)

mongodb.usage.insert.time

Total time spent performing inserts in microseconds

mongodb.usage.queries.countps

Number of queries per second

mongodb.usage.queries.count

Number of queries since server start (deprecated)

mongodb.usage.queries.time

Total time spent performing queries in microseconds

mongodb.usage.readLock.countps

Number of read locks per second

mongodb.usage.readLock.count

Number of read locks since server start (deprecated)

mongodb.usage.readLock.time

Total time spent performing read locks in microseconds

mongodb.usage.remove.countps

Number of removes per second

mongodb.usage.remove.count

Number of removes since server start (deprecated)

mongodb.usage.remove.time

Total time spent performing removes in microseconds

mongodb.usage.total.countps

Number of operations per second

mongodb.usage.total.count

Number of operations since server start (deprecated)

mongodb.usage.total.time

Total time spent performing operations in microseconds

mongodb.usage.update.countps

Number of updates per second

mongodb.usage.update.count

Number of updates since server start (deprecated)

mongodb.usage.update.time

Total time spent performing updates in microseconds

mongodb.usage.writeLock.countps

Number of write locks per second

mongodb.usage.writeLock.count

Number of write locks since server start (deprecated)

mongodb.usage.writeLock.time

Total time spent performing write locks in microseconds

4.10.3.2.16 - MySQL Metrics

See Application Integrations for more information.
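
These values are read from MySQL status queries such as SHOW GLOBAL STATUS and SHOW SLAVE STATUS. As one example of how a derived gauge relates to the raw counters, buffer pool utilization can be computed from the InnoDB page counts; the sketch below uses made-up sample values rather than a live server:

```python
# Illustrative derivation of the buffer pool gauges from the raw
# Innodb_buffer_pool_pages_total / Innodb_buffer_pool_pages_free counters.
pages_total = 8192   # sample Innodb_buffer_pool_pages_total
pages_free = 1024    # sample Innodb_buffer_pool_pages_free

buffer_pool_used = pages_total - pages_free           # cf. mysql.innodb.buffer_pool_used
buffer_pool_utilization = buffer_pool_used / pages_total
print(buffer_pool_used, round(buffer_pool_utilization, 3))  # 7168 0.875
```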

mysql.galera.wsrep_cluster_size

The current number of nodes in the Galera cluster.

mysql.innodb.buffer_pool_free

The number of free pages in the InnoDB Buffer Pool.

mysql.innodb.buffer_pool_total

The total number of pages in the InnoDB Buffer Pool.

mysql.innodb.buffer_pool_used

The number of used pages in the InnoDB Buffer Pool.

mysql.innodb.buffer_pool_utilization

The utilization of the InnoDB Buffer Pool.

mysql.innodb.current_row_locks

The number of current row locks.

mysql.innodb.data_reads

The rate of data reads.

mysql.innodb.data_writes

The rate of data writes.

mysql.innodb.mutex_os_waits

The rate of mutex OS waits.

mysql.innodb.mutex_spin_rounds

The rate of mutex spin rounds.

mysql.innodb.mutex_spin_waits

The rate of mutex spin waits.

mysql.innodb.os_log_fsyncs

The rate of fsync writes to the log file.

mysql.innodb.row_lock_time

The fraction of time (ms/s) spent acquiring row locks.

mysql.innodb.row_lock_waits

The number of times per second a row lock had to be waited for.

mysql.net.connections

The rate of connections to the server.

mysql.net.max_connections

The maximum number of connections that have been in use simultaneously since the server started.

mysql.performance.com_delete

The rate of delete statements.

mysql.performance.com_delete_multi

The rate of delete-multi statements.

mysql.performance.com_insert

The rate of insert statements.

mysql.performance.com_insert_select

The rate of insert-select statements.

mysql.performance.com_replace_select

The rate of replace-select statements.

mysql.performance.com_select

The rate of select statements.

mysql.performance.com_update

The rate of update statements.

mysql.performance.com_update_multi

The rate of update-multi.

mysql.performance.created_tmp_disk_tables

The rate of internal on-disk temporary tables created per second by the server while executing statements.

mysql.performance.created_tmp_files

The rate of temporary files created per second.

mysql.performance.created_tmp_tables

The rate of internal temporary tables created per second by the server while executing statements.

mysql.performance.kernel_time

The percentage of CPU time spent in kernel space by MySQL.

mysql.performance.key_cache_utilization

The key cache utilization ratio.

mysql.performance.open_files

The number of open files.

mysql.performance.open_tables

The number of tables that are open.

mysql.performance.qcache_hits

The rate of query cache hits.

mysql.performance.queries

The rate of queries.

mysql.performance.questions

The rate of statements executed by the server.

mysql.performance.slow_queries

The rate of slow queries.

mysql.performance.table_locks_waited

The total number of times that a request for a table lock could not be granted immediately and a wait was needed.

mysql.performance.table_locks_waited.gauge

The total number of times that a request for a table lock could not be granted immediately and a wait was needed, reported as a gauge.

mysql.performance.threads_connected

The number of currently open connections.

mysql.performance.threads_running

The number of threads that are not sleeping.

mysql.performance.user_time

The percentage of CPU time spent in user space by MySQL.

mysql.replication.seconds_behind_master

The lag in seconds between the master and the slave.

mysql.replication.slave_running

A boolean showing if this server is a replication slave that is connected to a replication master.

mysql.replication.slaves_connected

The number of slaves connected to a replication master.

4.10.3.2.17 - NGINX and NGINX Plus Metrics

Contents

4.10.3.2.17.1 - NGINX Metrics

See Application Integrations for more information.
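
The nginx.net.* metrics come from the open-source stub_status module. A minimal sketch of fetching and parsing that page; the /nginx_status location is an assumption that depends on how the status endpoint is configured:

```python
import re
from urllib.request import urlopen

# stub_status output looks like:
#   Active connections: 291
#   server accepts handled requests
#    16630948 16630948 31070465
#   Reading: 6 Writing: 179 Waiting: 106
with urlopen("http://localhost/nginx_status", timeout=5) as resp:
    text = resp.read().decode()

active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
reading, writing, waiting = map(int, re.search(
    r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text).groups())

# cf. nginx.net.connections, nginx.net.reading, nginx.net.writing, nginx.net.waiting
print(active, reading, writing, waiting)
```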

nginx.net.conn_dropped_per_s

The rate of connections dropped.

nginx.net.conn_opened_per_s

The rate of connections opened.

nginx.net.connections

The total number of active connections.

nginx.net.reading

The number of connections reading client requests.

nginx.net.request_per_s

The rate of requests processed.

nginx.net.waiting

The number of keep-alive connections waiting for work.

nginx.net.writing

The number of connections waiting on upstream responses and/or writing responses back to the client.

4.10.3.2.17.2 - NGINX Plus Metrics

See Application Integrations for more information.
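
NGINX Plus reports these values through its JSON status/API module rather than stub_status. As one example of how the cache counters below are commonly combined, a cache hit ratio can be derived from the hit and miss response counts; the numbers here are samples, not an official formula from this dictionary:

```python
# Illustrative cache hit ratio from the nginx.plus.cache.* counters (sample values).
hit_responses = 96_500   # cf. nginx.plus.cache.hit.responses
miss_responses = 3_500   # cf. nginx.plus.cache.miss.responses

hit_ratio = hit_responses / (hit_responses + miss_responses)
print(round(hit_ratio, 3))  # 0.965
```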

nginx.plus.cache.bypass.bytes

The total number of bytes read from the proxied server.

nginx.plus.cache.bypass.bytes_written

The total number of bytes written to the cache.

nginx.plus.cache.bypass.responses

The total number of responses from the cache.

nginx.plus.cache.bypass.responses_written

The total number of responses written to the cache.

nginx.plus.cache.cold

Boolean. Defines whether the cache loader process is still loading data from the disk into the cache or not.

nginx.plus.cache.expired.bytes

The total number of bytes read from the proxied server.

nginx.plus.cache.expired.bytes_written

The total number of bytes written to the cache.

nginx.plus.cache.expired.responses

The total number of responses not taken from the cache

nginx.plus.cache.expired.responses_written

The total number of responses written to the cache

nginx.plus.cache.hit.bytes

The total number of bytes read from the cache

nginx.plus.cache.hit.responses

The total number of responses read from the cache

nginx.plus.cache.max_size

The limit on the maximum size of the cache specified in the configuration

nginx.plus.cache.miss.bytes

The total number of bytes read from the proxied server

nginx.plus.cache.miss.bytes_written

The total number of bytes written to the cache

nginx.plus.cache.miss.responses

The total number of responses not taken from the cache

nginx.plus.cache.miss.responses_written

The total number of responses written to the cache

nginx.plus.cache.revalidated.bytes

The total number of bytes read from the cache

nginx.plus.cache.revalidated.response

The total number of responses read from the cache

nginx.plus.cache.size

The current size of the cache

nginx.plus.cache.stale.bytes

The total number of bytes read from the cache

nginx.plus.cache.stale.responses

The total number of responses read from the cache

nginx.plus.cache.updating.bytes

The total number of bytes read from the cache

nginx.plus.cache.updating.responses

The total number of responses read from the cache

nginx.plus.connections.accepted

The total number of accepted client connections.

nginx.plus.connections.active

The current number of active client connections.

nginx.plus.connections.dropped

The total number of dropped client connections.

nginx.plus.connections.idle

The current number of idle client connections.

nginx.plus.generation

The total number of configuration reloads

nginx.plus.load_timestamp

Time of the last reload of configuration (time since Epoch).

nginx.plus.pid

The ID of the worker process that handled status request.

nginx.plus.plus.upstream.peers.fails

The total number of unsuccessful attempts to communicate with the server.

nginx.plus.ppid

The ID of the master process that started the worker process

nginx.plus.processes.respawned

The total number of abnormally terminated and re-spawned child processes.

nginx.plus.requests.current

The current number of client requests.

nginx.plus.requests.total

The total number of client requests.

nginx.plus.server_zone.discarded

The total number of requests completed without sending a response.

nginx.plus.server_zone.processing

The number of client requests that are currently being processed.

nginx.plus.server_zone.received

The total amount of data received from clients.

nginx.plus.server_zone.requests

The total number of client requests received from clients.

nginx.plus.server_zone.responses.1xx

The number of responses with 1xx status code.

nginx.plus.server_zone.responses.2xx

The number of responses with 2xx status code.

nginx.plus.server_zone.responses.3xx

The number of responses with 3xx status code.

nginx.plus.server_zone.responses.4xx

The number of responses with 4xx status code.

nginx.plus.server_zone.responses.5xx

The number of responses with 5xx status code.

nginx.plus.server_zone.responses.total

The total number of responses sent to clients.

nginx.plus.server_zone.sent

The total amount of data sent to clients.

nginx.plus.slab.pages.free

The current number of free memory pages

nginx.plus.slab.pages.used

The current number of used memory pages

nginx.plus.slab.slots.fails

The number of unsuccessful attempts to allocate memory of specified size

nginx.plus.slab.slots.free

The current number of free memory slots

nginx.plus.slab.slots.reqs

The total number of attempts to allocate memory of specified size

nginx.plus.slab.slots.used

The current number of used memory slots

nginx.plus.ssl.handshakes

The total number of successful SSL handshakes.

nginx.plus.ssl.handshakes_failed

The total number of failed SSL handshakes.

nginx.plus.ssl.session_reuses

The total number of session reuses during SSL handshake.

nginx.plus.stream.server_zone.connections

The total number of connections accepted from clients

nginx.plus.stream.server_zone.discarded

The total number of requests completed without sending a response.

nginx.plus.stream.server_zone.processing

The number of client requests that are currently being processed.

nginx.plus.stream.server_zone.received

The total amount of data received from clients.

nginx.plus.stream.server_zone.sent

The total amount of data sent to clients.

nginx.plus.stream.server_zone.sessions.1xx

The number of responses with 1xx status code.

nginx.plus.stream.server_zone.sessions.2xx

The number of responses with 2xx status code.

nginx.plus.stream.server_zone.sessions.3xx

The number of responses with 3xx status code.

nginx.plus.stream.server_zone.sessions.4xx

The number of responses with 4xx status code.

nginx.plus.stream.server_zone.sessions.5xx

The number of responses with 5xx status code.

nginx.plus.stream.server_zone.sessions.total

The total number of responses sent to clients.

nginx.plus.stream.upstream.peers.active

The current number of connections

nginx.plus.stream.upstream.peers.backup

A boolean value indicating whether the server is a backup server.

nginx.plus.stream.upstream.peers.connections

The total number of client connections forwarded to this server.

nginx.plus.stream.upstream.peers.downstart

The time (time since Epoch) when the server became “unavail” or “checking” or “unhealthy”

nginx.plus.stream.upstream.peers.downtime

Total time the server was in the “unavail” or “checking” or “unhealthy” states.

nginx.plus.stream.upstream.peers.fails

The total number of unsuccessful attempts to communicate with the server.

nginx.plus.stream.upstream.peers.health_checks.checks

The total number of health check requests made.

nginx.plus.stream.upstream.peers.health_checks.fails

The number of failed health checks.

nginx.plus.stream.upstream.peers.health_checks.last_passed

Boolean indicating if the last health check request was successful and passed tests.

nginx.plus.stream.upstream.peers.health_checks.unhealthy

How many times the server became unhealthy (state “unhealthy”).

nginx.plus.stream.upstream.peers.id

The ID of the server.

nginx.plus.stream.upstream.peers.received

The total number of bytes received from this server.

nginx.plus.stream.upstream.peers.selected

The time (time since Epoch) when the server was last selected to process a connection.

nginx.plus.stream.upstream.peers.sent

The total number of bytes sent to this server.

nginx.plus.stream.upstream.peers.unavail

How many times the server became unavailable for client connections (state “unavail”).

nginx.plus.stream.upstream.peers.weight

Weight of the server.

nginx.plus.stream.upstream.zombies

The current number of servers removed from the group but still processing active client connections.

nginx.plus.timestamp

Current time since Epoch.

nginx.plus.upstream.keepalive

The current number of idle keepalive connections.

nginx.plus.upstream.peers.active

The current number of active connections.

nginx.plus.upstream.peers.backup

A boolean value indicating whether the server is a backup server.

nginx.plus.upstream.peers.downstart

The time (since Epoch) when the server became “unavail” or “unhealthy”.

nginx.plus.upstream.peers.downtime

Total time the server was in the “unavail” and “unhealthy” states.

nginx.plus.upstream.peers.health_checks.checks

The total number of health check requests made.

nginx.plus.upstream.peers.health_checks.fails

The number of failed health checks.

nginx.plus.upstream.peers.health_checks.last_passed

Boolean indicating if the last health check request was successful and passed tests.

nginx.plus.upstream.peers.health_checks.unhealthy

How many times the server became unhealthy (state “unhealthy”).

nginx.plus.upstream.peers.id

The ID of the server.

nginx.plus.upstream.peers.received

The total amount of data received from this server.

nginx.plus.upstream.peers.requests

The total number of client requests forwarded to this server.

nginx.plus.upstream.peers.responses.1xx

The number of responses with 1xx status code.

nginx.plus.upstream.peers.responses.1xx_count

The number of responses with 1xx status code (shown as count).

nginx.plus.upstream.peers.responses.2xx

The number of responses with 2xx status code.

nginx.plus.upstream.peers.responses.2xx_count

The number of responses with 2xx status code (shown as count).

nginx.plus.upstream.peers.responses.3xx

The number of responses with 3xx status code.

nginx.plus.upstream.peers.responses.3xx_count

The number of responses with 3xx status code (shown as count).

nginx.plus.upstream.peers.responses.4xx

The number of responses with 4xx status code.

nginx.plus.upstream.peers.responses.4xx_count

The number of responses with 4xx status code (shown as count).

nginx.plus.upstream.peers.responses.5xx

The number of responses with 5xx status code.

nginx.plus.upstream.peers.responses.5xx_count

The number of responses with 5xx status code (shown as count).

nginx.plus.upstream.peers.responses.total

The total number of responses obtained from this server.

nginx.plus.upstream.peers.selected

The time (since Epoch) when the server was last selected to process a request (1.7.5).

nginx.plus.upstream.peers.sent

The total amount of data sent to this server.

nginx.plus.upstream.peers.unavail

How many times the server became unavailable for client requests (state “unavail”) due to the number of unsuccessful attempts reaching the max_fails threshold.

nginx.plus.upstream.peers.weight

The weight of the server.

nginx.plus.version

The NGINX version.

4.10.3.2.18 - NTP Metrics

See Application Integrations for more information.

ntp.offset

The time difference between the local clock and the NTP reference clock, in seconds.
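
For reference, this offset follows the standard NTP clock-offset calculation from the four timestamps of a client/server exchange. A minimal sketch with sample timestamps (illustrative values only):

```python
# Standard NTP clock offset: ((t2 - t1) + (t3 - t4)) / 2, in seconds.
# t1: client transmit, t2: server receive, t3: server transmit, t4: client receive.
t1, t2, t3, t4 = 100.000, 100.120, 100.121, 100.041

offset = ((t2 - t1) + (t3 - t4)) / 2
print(offset)  # 0.1 -> the local clock is about 100 ms behind the reference
```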

4.10.3.2.19 - PGBouncer Metrics

See Application Integrations for more information.
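
PgBouncer exposes these values through its admin console (the special pgbouncer database) via commands such as SHOW POOLS and SHOW STATS. A minimal sketch with the psycopg2 driver; host, port, and credentials are placeholders:

```python
import psycopg2  # third-party driver, assumed installed

# Connect to PgBouncer's admin console, not to a backend database.
conn = psycopg2.connect(host="127.0.0.1", port=6432, dbname="pgbouncer",
                        user="stats_user", password="secret")
conn.autocommit = True  # admin console commands cannot run inside a transaction
cur = conn.cursor()

cur.execute("SHOW POOLS")  # backs pgbouncer.pools.* (cl_active, cl_waiting, ...)
print(cur.fetchall())

cur.execute("SHOW STATS")  # backs pgbouncer.stats.* (avg_query, avg_req, ...)
print(cur.fetchall())
```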

pgbouncer.pools.cl_active

The number of client connections linked to a server connection and able to process queries.

pgbouncer.pools.cl_waiting

The number of client connections waiting on a server connection.

pgbouncer.pools.maxwait

The age of the oldest unserved client connection.

pgbouncer.pools.sv_active

The number of server connections linked to a client connection.

pgbouncer.pools.sv_idle

The number of server connections idle and ready for a client query.

pgbouncer.pools.sv_login

The number of server connections currently in the process of logging in.

pgbouncer.pools.sv_tested

The number of server connections currently running either server_reset_query or server_check_query.

pgbouncer.pools.sv_used

The number of server connections idle more than server_check_delay, needing server_check_query.

pgbouncer.stats.avg_query

The average query duration.

pgbouncer.stats.avg_recv

The average amount of client network traffic received.

pgbouncer.stats.avg_req

The average number of requests per second in the last stat period.

pgbouncer.stats.avg_sent

The average amount of client network traffic sent.

pgbouncer.stats.bytes_received_per_second

The total network traffic received.

pgbouncer.stats.bytes_sent_per_second

The total network traffic sent.

pgbouncer.stats.requests_per_second

The request rate.

pgbouncer.stats.total_query_time

The time spent by PgBouncer actively querying PostgreSQL.

4.10.3.2.20 - PHP-FPM Metrics

See Application Integrations for more information.
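
These metrics are scraped from PHP-FPM's status page (pm.status_path). A minimal sketch of reading the JSON form of that page; the /status path and host are assumptions that depend on the FPM pool and web-server configuration:

```python
import json
from urllib.request import urlopen

# The FPM status page is served through the web server in front of PHP-FPM.
with urlopen("http://localhost/status?json", timeout=5) as resp:
    status = json.load(resp)

print(status["active processes"])      # cf. php_fpm.processes.active
print(status["idle processes"])        # cf. php_fpm.processes.idle
print(status["listen queue"])          # cf. php_fpm.listen_queue.size
print(status["max children reached"])  # cf. php_fpm.processes.max_reached
print(status["slow requests"])         # cf. php_fpm.requests.slow
```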

php_fpm.listen_queue.size

The size of the socket queue of pending connections.

php_fpm.processes.active

The total number of active processes.

php_fpm.processes.idle

The total number of idle processes.

php_fpm.processes.max_reached

The number of times the process limit has been reached.

php_fpm.processes.total

The total number of processes.

php_fpm.requests.accepted

The total number of accepted requests.

php_fpm.requests.slow

The total number of slow requests.

4.10.3.2.21 - PostgreSQL Metrics

See Application Integrations for more information.
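
Most of the database-level gauges below map to columns of the pg_stat_database, pg_stat_user_tables, and pg_stat_activity views. A minimal sketch with the psycopg2 driver, including the connection-usage fraction; connection details are placeholders:

```python
import psycopg2  # third-party driver, assumed installed

conn = psycopg2.connect(host="127.0.0.1", dbname="postgres",
                        user="monitor", password="secret")
cur = conn.cursor()

# pg_stat_database backs metrics such as postgresql.commits, rollbacks, deadlocks.
cur.execute("""SELECT xact_commit, xact_rollback, deadlocks, temp_files
               FROM pg_stat_database WHERE datname = current_database()""")
print(cur.fetchone())

# postgresql.percent_usage_connections ~ current connections / max_connections.
cur.execute("""SELECT count(*)::float
               / current_setting('max_connections')::float
               FROM pg_stat_activity""")
print(cur.fetchone()[0])
```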

Metric NameTypeDescription
postgresql.seq_scansgaugeThe number of sequential scans initiated on this table.
postgresql.index_scansgaugeThe number of index scans initiated on this table.
postgresql.index_rows_fetchedgaugeThe number of live rows fetched by index scans.
postgresql.rows_hot_updatedgaugeThe number of rows HOT updated, meaning no separate index update was needed.
postgresql.live_rowsgaugeThe estimated number of live rows.
postgresql.dead_rowsgaugeThe estimated number of dead rows.
postgresql.index_rows_readgaugeThe number of index entries returned by scans on this index.
postgresql.table_sizegaugeThe total disk space used by the specified table. Includes TOAST, free space map, and visibility map. Excludes indexes.
postgresql.index_sizegaugeThe total disk space used by indexes attached to the specified table.
postgresql.total_sizegaugeThe total disk space used by the table, including indexes and TOAST data.
postgresql.heap_blocks_readgaugeThe number of disk blocks read from this table.
postgresql.heap_blocks_hitgaugeThe number of buffer hits in this table.
postgresql.index_blocks_readgaugeThe number of disk blocks read from all indexes on this table.
postgresql.index_blocks_hitgaugeThe number of buffer hits in all indexes on this table.
postgresql.toast_blocks_readgaugeThe number of disk blocks read from this table’s TOAST table.
postgresql.toast_blocks_hitgaugeThe number of buffer hits in this table’s TOAST table.
postgresql.toast_index_blocks_readgaugeThe number of disk blocks read from this table’s TOAST table index.
postgresql.toast_index_blocks_hitgaugeThe number of buffer hits in this table’s TOAST table index.
postgresql.active_queriesgaugeThe number of active queries in this database.
postgresql.archiver.archived_countgaugeThe number of WAL files that have been successfully archived.
postgresql.archiver.failed_countgaugeThe number of failed attempts for archiving WAL files.
postgresql.before_xid_wraparoundgaugeThe number of transactions that can occur until a transaction wraparound.
postgresql.index_rel_rows_fetchedrateThe number of live rows fetched by index scans.
postgresql.transactions.idle_in_transactiongaugeThe number of ‘idle in transaction’ transactions in this database.
postgresql.transactions.open | gauge | The number of open transactions in this database.
postgresql.waiting_queries | gauge | The number of waiting queries in this database.
postgresql.bgwriter.buffers_alloc | gauge | The number of buffers allocated.
postgresql.bgwriter.buffers_backend | gauge | The number of buffers written directly by a backend.
postgresql.bgwriter.buffers_backend_fsync | gauge | The number of times a backend had to execute its own fsync call instead of the background writer.
postgresql.bgwriter.buffers_checkpoint | gauge | The number of buffers written during checkpoints.
postgresql.bgwriter.buffers_clean | gauge | The number of buffers written by the background writer.
postgresql.bgwriter.checkpoints_requested | gauge | The number of requested checkpoints that were performed.
postgresql.bgwriter.checkpoints_timed | gauge | The number of scheduled checkpoints that were performed.
postgresql.bgwriter.maxwritten_clean | gauge | The number of times the background writer stopped a cleaning scan due to writing too many buffers.
postgresql.bgwriter.sync_time | gauge | The total amount of checkpoint processing time spent synchronizing files to disk.
postgresql.bgwriter.write_time | gauge | The total amount of checkpoint processing time spent writing files to disk.
postgresql.buffer_hit | gauge | The number of times disk blocks were found in the buffer cache, preventing the need to read from the database.
postgresql.commits | gauge | The number of transactions that have been committed in this database.
postgresql.connections | gauge | The number of active connections to this database.
postgresql.database_size | gauge | The disk space used by this database.
postgresql.deadlocks | gauge | The number of deadlocks detected in this database.
postgresql.disk_read | gauge | The number of disk blocks read in this database.
postgresql.locks | gauge | The number of locks active for this database.
postgresql.max_connections | gauge | The maximum number of client connections allowed to this database.
postgresql.percent_usage_connections | gauge | The number of connections to this database as a fraction of the maximum number of allowed connections.
postgresql.replication_delay | gauge | The current replication delay in seconds. Only available with PostgreSQL 9.1 and newer.
postgresql.replication_delay_bytes | gauge | The current replication delay in bytes. Only available with PostgreSQL 9.2 and newer.
postgresql.rollbacks | gauge | The number of transactions that have been rolled back in this database.
postgresql.rows_deleted | gauge | The number of rows deleted by queries in this database. The metric can be segmented by ‘db’ or ‘table’ and can be viewed per-relation.
postgresql.rows_fetched | gauge | The number of rows fetched by queries in this database.
postgresql.rows_inserted | gauge | The number of rows inserted by queries in this database. The metric can be segmented by ‘db’ or ‘table’ and can be viewed per-relation.
postgresql.rows_returned | gauge | The number of rows returned by queries in this database. The metric can be segmented by ‘db’ or ‘table’ and can be viewed per-relation.
postgresql.rows_updated | gauge | The number of rows updated by queries in this database.
postgresql.table.count | gauge | The number of user tables in this database.
postgresql.temp_bytes | gauge | The amount of data written to temporary files by queries in this database.
postgresql.temp_files | gauge | The number of temporary files created by queries in this database.
postgresql.toast_blocks_read | gauge | The number of disk blocks read from this table’s TOAST table.
postgresql.transactions.idle_in_transaction | gauge | The number of ‘idle in transaction’ transactions in this database.
postgresql.transactions.open | gauge | The number of open transactions in this database.
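
Two values are commonly derived from the gauges above: the buffer cache hit ratio and connection usage. A minimal Python sketch using hypothetical sample numbers (the agent already reports postgresql.percent_usage_connections for you):

```python
# Hypothetical sample values for the gauges documented above.
buffer_hit = 95000      # postgresql.buffer_hit
disk_read = 5000        # postgresql.disk_read
connections = 42        # postgresql.connections
max_connections = 100   # postgresql.max_connections

# Buffer cache hit ratio: share of block reads served from the buffer cache.
cache_hit_ratio = buffer_hit / (buffer_hit + disk_read)

# Connection usage, as postgresql.percent_usage_connections describes it.
connection_usage = connections / max_connections

print(f"cache hit ratio: {cache_hit_ratio:.2%}, connection usage: {connection_usage:.2%}")
```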

4.10.3.2.22 - RabbitMQ Metrics

See Application Integrations for more information.

rabbitmq.connections

The number of current connections to a given rabbitmq vhost. Each connection is tagged as rabbitmq_vhost:<vhost_name>.

rabbitmq.connections.state

The number of connections in the specified connection state.

rabbitmq.exchange.messages.ack.count

The number of messages delivered to clients and acknowledged.

rabbitmq.exchange.messages.ack.rate

The rate of messages delivered to clients and acknowledged per second.

rabbitmq.exchange.messages.confirm.count

The number of messages confirmed.

rabbitmq.exchange.messages.confirm.rate

The rate of messages confirmed per second.

rabbitmq.exchange.messages.deliver_get.count

The sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get.
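
Expressed as a small sketch (the four counter names are the RabbitMQ management statistics being summed; this is an illustration, not code from the integration):

```python
# deliver_get is the sum of the four delivery/get counters described above.
def deliver_get(deliver: int, deliver_no_ack: int, get: int, get_no_ack: int) -> int:
    return deliver + deliver_no_ack + get + get_no_ack
```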

rabbitmq.exchange.messages.deliver_get.rate

The rate per second of the sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get.

rabbitmq.exchange.messages.publish_in.count

The number of messages published from channels into this exchange.

rabbitmq.exchange.messages.publish_in.rate

The rate of messages published from channels into this exchange per second.

rabbitmq.exchange.messages.publish_out.count

The number of messages published from this exchange into queues.

rabbitmq.exchange.messages.publish_out.rate

The rate of messages published from this exchange into queues per second.

rabbitmq.exchange.messages.publish.count

The number of messages published.

rabbitmq.exchange.messages.publish.rate

The rate of messages published per second.

rabbitmq.exchange.messages.redeliver.count

The number of messages in deliver_get that had the redelivered flag set.

rabbitmq.exchange.messages.redeliver.rate

The rate of messages in deliver_get that had the redelivered flag set, per second.

rabbitmq.exchange.messages.return_unroutable.count

The number of messages returned to the publisher as unroutable.

rabbitmq.exchange.messages.return_unroutable.rate

The rate of messages returned to the publisher as unroutable per second.

rabbitmq.node.disk_alarm

Defines whether the node has a disk alarm configured.

rabbitmq.node.disk_free

The current free disk space.

rabbitmq.node.fd_used

Used file descriptors.

rabbitmq.node.mem_alarm

Defines whether the node has a memory alarm configured.

rabbitmq.node.mem_used

The total memory used in bytes.

rabbitmq.node.partitions

The number of network partitions this node is seeing.

rabbitmq.node.run_queue

The average number of Erlang processes waiting to run.

rabbitmq.node.running

Defines whether the node is running or not.

rabbitmq.node.sockets_used

The number of file descriptors used as sockets.

rabbitmq.overview.messages.ack.count

The number of messages delivered to clients and acknowledged.

rabbitmq.overview.messages.ack.rate

The rate of messages delivered to clients and acknowledged per second.

rabbitmq.overview.messages.confirm.count

The number of messages confirmed.

rabbitmq.overview.messages.confirm.rate

The rate of messages confirmed per second.

rabbitmq.overview.messages.deliver_get.count

The sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get.

rabbitmq.overview.messages.deliver_get.rate

The rate per second of the sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get.

rabbitmq.overview.messages.publish_in.count

The number of messages published from channels into exchanges, aggregated across the broker.

rabbitmq.overview.messages.publish_in.rate

The rate of messages published from channels into exchanges per second, aggregated across the broker.

rabbitmq.overview.messages.publish_out.count

The number of messages published from exchanges into queues, aggregated across the broker.

rabbitmq.overview.messages.publish_out.rate

The rate of messages published from exchanges into queues per second, aggregated across the broker.

rabbitmq.overview.messages.publish.count

The number of messages published.

rabbitmq.overview.messages.publish.rate

The rate of messages published per second.

rabbitmq.overview.messages.redeliver.count

The number of messages in deliver_get that had the redelivered flag set.

rabbitmq.overview.messages.redeliver.rate

The rate of messages in deliver_get that had the redelivered flag set, per second.

rabbitmq.overview.messages.return_unroutable.count

The number of messages returned to publisher as unroutable.

rabbitmq.overview.messages.return_unroutable.rate

The rate of messages returned to publisher as unroutable per second.

rabbitmq.overview.object_totals.channels

The total number of channels.

rabbitmq.overview.object_totals.connections

The total number of connections.

rabbitmq.overview.object_totals.consumers

The total number of consumers.

rabbitmq.overview.object_totals.queues

The total number of queues.

rabbitmq.overview.queue_totals.messages_ready.count

The number of messages ready for delivery.

rabbitmq.overview.queue_totals.messages_ready.rate

The rate of messages ready for delivery.

rabbitmq.overview.queue_totals.messages_unacknowledged.count

The number of unacknowledged messages.

rabbitmq.overview.queue_totals.messages_unacknowledged.rate

The rate of unacknowledged messages.

rabbitmq.overview.queue_totals.messages.count

The total number of messages (ready plus unacknowledged).

rabbitmq.overview.queue_totals.messages.rate

The rate of messages (ready plus unacknowledged).

rabbitmq.queue.active_consumers

The number of active consumers, consumers that can immediately receive any messages sent to the queue.

rabbitmq.queue.bindings.count

The number of bindings for a specific queue.

rabbitmq.queue.consumer_utilisation

The ratio of time that a queue’s consumers can take new messages.

rabbitmq.queue.consumers

The number of consumers.

rabbitmq.queue.memory

The number of bytes of memory consumed by the Erlang process associated with the queue, including stack, heap and internal structures.

rabbitmq.queue.messages

The total number of messages in the queue.

rabbitmq.queue.messages_ready

The number of messages ready to be delivered to clients.

rabbitmq.queue.messages_ready.rate

The number of messages ready to be delivered to clients per second.

rabbitmq.queue.messages_unacknowledged

The number of messages delivered to clients but not yet acknowledged.

rabbitmq.queue.messages_unacknowledged.rate

The number of messages delivered to clients but not yet acknowledged per second.

rabbitmq.queue.messages.ack.count

The number of messages delivered to clients and acknowledged.

rabbitmq.queue.messages.ack.rate

The number of messages delivered to clients and acknowledged per second.

rabbitmq.queue.messages.deliver_get.count

The sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get.

rabbitmq.queue.messages.deliver_get.rate

The sum of messages delivered in acknowledgement mode to consumers, in no-acknowledgement mode to consumers, in acknowledgement mode in response to basic.get, and in no-acknowledgement mode in response to basic.get per second.

rabbitmq.queue.messages.deliver.count

The number of messages delivered in acknowledgement mode to consumers.

rabbitmq.queue.messages.deliver.rate

The rate of messages delivered in acknowledgement mode to consumers per second.

rabbitmq.queue.messages.publish.count

The number of messages published.

rabbitmq.queue.messages.publish.rate

The rate of messages published per second.

rabbitmq.queue.messages.rate

The total number of messages in the queue per second.

rabbitmq.queue.messages.redeliver.count

The number of messages in deliver_get that had the redelivered flag set.

rabbitmq.queue.messages.redeliver.rate

The rate of messages in deliver_get that had the redelivered flag set, per second.

4.10.3.2.23 - Supervisord Metrics

See Application Integrations for more information.

supervisord.process.count

The number of supervisord monitored processes.

supervisord.process.uptime

The process uptime.

4.10.3.2.24 - TCP Metrics

See Application Integrations for more information.

network.tcp.response_time

The response time of a given host and TCP port.
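
As an illustration of what this check measures, a minimal Python sketch that times a TCP connection; the host and port below are hypothetical, and the Sysdig agent's actual check is configured separately:

```python
import socket
import time

def tcp_response_time(host: str, port: int, timeout: float = 5.0) -> float:
    """Seconds taken to establish a TCP connection to host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start

# Hypothetical target; the real check target comes from the agent configuration.
print(tcp_response_time("example.com", 443))
```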

4.10.3.2.25 - Varnish Metrics

See Application Integrations for more information.

All Varnish metrics have the type gauge except varnish.n_purgesps, which has the type rate.

varnish.accept_fail

Accept failures. This metric is only provided by varnish 3.x.

varnish.backend_busy

Maximum number of connections to a given backend.

varnish.backend_conn

Successful connections to a given backend.

varnish.backend_fail

Failed connections for a given backend.

varnish.backend_recycle

Backend connections with keep-alive that are returned to the pool of connections.

varnish.backend_req

Backend requests.

varnish.backend_retry

Backend connection retries.

varnish.backend_reuse

Recycled connections that were reused.

varnish.backend_toolate

Backend connections closed because they were idle too long.

varnish.backend_unhealthy

Backend connections not tried because the backend was unhealthy.

varnish.bans

Bans in system, including bans superseded by newer bans and bans already checked by the ban-lurker. This metric is only provided by varnish 4.x.

varnish.bans_added

Bans added to ban list. This metric is only provided by varnish 4.x.

varnish.bans_completed

Bans which are no longer active, either because they got checked by the ban-lurker or superseded by newer identical bans. This metric is only provided by varnish 4.x.

varnish.bans_deleted

Bans deleted from ban list. This metric is only provided by varnish 4.x.

varnish.bans_dups

Bans replaced by later identical bans. This metric is only provided by varnish 4.x.

varnish.bans_lurker_contention

Times the ban-lurker waited for lookups. This metric is only provided by varnish 4.x.

varnish.bans_lurker_obj_killed

Objects killed by ban-lurker. This metric is only provided by varnish 4.x.

varnish.bans_lurker_tested

Bans and objects tested against each other by the ban-lurker. This metric is only provided by varnish 4.x.

varnish.bans_lurker_tests_tested

Tests and objects tested against each other by the ban-lurker. ‘ban req.url == foo && req.http.host == bar’ counts as one in ‘bans_tested’ and as two in ‘bans_tests_tested’. This metric is only provided by varnish 4.x.

varnish.bans_obj

Bans which use obj.* variables. These bans can possibly be washed by the ban-lurker. This metric is only provided by varnish 4.x.

varnish.bans_obj_killed

Objects killed by bans during object lookup. This metric is only provided by varnish 4.x.

varnish.bans_persisted_bytes

Bytes used by the persisted ban lists. This metric is only provided by varnish 4.x.

varnish.bans_persisted_fragmentation

Extra bytes accumulated through dropped and completed bans in the persistent ban lists. This metric is only provided by varnish 4.x.

varnish.bans_req

Bans which use req.* variables. These bans can not be washed by the ban-lurker. This metric is only provided by varnish 4.x.

varnish.bans_tested

Bans and objects tested against each other during hash lookup. This metric is only provided by varnish 4.x.

varnish.bans_tests_tested

Tests and objects tested against each other during lookup. ‘ban req.url == foo && req.http.host == bar’ counts as one in ‘bans_tested’ and as two in ‘bans_tests_tested’. This metric is only provided by varnish 4.x.

varnish.busy_sleep

Requests sent to sleep without a worker thread because they found a busy object. This metric is only provided by varnish 4.x.

varnish.busy_wakeup

Requests taken off the busy object sleep list and rescheduled. This metric is only provided by varnish 4.x.

varnish.cache_hit

Requests served from the cache.

varnish.cache_hitpass

Requests passed to a backend where the decision to pass them was found in the cache.

varnish.cache_miss

Requests fetched from a backend server.
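
A hit rate is commonly derived from varnish.cache_hit and varnish.cache_miss; it is not itself a Varnish counter. A minimal sketch:

```python
def cache_hit_rate(cache_hit: int, cache_miss: int) -> float:
    """Fraction of cacheable lookups served from the cache."""
    lookups = cache_hit + cache_miss
    return cache_hit / lookups if lookups else 0.0

print(cache_hit_rate(9500, 500))  # 0.95
```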

varnish.client_conn

Client connections accepted. This metric is only provided by varnish 3.x.

varnish.client_drop

Client connection dropped, no session. This metric is only provided by varnish 3.x.

varnish.client_drop_late

Client connection dropped late. This metric is only provided by varnish 3.x.

varnish.client_req

Parseable client requests seen.

varnish.client_req_400

Requests that were malformed in some drastic way. This metric is only provided by varnish 4.x.

varnish.client_req_411

Requests that were missing a Content-Length: header. This metric is only provided by varnish 4.x.

varnish.client_req_413

Requests that were too big. This metric is only provided by varnish 4.x.

varnish.client_req_417

Requests with a bad Expect: header. This metric is only provided by varnish 4.x.

varnish.dir_dns_cache_full

DNS director full DNS cache. This metric is only provided by varnish 3.x.

varnish.dir_dns_failed

DNS director failed lookup. This metric is only provided by varnish 3.x.

varnish.dir_dns_hit

DNS director cached lookup hit. This metric is only provided by varnish 3.x.

varnish.dir_dns_lookups

DNS director lookups. This metric is only provided by varnish 3.x.

varnish.esi_errors

Edge Side Includes (ESI) parse errors.

varnish.esi_warnings

Edge Side Includes (ESI) parse warnings.

varnish.exp_mailed

Objects mailed to expiry thread for handling. This metric is only provided by varnish 4.x.

varnish.exp_received

Objects received by expiry thread for handling. This metric is only provided by varnish 4.x.

varnish.fetch_1xx

Back end response with no body because of 1XX response (Informational).

varnish.fetch_204

Back end response with no body because of 204 response (No Content).

varnish.fetch_304

Back end response with no body because of 304 response (Not Modified).

varnish.fetch_bad

Back end response’s body length could not be determined and/or had bad headers.

varnish.fetch_chunked

Back end response bodies that were chunked.

varnish.fetch_close

Fetch wanted close.

varnish.fetch_eof

Back end response bodies with EOF.

varnish.fetch_failed

Back end response fetches that failed.

varnish.fetch_head

Back end HEAD requests.

varnish.fetch_length

Back end response bodies with Content-Length.

varnish.fetch_no_thread

Back end fetches that failed because no thread was available. This metric is only provided by varnish 4.x.

varnish.fetch_oldhttp

Number of responses served by backends with HTTP versions earlier than 1.1.

varnish.fetch_zero

Number of responses that have zero length.

varnish.hcb_insert

HCB inserts.

varnish.hcb_lock

HCB lookups with lock.

varnish.hcb_nolock

HCB lookups without lock.

varnish.LCK.backend.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.backend.creat

Created locks.

varnish.LCK.backend.destroy

Destroyed locks.

varnish.LCK.backend.locks

Lock operations.

varnish.LCK.ban.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.ban.creat

Created locks.

varnish.LCK.ban.destroy

Destroyed locks.

varnish.LCK.ban.locks

Lock operations.

varnish.LCK.busyobj.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.busyobj.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.busyobj.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.cli.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.cli.creat

Created locks.

varnish.LCK.cli.destroy

Destroyed locks.

varnish.LCK.cli.locks

Lock operations.

varnish.LCK.exp.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.exp.creat

Created locks.

varnish.LCK.exp.destroy

Destroyed locks.

varnish.LCK.exp.locks

Lock operations.

varnish.LCK.hcb.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.hcb.creat

Created locks.

varnish.LCK.hcb.destroy

Destroyed locks.

varnish.LCK.hcb.locks

Lock operations.

varnish.LCK.hcl.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.hcl.creat

Created locks.

varnish.LCK.hcl.destroy

Destroyed locks.

varnish.LCK.hcl.locks

Lock operations.

varnish.LCK.herder.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.herder.creat

Created locks.

varnish.LCK.herder.destroy

Destroyed locks.

varnish.LCK.herder.locks

Lock operations.

varnish.LCK.hsl.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.hsl.creat

Created locks.

varnish.LCK.hsl.destroy

Destroyed locks.

varnish.LCK.hsl.locks

Lock operations.

varnish.LCK.lru.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.lru.creat

Created locks.

varnish.LCK.lru.destroy

Destroyed locks.

varnish.LCK.lru.locks

Lock operations.

varnish.LCK.mempool.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.mempool.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.mempool.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.nbusyobj.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.nbusyobj.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.nbusyobj.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.objhdr.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.objhdr.creat

Created locks.

varnish.LCK.objhdr.destroy

Destroyed locks.

varnish.LCK.objhdr.locks

Lock operations.

varnish.LCK.pipestat.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.pipestat.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.pipestat.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.sess.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.sess.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.sess.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.sessmem.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.sessmem.creat

Created locks.

varnish.LCK.sessmem.destroy

Destroyed locks.

varnish.LCK.sessmem.locks

Lock operations.

varnish.LCK.sma.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.sma.creat

Created locks.

varnish.LCK.sma.destroy

Destroyed locks.

varnish.LCK.sma.locks

Lock operations.

varnish.LCK.smf.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.smf.creat

Created locks.

varnish.LCK.smf.destroy

Destroyed locks.

varnish.LCK.smf.locks

Lock operations.

varnish.LCK.smp.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.smp.creat

Created locks.

varnish.LCK.smp.destroy

Destroyed locks.

varnish.LCK.smp.locks

Lock operations.

varnish.LCK.sms.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.sms.creat

Created locks.

varnish.LCK.sms.destroy

Destroyed locks.

varnish.LCK.sms.locks

Lock operations.

varnish.LCK.stat.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.stat.creat

Created locks. This metric is only provided by varnish 3.x.

varnish.LCK.stat.destroy

Destroyed locks. This metric is only provided by varnish 3.x.

varnish.LCK.stat.locks

Lock operations. This metric is only provided by varnish 3.x.

varnish.LCK.vbe.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.vbe.creat

Created locks. This metric is only provided by varnish 3.x.

varnish.LCK.vbe.destroy

Destroyed locks. This metric is only provided by varnish 3.x.

varnish.LCK.vbe.locks

Lock operations. This metric is only provided by varnish 3.x.

varnish.LCK.vbp.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.vbp.creat

Created locks.

varnish.LCK.vbp.destroy

Destroyed locks.

varnish.LCK.vbp.locks

Lock operations.

varnish.LCK.vcapace.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.vcapace.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.vcapace.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.vcl.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.vcl.creat

Created locks.

varnish.LCK.vcl.destroy

Destroyed locks.

varnish.LCK.vcl.locks

Lock operations.

varnish.LCK.vxid.creat

Created locks. This metric is only provided by varnish 4.x.

varnish.LCK.vxid.destroy

Destroyed locks. This metric is only provided by varnish 4.x.

varnish.LCK.vxid.locks

Lock operations. This metric is only provided by varnish 4.x.

varnish.LCK.wq.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.wq.creat

Created locks.

varnish.LCK.wq.destroy

Destroyed locks.

varnish.LCK.wq.locks

Lock operations.

varnish.LCK.wstat.colls

Collisions. This metric is only provided by varnish 3.x.

varnish.LCK.wstat.creat

Created locks.

varnish.LCK.wstat.destroy

Destroyed locks.

varnish.LCK.wstat.locks

Lock operations.

varnish.losthdr

HTTP header overflows.

varnish.MEMPOOL.busyobj.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.busyobj.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req0.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.req1.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess0.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.sess1.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.allocs

Allocations. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.frees

Frees. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.live

In use. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.pool

In pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.randry

Pool ran dry. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.recycle

Recycled from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.surplus

Too many for pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.sz_needed

Size allocated. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.sz_wanted

Size requested. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.timeout

Timed out from pool. This metric is only provided by varnish 4.x.

varnish.MEMPOOL.vbc.toosmall

Too small to recycle. This metric is only provided by varnish 4.x.

varnish.MGT.child_died

Child processes that died due to signals. This metric is only provided by varnish 4.x.

varnish.MGT.child_dump

Child processes that produced core dumps. This metric is only provided by varnish 4.x.

varnish.MGT.child_exit

Child processes that were cleanly stopped. This metric is only provided by varnish 4.x.

varnish.MGT.child_panic

Child processes that panicked. This metric is only provided by varnish 4.x.

varnish.MGT.child_start

Child processes that started. This metric is only provided by varnish 4.x.

varnish.MGT.child_stop

Child processes that exited with an unexpected return code. This metric is only provided by varnish 4.x.

varnish.MGT.uptime

The management process uptime. This metric is only provided by varnish 4.x.

varnish.n_backend

Number of backends.

varnish.n_ban

Active bans. This metric is only provided by varnish 3.x.

varnish.n_ban_add

New bans added. This metric is only provided by varnish 3.x.

varnish.n_ban_dups

Duplicate bans removed. This metric is only provided by varnish 3.x.

varnish.n_ban_obj_test

Objects tested. This metric is only provided by varnish 3.x.

varnish.n_ban_re_test

Regexps tested against. This metric is only provided by varnish 3.x.

varnish.n_ban_retire

Old bans deleted. This metric is only provided by varnish 3.x.

varnish.n_expired

Objects that expired from cache because of TTL.

varnish.n_gunzip

Gunzip operations.

varnish.n_gzip

Gzip operations.

varnish.n_lru_moved

Move operations done on the LRU list.

varnish.n_lru_nuked

Objects forcefully evicted from storage to make room for new objects.

varnish.n_obj_purged

Purged objects. This metric is only provided by varnish 4.x.

varnish.n_object

object structs made.

varnish.n_objectcore

objectcore structs made.

varnish.n_objecthead

objecthead structs made.

varnish.n_objoverflow

Objects overflowing workspace. This metric is only provided by varnish 3.x.

varnish.n_objsendfile

Objects sent with sendfile. This metric is only provided by varnish 3.x.

varnish.n_objwrite

Objects sent with write. This metric is only provided by varnish 3.x.

varnish.n_purges

Purges executed. This metric is only provided by varnish 4.x.

varnish.n_sess

sess structs made. This metric is only provided by varnish 3.x.

varnish.n_sess_mem

sess_mem structs made. This metric is only provided by varnish 3.x.

varnish.n_vampireobject

Unresurrected objects.

varnish.n_vbc

vbc structs made. This metric is only provided by varnish 3.x.

varnish.n_vcl

Total VCLs loaded.

varnish.n_vcl_avail

Available VCLs.

varnish.n_vcl_discard

Discarded VCLs.

varnish.n_waitinglist

waitinglist structs made.

varnish.n_wrk

Worker threads. This metric is only provided by varnish 3.x.

varnish.n_wrk_create

Worker threads created. This metric is only provided by varnish 3.x.

varnish.n_wrk_drop

Dropped work requests. This metric is only provided by varnish 3.x.

varnish.n_wrk_failed

Worker threads not created. This metric is only provided by varnish 3.x.

varnish.n_wrk_lqueue

Work request queue length. This metric is only provided by varnish 3.x.

varnish.n_wrk_max

Worker threads limited. This metric is only provided by varnish 3.x.

varnish.n_wrk_queued

Queued work requests. This metric is only provided by varnish 3.x.

varnish.pools

Thread pools. This metric is only provided by varnish 4.x.

varnish.s_bodybytes

Total body size. This metric is only provided by varnish 3.x.

varnish.s_fetch

Backend fetches.

varnish.s_hdrbytes

Total header size. This metric is only provided by varnish 3.x.

varnish.s_pass

Passed requests.

varnish.s_pipe

Pipe sessions seen.

varnish.s_pipe_hdrbytes

Total request bytes received for piped sessions. This metric is only provided by varnish 4.x.

varnish.s_pipe_in

Total number of bytes forwarded from clients in pipe sessions. This metric is only provided by varnish 4.x.

varnish.s_pipe_out

Total number of bytes forwarded to clients in pipe sessions. This metric is only provided by varnish 4.x.

varnish.s_req

Requests.

varnish.s_req_bodybytes

Total request body bytes received. This metric is only provided by varnish 4.x.

varnish.s_req_hdrbytes

Total request header bytes received. This metric is only provided by varnish 4.x.

varnish.s_resp_bodybytes

Total response body bytes transmitted. This metric is only provided by varnish 4.x.

varnish.s_resp_hdrbytes

Total response header bytes transmitted. This metric is only provided by varnish 4.x.

varnish.s_sess

Client connections.

varnish.s_synth

Synthetic responses made. This metric is only provided by varnish 4.x.

varnish.sess_closed

Client connections closed.

varnish.sess_conn

Client connections accepted. This metric is only provided by varnish 4.x.

varnish.sess_drop

Client connections dropped due to lack of worker thread. This metric is only provided by varnish 4.x.

varnish.sess_dropped

Client connections dropped due to a full queue. This metric is only provided by varnish 4.x.

varnish.sess_fail

Failures to accept a TCP connection. Either the client changed its mind, or the kernel ran out of some resource like file descriptors. This metric is only provided by varnish 4.x.

varnish.sess_herd

varnish.sess_linger

This metric is only provided by varnish 3.x.

varnish.sess_pipe_overflow

This metric is only provided by varnish 4.x.

varnish.sess_pipeline

varnish.sess_queued

Client connections queued to wait for a thread. This metric is only provided by varnish 4.x.

varnish.sess_readahead

varnish.shm_cont

SHM MTX contention.

varnish.shm_cycles

SHM cycles through buffer.

varnish.shm_flushes

SHM flushes due to overflow.

varnish.shm_records

SHM records.

varnish.shm_writes

SHM writes.

varnish.SMA.s0.c_bytes

Total space allocated by this storage.

varnish.SMA.s0.c_fail

Times the storage has failed to provide a storage segment.

varnish.SMA.s0.c_freed

Total space returned to this storage.

varnish.SMA.s0.c_req

Times the storage has been asked to provide a storage segment.

varnish.SMA.s0.g_alloc

Storage allocations outstanding.

varnish.SMA.s0.g_bytes

Space allocated from the storage.

varnish.SMA.s0.g_space

Space left in the storage.

varnish.SMA.Transient.c_bytes

Total space allocated by this storage.

varnish.SMA.Transient.c_fail

Times the storage has failed to provide a storage segment.

varnish.SMA.Transient.c_freed

Total space returned to this storage.

varnish.SMA.Transient.c_req

Times the storage has been asked to provide a storage segment.

varnish.SMA.Transient.g_alloc

Storage allocations outstanding.

varnish.SMA.Transient.g_bytes

Space allocated from the storage.

varnish.SMA.Transient.g_space

Space left in the storage.

varnish.sms_balloc

SMS space allocated.

varnish.sms_bfree

SMS space freed.

varnish.sms_nbytes

SMS outstanding space.

varnish.sms_nobj

SMS outstanding allocations.

varnish.sms_nreq

SMS allocator requests.

varnish.thread_queue_len

Length of session queue waiting for threads. This metric is only provided by varnish 4.x.

varnish.threads

Number of threads. This metric is only provided by varnish 4.x.

varnish.threads_created

Threads created. This metric is only provided by varnish 4.x.

varnish.threads_destroyed

Threads destroyed. This metric is only provided by varnish 4.x.

varnish.threads_failed

Threads that failed to get created. This metric is only provided by varnish 4.x.

varnish.threads_limited

Threads that were needed but couldn’t be created because of a thread pool limit. This metric is only provided by varnish 4.x.

varnish.uptime

varnish.vmods

Loaded VMODs. This metric is only provided by varnish 4.x.

varnish.vsm_cooling

Space which will soon (max 1 minute) be freed in the shared memory used to communicate with tools like varnishstat, varnishlog etc. This metric is only provided by varnish 4.x.

varnish.vsm_free

Free space in the shared memory used to communicate with tools like varnishstat, varnishlog etc. This metric is only provided by varnish 4.x.

varnish.vsm_overflow

Data which does not fit in the shared memory used to communicate with tools like varnishstat, varnishlog etc. This metric is only provided by varnish 4.x.

varnish.vsm_overflowed

Total data which did not fit in the shared memory used to communicate with tools like varnishstat, varnishlog etc. This metric is only provided by varnish 4.x.

varnish.vsm_used

Used space in the shared memory used to communicate with tools like varnishstat, varnishlog etc. This metric is only provided by varnish 4.x.

varnish.n_purgesps

Purges executed per second. This metric is only provided by varnish 4.x.

4.10.3.3 - Benchmarks and Compliance

Note: Sysdig follows the Prometheus-compatible naming convention for both metrics and labels, as opposed to the previous StatsD-compatible one. However, this page still shows metrics in the legacy Sysdig naming convention. Until this page is updated, see Metrics and Label Mapping for the mapping between the legacy Sysdig and Prometheus naming conventions.

Compliance metrics are generated from scheduled CIS Benchmark scans that occur in Sysdig Secure. These metrics cover aggregate results of the various CIS Benchmark sections, as well as granular details about how many running containers are failing specific run-time compliance checks.

Contents

4.10.3.3.1 - Docker/CIS Benchmarks

compliance.docker-bench.container-images-and-build-file.pass_pct

The percentage of successful Docker benchmark tests run on the container images and build files.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max
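
Assuming pass_pct is simply the ratio of the companion tests_pass and tests_total metrics documented below (an illustrative assumption; the value is computed for you), the relationship is:

```python
def pass_pct(tests_pass: int, tests_total: int) -> float:
    # Hypothetical reconstruction: percentage of benchmark tests that passed.
    return 100.0 * tests_pass / tests_total if tests_total else 0.0
```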

compliance.docker-bench.container-images-and-build-file.tests_fail

The number of failed Docker benchmark tests run against the container images and build file.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-images-and-build-file.tests_pass

The number of successful Docker benchmark tests run against the container images and build file.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-images-and-build-file.tests_total

The total number of tests run against the container images and build file.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-runtime.pass_pct

The percentage of successful container runtime Docker benchmark tests.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-runtime.tests_fail

The number of failed container runtime benchmark tests.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-runtime.tests_pass

The number of successful container runtime Docker benchmark tests.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.container-runtime.tests_total

The total number of Docker benchmark tests run against container runtimes.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-caps-added

The number of containers running with extra kernel capabilities added, rather than the default restricted set.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-maxretry-not-set

The number of containers configured to not limit installation retries if the initial attempt fails.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-mount-prop-shared

The number of containers that use shared mount propagation.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-networking-host

The number of containers that share the host’s network namespace.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-apparmor

The number of containers running without an AppArmor profile.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-cpu-limits

The number of containers running with no CPU limits configured.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-health-check

The number of containers that have no HEALTHCHECK instruction configured.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-mem-limits

The number of containers configured to run without memory limitations.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-pids-cgroup-limit

The number of containers that do not use a cgroup for PIDs.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-restricted-privs

The number of running containers that can acquire additional privileges.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-seccomp

The number of containers that disable the default seccomp profile.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-securityopts

The number of containers running without SELinux options configured.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-no-ulimit-override

The number of containers running that override the default ulimit.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-privileged-ports

The number of containers that have privileged ports mapped into them.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-root-mounted-rw

The number of containers that mount the host’s root filesystem with read/write privileges.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-running-privileged

The number of containers running with the --privileged configuration option set.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sensitive-dirs

The number of containers that have mounted a sensitive directory from the host.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-docker-sock

The number of containers that share the host’s docker socket.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-host-devs

The number of containers that share one or more host devices.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-host-ipc-ns

The number of containers that share the host’s IPC namespace.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-host-pid-ns

The number of containers that share the host’s PID namespace.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-host-user-ns

The number of containers that share the host’s user namespace.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sharing-host-uts-ns

The number of containers that share the host’s UTS namespace.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-sshd-docker-exec-failures

The number of containers running an SSH daemon.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-unexpected-cgroup

The number of containers running without a dedicated cgroup configured.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-using-docker0-net

The number of containers using the default docker bridge network docker0.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.c-wildcard-bound-port

The number of containers that do not bind incoming traffic to a specific interface.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration.pass_pct

The percentage of successful Docker benchmark tests run against the Docker daemon configuration.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration.tests_fail

The number of benchmark tests run against the Docker daemon configuration that failed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration.tests_pass

The number of benchmark tests run against the Docker daemon configuration that passed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration.tests_total

The total number of benchmark tests run against the Docker daemon configuration.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration-files.pass_pct

The percentage of successful Docker benchmark tests run against the Docker daemon configuration files.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration-files.tests_fail

The number of benchmark tests run against the Docker daemon configuration files that failed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration-files.tests_pass

The number of benchmark tests run against the Docker daemon configuration files that passed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-daemon-configuration-files.tests_total

The total number of benchmark tests run against the Docker daemon configuration files.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-security-operations.pass_pct

The percentage of benchmark tests run against Docker security operations that were successful.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-security-operations.tests_fail

The number of benchmark tests run against Docker security operations that failed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-security-operations.tests_pass

The number of benchmark tests run against Docker security operations that passed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-security-operations.tests_total

The total number of benchmark tests run against Docker security operations.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-swarm-configuration.pass_pct

The percentage of benchmark tests run against the Docker swarm configuration that were successful.

Metadata | Description
Metric Type | Gauge
Value Type | %
Segment By | Container
Default Time Aggregation | Average
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Average
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-swarm-configuration.tests_fail

The number of benchmark tests run against the Docker swarm configuration that failed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-swarm-configuration.tests_pass

The number of benchmark tests run against the Docker swarm configuration that passed.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-swarm-configuration.tests_total

The total number of benchmark tests run against the Docker swarm configuration.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.docker-users

The number of user accounts with permission to access the Docker daemon socket.

Metadata | Description
Metric Type | Gauge
Value Type | Integer
Segment By | Container
Default Time Aggregation | Rate
Available Time Aggregation Formats | Avg, Rate, Sum, Min, Max
Default Group Aggregation | Sum
Available Group Aggregation Formats | Avg, Sum, Min, Max

compliance.docker-bench.host-configuration.pass_pct

The percentage of benchmark tests run against the host configuration that were successful.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.host-configuration.tests_fail

The number of benchmark tests run against the host configuration that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.host-configuration.tests_pass

The number of benchmark tests run against the host configuration that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.host-configuration.tests_total

The total number of benchmark tests run against the host configuration.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.img-images-using-add

The number of images that use the COPY function rather than the ADD function in Dockerfile.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.img-no-healthcheck

The number of images with no HEALTHCHECK instruction configured.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.img-running-root

The number of images that use the root user.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.img-update-insts-found

The number of images that run a package update step without a package installation step.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.pass_pct

The percentage of Docker benchmark tests run that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.score

The current pass/fail score for Docker benchmark tests run. The value of this metric is calculated by starting at zero, and incrementing once for every successful test, and decrementing once for every test that returns a WARN result or worse.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max
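
The relationship between these benchmark metrics can be made concrete with a short, illustrative calculation. The following is a minimal sketch in Python, not Sysdig agent code; the result labels ("PASS", "WARN", "FAIL") are assumptions used only for the example.

```python
# Minimal sketch (not Sysdig agent code): deriving tests_total, tests_pass,
# tests_fail, pass_pct, and score from a list of benchmark test results.
# The result labels are illustrative assumptions.

def docker_bench_summary(results):
    """results: one result string per benchmark test, e.g. ["PASS", "WARN", ...]."""
    tests_total = len(results)
    tests_pass = sum(1 for r in results if r == "PASS")
    tests_fail = sum(1 for r in results if r == "FAIL")

    # The score starts at zero, increments once for every successful test,
    # and decrements once for every test that returns WARN or worse.
    score = sum(1 if r == "PASS" else -1 for r in results)

    pass_pct = 100.0 * tests_pass / tests_total if tests_total else 0.0
    return {"tests_total": tests_total, "tests_pass": tests_pass,
            "tests_fail": tests_fail, "score": score, "pass_pct": pass_pct}

print(docker_bench_summary(["PASS", "PASS", "WARN", "FAIL"]))
# {'tests_total': 4, 'tests_pass': 2, 'tests_fail': 1, 'score': 0, 'pass_pct': 50.0}
```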

compliance.docker-bench.tests_fail

The total number of Docker benchmark tests that have failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.tests_pass

The total number of Docker benchmark tests that have passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.docker-bench.tests_total

The total number of Docker benchmark tests that have been run.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.3.2 - Kubernetes Benchmarks

compliance.k8s-bench.api-server.pass_pct

The percentage of Kubernetes benchmark tests run on the API server that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.api-server.tests_fail

The number of Kubernetes benchmark tests run on the API server that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.api-server.tests_pass

The number of Kubernetes benchmark tests run on the API server that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.api-server.tests_total

The total number of Kubernetes benchmark tests run on the API server.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.api-server.tests_warn

The number of Kubernetes benchmark tests run on the API server that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configuration-files.pass_pct

The percentage of Kubernetes benchmark tests run on the configuration files of non-master nodes that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configuration-files.tests_fail

The number of Kubernetes benchmark tests run on the configuration files of non-master nodes that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configuration-files.tests_pass

The number of Kubernetes benchmark tests run on the configuration files of non-master nodes that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configuration-files.tests_total

The total number of Kubernetes benchmark tests run on the configuration files of non-master nodes.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configuration-files.tests_warn

The number of Kubernetes benchmark tests run on the configuration files of non-master nodes that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configure-files.pass_pct

The percentage of Kubernetes benchmark tests run on the master node configuration files that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configure-files.tests_fail

The number of Kubernetes benchmark tests run on the master node configuration files that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configure-files.tests_pass

The number of Kubernetes benchmark tests run on the master node configuration files that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configure-files.tests_total

The total number of Kubernetes benchmark tests run on the master node configuration files.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.configure-files.tests_warn

The number of Kubernetes benchmark tests run on the master node configuration files that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.controller-manager.pass_pct

The percentage of Kubernetes benchmark tests run on the controller manager that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.controller-manager.tests_fail

The number of Kubernetes benchmark tests run on the controller manager that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.controller-manager.tests_pass

The number of Kubernetes benchmark tests run on the controller manager that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.controller-manager.tests_total

The total number of Kubernetes benchmark tests run on the controller manager.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.controller-manager.tests_warn

The number of Kubernetes benchmark tests run on the controller manager that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.etcd.pass_pct

The percentage of Kubernetes benchmark tests run on the etcd key value store that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.etcd.tests_fail

The number of Kubernetes benchmark tests run on the etcd key value store that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.etcd.tests_pass

The number of Kubernetes benchmark tests run on the etcd key value store that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.etcd.tests_total

The total number of Kubernetes benchmark tests run on the etcd key value store.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.etcd.tests_warn

The number of Kubernetes benchmark tests run on the etcd key value store that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.general-security-primitives.pass_pct

The percentage of Kubernetes benchmark tests run on the security primitives that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.general-security-primitives.tests_fail

The number of Kubernetes benchmark tests run on the security primitives that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.general-security-primitives.tests_pass

The number of Kubernetes benchmark tests run on the security primitives that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.general-security-primitives.tests_total

The total number of Kubernetes benchmark tests run on the security primitives.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.general-security-primitives.tests_warn

The number of Kubernetes benchmark tests run on the security primitives that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.kubelet.pass_pct

The percentage of Kubernetes benchmark tests run on the non-master node Kubernetes agent that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.kubelet.tests_fail

The number of Kubernetes benchmark tests run on the non-master node Kubernetes agent that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.kubelet.tests_pass

The number of Kubernetes benchmark tests run on the non-master node Kubernetes agent that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.kubelet.tests_total

The total number of Kubernetes benchmark tests run on the non-master node Kubernetes agent.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.kubelet.tests_warn

The number of Kubernetes benchmark tests run on the non-master node Kubernetes agent that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.pass_pct

The percentage of Kubernetes benchmark tests that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.scheduler.pass_pct

The percentage of Kubernetes benchmark tests run on the scheduler that passed.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByContainer
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.scheduler.tests_fail

The number of Kubernetes benchmark tests run on the scheduler that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.scheduler.tests_pass

The number of Kubernetes benchmark tests run on the scheduler that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.scheduler.tests_total

The total number of Kubernetes benchmark tests run on the scheduler.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.scheduler.tests_warn

The number of Kubernetes benchmark tests run on the scheduler that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.tests_fail

The number of Kubernetes benchmark tests that failed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.tests_pass

The number of Kubernetes benchmark tests that passed.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.tests_total

The total number of Kubernetes benchmark tests run.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

compliance.k8s-bench.tests_warn

The number of Kubernetes benchmark tests that returned a result of WARN.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByContainer
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.4 - Containers

Note: Sysdig follows the Prometheus-compatible naming convention for both metrics and labels, as opposed to the legacy, statsd-compatible Sysdig naming convention. However, this page still shows metrics in the legacy Sysdig naming convention. Until this page is updated, see Metrics and Label Mapping for the mapping between Sysdig legacy and Prometheus naming conventions.

This topic introduces you to the Container metrics.

container.count

The number of containers in the infrastructure.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

container.id

The container’s identifier.

For Docker containers, this value is a 12-digit hexadecimal number.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByContainer
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

container.image

The name of the image used to run the container.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByContainer
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

container.name

The name of the container.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByContainer
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

container.type

The type of container (for example, Docker, LXC, or Mesos).

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByContainer
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cpu.quota.used.percent

The percentage of CPU quota a container actually used over a defined period of time.

CPU quotas are a common way of creating a CPU limit for a container. A container can only spend its quota of time on CPU cycles across a given time period. The default time period is 100ms.

Unlike CPU shares, the CPU quota is a hard limit on the amount of CPU the container can use. For this reason, quota usage should not exceed 100% for an extended period of time, although a container may briefly exceed its quota over shorter intervals. (See the sketch after the metadata table below.)

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max
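
As a rough illustration of the quota arithmetic described above, here is a minimal sketch in Python; the CPU-time figures are hypothetical and this is not how the agent computes the metric.

```python
# Minimal sketch (not the agent's implementation): percentage of the per-period
# CPU quota that a container consumed. All values are in microseconds.

def cpu_quota_used_percent(cpu_time_used_us, quota_us):
    return 100.0 * cpu_time_used_us / quota_us

# Hypothetical example: a container granted a 50,000us quota per 100,000us
# (100ms) period that consumed 40,000us of CPU time during that period.
print(cpu_quota_used_percent(40_000, 50_000))  # 80.0
```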

cpu.shares.count

The amount of CPU shares assigned to the container’s cgroup. CPU shares represent a relative weight used by the kernel to distribute CPU cycles across different containers. Each container receives its own allocation of CPU cycles, based on the ratio of share allocation for the container versus the total share allocation for all containers. For example, if an environment has three containers, each with 1024 shares, then each will receive 1/3 of the CPU cycles.

The default value for a container is 1024.

Defining a CPU shares count is a common way to create a CPU limit for a container.

The CPU shares count is not a hard limit. A container can consume more than its allocation, as long as the CPU has cycles that are not being consumed by the container they were originally allocated to.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

cpu.shares.used.percent

The percentage of a container’s allocated CPU shares that are used. CPU shares are a common way of creating a CPU limit for a container, as they represent a relative weight used by the kernel to distribute CPU cycles across different containers. Each container receives its own allocation of CPU cycles, according to the ratio of share count vs the total number of shares claimed by all containers. For example, in an infrastructure with three containers, each with 1024 shares, each container receives 1/3 of the CPU cycles.

A container can use more CPU cycles than allocated if the CPU has cycles that are not being consumed by the container they were originally allocated to. This means that the value of cpu.shares.used.percent can exceed 100%.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max
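
The share arithmetic described in the two entries above can be sketched as follows. This is an illustrative calculation under the stated 1024-share default, not agent code, and the usage figure is hypothetical.

```python
# Illustrative calculation of the CPU-share entitlement and of
# cpu.shares.used.percent exceeding 100% (not agent code).

def cpu_share_fraction(container_shares, all_shares):
    """Fraction of CPU cycles a container is entitled to by relative weight."""
    return container_shares / sum(all_shares)

shares = [1024, 1024, 1024]              # three containers with the default share count
entitled = cpu_share_fraction(1024, shares)
print(entitled)                          # 0.333... -> each container is entitled to 1/3

# Shares are not a hard limit: if other containers leave cycles idle, actual
# usage can exceed the entitlement, so the percentage can exceed 100%.
actual_cpu_fraction = 0.5                # hypothetical: the container used half the CPU
print(100.0 * actual_cpu_fraction / entitled)  # ~150.0
```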

memory.limit.bytes

The RAM limit assigned to a container. The default value is 0.

MetadataDescription
Metric TypeGauge
Value TypeByte
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

memory.limit.used.percent

The percentage of the memory limit used by a container.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

swap.limit.bytes

The swap limit assigned to a container.

MetadataDescription
Metric TypeGauge
Value TypeByte
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

swap.limit.used.percent

The percentage of swap limit used by the container.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByHost, Container, Process, Kubernetes, Mesos, Swarm, CloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.5 - Cloud Provider

Note: Sysdig follows the Prometheus-compatible naming convention for both metrics and labels, as opposed to the legacy, statsd-compatible Sysdig naming convention. However, this page still shows metrics in the legacy Sysdig naming convention. Until this page is updated, see Metrics and Label Mapping for the mapping between Sysdig legacy and Prometheus naming conventions.

At this time, all cloudProvider metrics are AWS-related.

cloudProvider.account.id

The cloud provider instance account number.

This metric is useful if there are multiple accounts linked with Sysdig Monitor.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.availabilityZone

The AWS Availability Zone where the entity or entities are located. Each availability zone is an isolated subsection of an AWS region. See cloudProvider.region.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.host.ip.private

The private IP address allocated by the cloud provider for the instance. This address can be used for communication between instances in the same network.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.host.ip.public

The public IP address of the selected host.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.host.name

The name of the host as reported by the cloud provider.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.id

The ID number as assigned and reported by the cloud provider.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.instance.type

The type of the instance (for example, t2.micro or m5.large for AWS).

This metric is extremely useful to segment instances and compare their resource usage and saturation. You can use it as a grouping criteria for the explore table to quickly explore AWS usage on a per-instance-type basis. You can also use it to compare things like CPU usage, number of requests or network utilization for different instance types.

Use this grouping criteria in conjunction with the host.count metric to easily create a report on how many instances of each type you have.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A
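
As a rough sketch of the per-instance-type report described above, the following groups a hypothetical host list by its cloudProvider.instance.type label and counts hosts per type. The data is invented for illustration and no Sysdig API is involved.

```python
# Rough sketch of a per-instance-type host count, using invented data
# (this does not call the Sysdig API).
from collections import Counter

hosts = [
    {"host.hostName": "web-1", "cloudProvider.instance.type": "t2.micro"},
    {"host.hostName": "web-2", "cloudProvider.instance.type": "t2.micro"},
    {"host.hostName": "db-1",  "cloudProvider.instance.type": "m5.large"},
]

host_count_by_type = Counter(h["cloudProvider.instance.type"] for h in hosts)
print(dict(host_count_by_type))  # {'t2.micro': 2, 'm5.large': 1}
```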

cloudProvider.name

The name of the cloud provider (for example, AWS or Rackspace).

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.region

The region the cloud provider host (or group of hosts) is located in.

Use this grouping criteria in conjunction with the host.count metric to easily create a report on how many instances you have in each region.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.resource.endPoint

The DNS name at which the resource can be accessed.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.resource.name

The cloud provider service name (for example, Amazon EC2 or Amazon ELB).

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.resource.type

The cloud provider service type (for example, INSTANCE, LOAD_BALANCER, DATABASE).

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

cloudProvider.status

Resource status.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

4.10.3.5.1.1 - Elasticache

Amazon ElastiCache is a cloud-caching service that increases the performance, speed, and redundancy with which applications can retrieve data by providing an in-memory database caching system.

aws.elasticache.CPUUtilization

The percentage of CPU utilization.

If utilization is high and your main workload comes from read requests, scale your cache cluster out by adding read replicas. If the main workload comes from write requests, scale up by using a larger cache instance type.

For more information, refer to the ElastiCache documentation.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByCloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elasticache.FreeableMemory

The amount of memory considered free, or that could be made available, for use by the node.

For more information, refer to the ElastiCache documentation.

MetadataDescription
Metric TypeGauge
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elasticache.NetworkBytesIn

The number of bytes the host has read from the network.

For more information, refer to the ElastiCache documentation.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elasticache.NetworkBytesOut

The number of bytes the host has written to the network.

For more information, refer to the ElastiCache documentation.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elasticache.SwapUsage

The amount of swap space used on the host.

If swap is being utilized, the node probably needs more memory than is available and cache performance may be negatively impacted. Consider adding more nodes or using larger ones to reduce or eliminate swapping.

For more information, refer to the ElastiCache documentation.

MetadataDescription
Metric TypeGauge
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.5.1.2 - Elastic Application Load Balancing (ALB)

Application Load Balancer is best suited for load balancing of HTTP and HTTPS traffic and provides advanced request routing targeted at the delivery of modern application architectures, including microservices and containers. For more information, refer to the Elastic Application Load Balancer documentation.

aws.alb.ActiveConnectionCount

The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to the targets.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.ClientTLSNegotiationErrorCount

The number of TLS connections initiated by the client that did not establish a session with the load balancer.

Possible causes include a mismatch of ciphers or protocols.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.ConsumedLCUs

The number of load balancer capacity units (LCU) used by the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_ELB_4XX_Count

The number of HTTP 4XX client error codes that originate from the load balancer. Client errors are generated when requests are malformed or incomplete. These requests have not been received by the target.

This count does not include any response codes generated by the targets.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_ELB_5XX_Count

The number of HTTP 5XX server error codes that originate from the load balancer.

This count does not include any response codes generated by the targets.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_Target_2XX_Count

The number of HTTP 2XX response codes generated by the target.

This count does not include any response codes generated by the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_Target_3XX_Count

The number of HTTP 3XX response codes generated by the target.

This count does not include any response codes generated by the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_Target_4XX_Count

The number of HTTP 4XX response codes generated by the target.

This count does not include any response codes generated by the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HTTPCode_Target_5XX_Count

The number of HTTP 5XX response codes generated by the target.

This count does not include any response codes generated by the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.HealthyHostCount

The number of targets that are considered healthy.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.IPv6ProcessedBytes

The total number of bytes processed by the load balancer over IPv6.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.IPv6RequestCount

The total number of requests received by the load balancer over IPv6.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.NewConnectionCount

The total number of new TCP connections established from clients to the load balancer and from the load balancer to targets.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.ProcessedBytes

The total number of bytes processed by the load balancer over IPv4 and IPv6.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.RejectedConnectionCount

The number of connections that were rejected because the load balancer had reached its maximum number of connections.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.RequestCount

The number of requests processed over IPv4 and IPv6. This count only includes the requests with a response generated by a target of the load balancer.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.RequestCountPerTarget

The average number of requests received by each target in a target group.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.RuleEvaluations

The number of rules processed by the load balancer given a request rate averaged over an hour.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.TargetConnectionErrorCount

The number of connections that were not successfully established between the load balancer and target.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.TargetResponseTime

The time elapsed, in seconds, after the request leaves the load balancer until a response from the target is received.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.TargetTLSNegotiationErrorCount

The number of TLS connections initiated by the load balancer that did not establish a session with the target.

Possible causes include a mismatch of ciphers or protocols.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.alb.UnHealthyHostCount

The number of targets that are considered unhealthy.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.5.1.3 - Elastic Cloud Compute (EC2)

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.

aws.ec2.CPUCreditBalance

The CPU credit balance of an instance, based on what has accrued since it started. For more information, refer to the Elastic Compute Cloud metric definition table.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.CPUCreditUsage

The CPU credit usage by the instance. For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeGauge
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.CPUUtilization

The percentage of allocated EC2 compute units currently in use on the instance. For more information, refer to the Elastic Compute Cloud metric definition documentation.

This metric identifies the processing power required to run an application upon a selected instance.

MetadataDescription
Metric TypeGauge
Value Type%
Segment ByCloudProvider
Default Time AggregationAverage
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.DiskReadBytes

The total bytes read from all ephemeral disks available to the instance. This metric is used to determine the volume of the data the application reads from the disk and can be used to determine the speed of the application.

The number reported is the number of bytes read during the specified period. For basic (five-minute) monitoring, divide this number by 300 to find bytes/second; for detailed (one-minute) monitoring, divide it by 60. (See the sketch after the metadata table below.)

For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max
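
To illustrate the conversion described above, here is a minimal sketch with hypothetical byte counts:

```python
# Minimal sketch: converting a per-period DiskReadBytes total into an
# average bytes/second figure. The byte counts are hypothetical.

def bytes_per_second(total_bytes, period_seconds):
    return total_bytes / period_seconds

print(bytes_per_second(15_000_000, 300))  # basic (five-minute) monitoring -> 50000.0 B/s
print(bytes_per_second(3_000_000, 60))    # detailed (one-minute) monitoring -> 50000.0 B/s
```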

aws.ec2.DiskReadOps

Total completed read operations from all ephemeral disks available to the instance in a specified period of time. For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationAverage
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.DiskWriteBytes

The total bytes written to all ephemeral disks available to the instance. This metric is used to determine the volume of the data the application writes to the disk and can be used to determine the speed of the application.

The number reported is the number of bytes written during the specified period. For basic (five-minute) monitoring, divide this number by 300 to find bytes/second; for detailed (one-minute) monitoring, divide it by 60.

For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.DiskWriteOps

The completed write operations to all ephemeral disks available to the instance in a specified period of time. If your instance uses Amazon EBS volumes, see Amazon EBS Metrics. For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.NetworkIn

The number of bytes received on all network interfaces by the instance. For more information, refer to the Elastic Compute Cloud metric definition documentation.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.ec2.NetworkOut

The number of bytes sent out on all network interfaces by the instance. For more information, refer to the Elastic Compute Cloud metric definition documentation.

This metric identifies the volume of outgoing network traffic to an application on a single instance.

MetadataDescription
Metric TypeCounter
Value TypeByte
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

4.10.3.5.1.4 - Elastic Container Service (ECS)

Amazon Elastic Container Service (Amazon ECS) is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS. Amazon ECS eliminates the need for you to install and operate your own container orchestration software, manage and scale a cluster of virtual machines, or schedule containers on those virtual machines.

ecs.clusterName

The name of the cluster. For more information, refer to the AWS CloudFormation documentation.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

ecs.serviceName

The name of the Elastic Container Service (Amazon ECS) service. For more information, refer to the AWS CloudFormation documentation.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

ecs.taskFamilyName

The name of the task definition family. For more information, refer to the AWS CloudFormation documentation.

MetadataDescription
Metric TypeGauge
Value TypeString
Segment ByCloudProvider
Default Time AggregationN/A
Available Time Aggregation FormatsN/A
Default Group AggregationN/A
Available Group Aggregation FormatsN/A

4.10.3.5.1.5 - Elastic Load Balancing (ELB)

Elastic Load Balancing automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and Lambda functions.

aws.elb.BackendConnectionErrors

The number of errors encountered by the load balancer while attempting to connect to your application.

High error counts indicate that the ELB is having problems connecting to your servers; look for network-related issues or check that the servers are operating correctly.

For more information, refer to the Elastic Load Balancing documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elb.HealthyHostCount

A count of the number of healthy instances that are bound to the load balancer.

Hosts are declared healthy if they meet the threshold for the number of consecutive health checks that are successful. Hosts that have failed more health checks than the value of the unhealthy threshold are considered unhealthy. If cross-zone load balancing is enabled, the count of healthy instances is calculated across all Availability Zones. (See the sketch after the metadata table below.)

For more information, refer to the Elastic Load Balancing documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max
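
The healthy/unhealthy threshold behavior described above can be sketched as follows. The thresholds and check history are illustrative assumptions, not ELB internals.

```python
# Illustrative sketch of consecutive health-check thresholds (not ELB internals).
# A host becomes healthy after `healthy_threshold` consecutive successes and
# unhealthy after `unhealthy_threshold` consecutive failures.

def health_state(check_history, healthy_threshold=3, unhealthy_threshold=2):
    """check_history: most recent results last; True means the check succeeded."""
    if len(check_history) >= healthy_threshold and all(check_history[-healthy_threshold:]):
        return "healthy"      # counted in HealthyHostCount
    if len(check_history) >= unhealthy_threshold and not any(check_history[-unhealthy_threshold:]):
        return "unhealthy"    # counted in UnHealthyHostCount
    return "unchanged"        # not enough consecutive results to change state

print(health_state([True, True, True]))    # healthy
print(health_state([True, False, False]))  # unhealthy
```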

aws.elb.HTTPCode_Backend_2XX

The count of the number of HTTP 2XX response codes generated by back-end instances. This metric does not include any response codes generated by the load balancer.

The 2XX class status codes represent successful actions (e.g., 200-OK, 201-Created, 202-Accepted, 203-Non-Authoritative Info).

For more information, refer to the Elastic Load Balancing documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elb.HTTPCode_Backend_3XX

The count of the number of HTTP 3XX response codes generated by back-end instances. This metric does not include any response codes generated by the load balancer.

The 3XX class status code indicates that the user agent requires action (e.g., 301-Moved Permanently, 302-Found, 305-Use Proxy, 307-Temporary Redirect).

For more information, refer to the Elastic Load Balancing documentation.

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elb.HTTPCode_Backend_4XX

The count of the number of HTTP 4XX response codes generated by back-end instances. This metric does not include any response codes generated by the load balancer. For more information, refer to the Elastic Load Balancing documentation.

The 4XX class status code represents client errors (e.g., 400-Bad Request, 401-Unauthorized, 403-Forbidden, 404-Not Found).

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elb.HTTPCode_Backend_5XX

The count of the number of HTTP 5XX response codes generated by back-end instances. This metric does not include any response codes generated by the load balancer. For more information, refer to the Elastic Load Balancing documentation.

The 5XX class status code represents back-end server errors (e.g., 500-Internal Server Error, 501-Not Implemented, 503-Service Unavailable).

MetadataDescription
Metric TypeCounter
Value TypeInteger
Segment ByCloudProvider
Default Time AggregationRate
Available Time Aggregation FormatsAvg, Rate, Sum, Min, Max
Default Group AggregationSum
Available Group Aggregation FormatsAvg, Sum, Min, Max

aws.elb.HTTPCode_ELB_4XX

The count of the number of HTTP 4XX client error codes generated by the load balancer when the listener is configured to use HTTP or HTTPS protocols. For more information, refer to the Elastic Load Balancing documentation.

Client errors are generated when a request is malformed or incomplete.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.HTTPCode_ELB_5XX

The number of HTTP 5XX server error codes generated by the load balancer when the listener is configured to use the HTTP or HTTPS protocol. This metric does not include any responses generated by back-end instances. For more information, refer to the Elastic Load Balancing documentation.

The metric is reported if there are no back-end instances that are healthy or registered to the load balancer, or if the request rate exceeds the capacity of the instances or the load balancers.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.Latency

A measurement of the time taken by back-end instances to process requests. For more information, refer to the Elastic Load Balancing documentation.

Latency metrics from the ELB are good indicators of the overall performance of your application.

Metric Type: Counter
Value Type: relativeTime
Segment By: CloudProvider
Default Time Aggregation: Average
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.RequestCount

The number of requests handled by the load balancer. For more information, refer to the Elastic Load Balancing documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.SpilloverCount

The total number of requests that were rejected because the queue was full. For more information, refer to the Elastic Load Balancing documentation.

Positive values indicate that some requests are not being forwarded to any server. Clients are not notified that their requests were dropped.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.SurgeQueueLength

The total number of requests that are pending submission to a registered instance. For more information, refer to the Elastic Load Balancing documentation.

Positive numbers indicate clients are waiting for their requests to be forwarded to a server for processing.
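
Conceptually, SurgeQueueLength and SpilloverCount describe the same bounded queue from two angles: requests waiting for a back-end instance accumulate in the surge queue, and requests that arrive while that queue is full are rejected and counted as spillover. The sketch below models only that relationship; the queue limit shown is an assumption used for illustration, not a value taken from this documentation.

```python
from collections import deque

# Conceptual model only: relates aws.elb.SurgeQueueLength (queue depth) to
# aws.elb.SpilloverCount (requests rejected because the queue is full).
class SurgeQueue:
    def __init__(self, max_length=1024):   # limit chosen only for illustration
        self.queue = deque()
        self.max_length = max_length
        self.spillover_count = 0            # corresponds to SpilloverCount

    def enqueue(self, request):
        if len(self.queue) >= self.max_length:
            self.spillover_count += 1       # dropped; the client is not notified
        else:
            self.queue.append(request)      # len(self.queue) corresponds to SurgeQueueLength
```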

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.elb.UnHealthyHostCount

The number of unhealthy instances bound to the load balancer. For more information, refer to the Elastic Load Balancing documentation.

Hosts are declared healthy if they meet the threshold for the number of consecutive successful health checks. Hosts that fail more health checks than the unhealthy threshold are considered unhealthy.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

4.10.3.5.1.6 - DynamoDB

DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures and is offered by Amazon as part of the Amazon Web Services portfolio. Amazon CloudWatch aggregates the DynamoDB metrics at one-minute intervals.

In DynamoDB, provisioned throughput requirements are specified in terms of capacity units: read capacity units and write capacity units. One read capacity unit represents one strongly consistent read per second for an item up to 4 KB in size. One write capacity unit represents one write per second for an item up to 1 KB in size. Larger items require more capacity. You can estimate the required capacity by multiplying the number of reads or writes required per second by the capacity units each operation consumes: one unit per 4 KB of item size (rounded up) for reads, or one unit per 1 KB (rounded up) for writes.

For more information, see the Amazon DynamoDB documentation.
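
As a worked example of the capacity-unit arithmetic above, the short Python sketch below estimates read and write capacity units from a target request rate and item size. The rates and item sizes are made-up illustrative numbers.

```python
import math

# Rough capacity estimate based on the rules above: one read capacity unit per
# 4 KB of item size (strongly consistent reads), one write capacity unit per 1 KB.
def read_capacity_units(reads_per_second, item_size_kb):
    return reads_per_second * math.ceil(item_size_kb / 4)

def write_capacity_units(writes_per_second, item_size_kb):
    return writes_per_second * math.ceil(item_size_kb / 1)

print(read_capacity_units(100, 6))    # 100 reads/s of 6 KB items   -> 200 units
print(write_capacity_units(50, 1.5))  # 50 writes/s of 1.5 KB items -> 100 units
```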

aws.dynamodb.ConditionalCheckFailedRequests

The number of failed attempts to perform conditional writes.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ConsumedReadCapacityUnits

The number of read capacity units consumed over the specified time period. Amazon CloudWatch aggregates the metrics at one-minute intervals. Use the Sum aggregation to calculate the consumed throughput: for example, take the Sum value over a span of one minute and divide it by the number of seconds in a minute (60) to calculate the average ConsumedReadCapacityUnits per second.
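
The following snippet works through the calculation just described with illustrative numbers; the same arithmetic applies to ConsumedWriteCapacityUnits.

```python
# If the Sum of ConsumedReadCapacityUnits over one minute is 12,000, the
# average consumption is 12,000 / 60 = 200 units per second.
sum_over_one_minute = 12_000   # illustrative value
seconds_in_minute = 60
average_units_per_second = sum_over_one_minute / seconds_in_minute
print(average_units_per_second)  # 200.0
```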

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ConsumedWriteCapacityUnits

The number of write capacity units consumed over the specified time interval. Amazon CloudWatch aggregates the metrics at one-minute intervals. Use the Sum aggregation to calculate the consumed throughput: for example, take the Sum value over a span of one minute and divide it by the number of seconds in a minute (60) to calculate the average ConsumedWriteCapacityUnits per second.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ProvisionedReadCapacityUnits

The number of read capacity units provisioned for a table or a global secondary index.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ProvisionedWriteCapacityUnits

The number of write capacity units provisioned for a table or a global secondary index.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ReadThrottleEvents

The number of requests to DynamoDB that exceed the provisioned read capacity units.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ReturnedBytes.GetRecords

The number of bytes returned by GetRecords operations during the specified time period.

Metric Type: Counter
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ReturnedItemCount

The number of items returned by query or scan operations during the specified time period.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ReturnedRecordsCount.GetRecords

The number of stream records returned by GetRecords operations during the specified time period.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.SuccessfulRequestLatency

The latency of successful requests to DynamoDB or Amazon DynamoDB Streams during the specified time period, measured in milliseconds.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.SystemErrors

The number of requests made to DynamoDB or Amazon DynamoDB Streams that resulted in an HTTP 500 status code during the specified time period. HTTP 500 usually indicates an internal service error.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.ThrottledRequests

The number of requests to DynamoDB that exceed the provisioned throughput limits on a resource, such as a table or an index. ThrottledRequests is incremented by one if any event within a request exceeds a provisioned throughput limit.

If any individual read or write event within a batch request is throttled, ReadThrottleEvents or WriteThrottleEvents is incremented, respectively.
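
The sketch below illustrates that counting rule for a single batch request, using hypothetical counters rather than any AWS or Sysdig API: each throttled read or write event increments its own *ThrottleEvents counter, while ThrottledRequests is incremented at most once per request.

```python
def count_throttling(events):
    """events: list of (kind, throttled) tuples for one batch request,
    where kind is "read" or "write" and throttled is a boolean."""
    read_throttle_events = sum(1 for kind, throttled in events if throttled and kind == "read")
    write_throttle_events = sum(1 for kind, throttled in events if throttled and kind == "write")
    throttled_requests = 1 if (read_throttle_events or write_throttle_events) else 0
    return read_throttle_events, write_throttle_events, throttled_requests

# A batch write in which 3 of 4 items are throttled:
# WriteThrottleEvents increases by 3, ThrottledRequests by 1.
print(count_throttling([("write", True), ("write", True), ("write", True), ("write", False)]))
# -> (0, 3, 1)
```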

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.UserErrors

The number of requests to DynamoDB or Amazon DynamoDB Streams that returned an HTTP 400 status code during the specified time period. HTTP 400 usually indicates a client-side error.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.dynamodb.WriteThrottleEvents

The number of requests to DynamoDB that exceed the provisioned write capacity units for a table or a global secondary index.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

4.10.3.5.1.7 - Relational Database Service (RDS)

Amazon Relational Database Service (Amazon RDS) is a managed SQL database service provided by Amazon Web Services (AWS). Amazon RDS supports an array of database engines to store and organize data and helps with database management tasks, such as migration, backup, recovery, and patching.

aws.rds.BinLogDiskUsage

The amount of disk space occupied by binary logs on the master. Applies to MySQL read replicas.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.CPUUtilization

The percentage of CPU utilization.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Gauge
Value Type: %
Segment By: CloudProvider
Default Time Aggregation: Average
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.DatabaseConnections

The number of database connections in use.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.DiskQueueDepth

The number of outstanding I/Os (read/write requests) waiting to access the disk.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.FreeableMemory

The amount of available random access memory, in megabytes.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Gauge
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.FreeStorageSpace

The amount of available storage space in bytes.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Gauge
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.NetworkReceiveThroughput

The incoming (Receive) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. The metric is measured in bytes per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.NetworkTransmitThroughput

The outgoing (Transmit) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication. The metric is measured in bytes per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.ReadIOPS

The average number of read I/O operations per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.ReadLatency

The average time, in seconds, taken per read I/O operation.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: relativeTime
Segment By: CloudProvider
Default Time Aggregation: Average
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.ReadThroughput

The average number of bytes read from disk per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.ReplicaLag

The amount of time, in nanoseconds, a Read Replica DB instance lags behind the source DB instance.

This metric applies to MySQL read replicas.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: relativeTime
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.SwapUsage

The amount of swap space used by the database, measured in megabytes.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Gauge
Value Type: Byte
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.WriteIOPS

The average number of write I/O operations per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.WriteLatency

The average time taken per write I/O operation.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: relativeTime
Segment By: CloudProvider
Default Time Aggregation: Average
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Average
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.rds.WriteThroughput

The average number of bytes written to disk per second.

For more information, refer to the Amazon Relational Database (RDS) documentation.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Sum
Available Group Aggregation Formats: Avg, Sum, Min, Max

4.10.3.5.1.8 - Simple Queue Service (SQS)

Amazon Simple Queue Service (Amazon SQS) is a pay-per-use web service for storing messages in transit between computers. Developers use SQS to build distributed applications with decoupled components without having to deal with the overhead of creating and maintaining message queues. For more information, see Amazon SQS Resources.

aws.sqs.ApproximateNumberOfMessagesDelayed

The number of messages in the queue that are delayed and not yet available for reading. Messages are delayed when the queue is configured as a delay queue or when a message is sent with a delay parameter.
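
Both ways of producing delayed messages can be seen in a short boto3 sketch; the queue URL is a placeholder and the delay values are illustrative. Messages created this way are counted in ApproximateNumberOfMessagesDelayed until their delay expires.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

# Per-message delay: this message stays delayed (unavailable for reading) for 120 seconds.
sqs.send_message(QueueUrl=queue_url, MessageBody="hello", DelaySeconds=120)

# Queue-level delay: configures the queue as a delay queue, so every message
# sent without an explicit DelaySeconds is delayed by the queue default.
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"DelaySeconds": "120"})
```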

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Avg
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.sqs.ApproximateNumberOfMessagesNotVisible

The number of messages that are in flight. These messages have been sent to a client but have not yet been deleted and have not yet reached the end of their visibility timeout.

Metric Type: Counter
Value Type: Integer
Segment By: CloudProvider
Default Time Aggregation: Rate
Available Time Aggregation Formats: Avg, Rate, Sum, Min, Max
Default Group Aggregation: Avg
Available Group Aggregation Formats: Avg, Sum, Min, Max

aws.sqs.ApproximateNumberOfMessagesVisible

The number of messages available for retrieval from the queue. These are the messages which have not yet been locked by an SQS worker.

Metric Type: Counter
V