
                Metrics

                Metrics are quantitative values or measures that can be grouped or divided by labels. Sysdig Monitor metrics are divided into two groups: default metrics (out-of-the-box metrics concerning the system, orchestrator, and network infrastructure), and custom metrics (JMX, StatsD, and multiple other integrated application metrics).

                Sysdig automatically collects all types of metrics, and auto-labels them. Custom metrics can also have custom (user-defined) labels.

                Out of the box, when an agent has been deployed on a host, Sysdig Monitor automatically begins collecting and reporting on a wide array of metrics. The sections below describe how those metrics are conceptualized within the system.

                Learn more about the metric types and the data aggregation techniques supported by Sysdig Monitor in the following sections:

                1 -

                Grouping, Scoping, and Segmenting Metrics

                Data aggregation and filtering in Sysdig Monitor are done through the use of assigned labels. The sections below explain how labels work, the ways they can be used, and how to work with groupings, scopes, and segments.

                Labels

                Labels are used to identify and differentiate characteristics of a metric, allowing them to be aggregated or filtered for Explore module views, dashboards, alerts, and captures. Labels can be used in different ways:

                • To group infrastructure objects into logical hierarchies displayed on the Explore tab (called groupings). For more information, refer to Groupings.

                • To split aggregated data into segments. For more information, refer to Segments.

                There are two types of labels:

                • Infrastructure labels

                • Metric descriptor labels

                Infrastructure Labels

                Infrastructure labels are used to identify objects or entities within the infrastructure that a metric is associated with, including hosts, containers, and processes. An example label is shown below:

                kubernetes.pod.name
                

                The table below outlines what each part of the label represents:

                Example Label Component | Description
                kubernetes | The infrastructure type.
                pod | The object.
                name | The label key.
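
                The three-part structure in the table can be illustrated with a short sketch. The InfraLabel type and parse_infra_label function are hypothetical names introduced here for illustration, and the sketch only handles labels that follow the three-part shape of the example above:

```python
from typing import NamedTuple


class InfraLabel(NamedTuple):
    """Hypothetical container for the three parts of an infrastructure label."""
    infrastructure_type: str  # e.g. "kubernetes"
    object_name: str          # e.g. "pod"
    key: str                  # e.g. "name"


def parse_infra_label(label: str) -> InfraLabel:
    """Split a label shaped like '<type>.<object>.<key>' into its parts."""
    parts = label.split(".")
    if len(parts) != 3:
        # Assumption: only the three-part form from the example is supported.
        raise ValueError(f"expected <type>.<object>.<key>, got {label!r}")
    return InfraLabel(*parts)


parsed = parse_infra_label("kubernetes.pod.name")
print(parsed.infrastructure_type, parsed.object_name, parsed.key)
```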

                Infrastructure labels are obtained from the infrastructure (including from orchestrators, platforms, and the runtime processes), and Sysdig automatically builds a relationship model using the labels. This allows users to create logical hierarchical groupings to better aggregate the infrastructure objects in the Explore module.

                For more information on groupings, refer to Groupings.

                Metric Descriptor Labels

                Metric descriptor labels are custom descriptors or key-value pairs applied directly to metrics, obtained from integrations like StatsD, Prometheus, and JMX. Sysdig automatically collects custom metrics from these integrations, and parses the labels from them. Unlike infrastructure labels, these labels can be arbitrary, and do not necessarily map to any entity or object.

                Metric descriptor labels can only be used for segmenting, not grouping or scoping.

                An example metric descriptor label is shown below:

                website_failedRequests:20|region=‘Asia’, customer_ID=‘abc’
                

                The table below outlines what each part of the label represents:

                Example Label Component | Description
                website_failedRequests | The metric name.
                20 | The metric value.
                region=‘Asia’, customer_ID=‘abc’ | The metric descriptor labels. Multiple key-value pairs can be assigned using a comma-separated list.
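
                As a rough illustration of the breakdown above, the sketch below splits the example line into a name, a value, and a label dictionary. It mirrors only the illustrative layout shown in this section (name:value|key=value, ...), not the wire format of any integration, and parse_descriptor is a hypothetical helper name:

```python
def parse_descriptor(line):
    """Split 'name:value|k1=v1, k2=v2' into (name, value, labels)."""
    head, _, label_part = line.partition("|")
    name, _, raw_value = head.partition(":")
    labels = {}
    for pair in label_part.split(","):
        if not pair.strip():
            continue  # tolerate a missing or empty label section
        key, _, value = pair.partition("=")
        # Strip straight or typographic quotes around the label value.
        labels[key.strip()] = value.strip().strip("'\"‘’")
    return name, float(raw_value), labels


name, value, labels = parse_descriptor(
    "website_failedRequests:20|region='Asia', customer_ID='abc'"
)
print(name, value, labels)
```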

                Sysdig recommends not using labels to store dimensions with high cardinalities (numerous different label values), such as user IDs, email addresses, URLs, or other unbounded sets of values. Each unique key-value label pair represents a new time series, which can dramatically increase the amount of data stored.
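
                To see why high-cardinality labels are costly, consider the arithmetic. The sketch below computes an upper bound on the number of time series for one metric, assuming every combination of label values actually occurs; the label names and counts are made up for illustration:

```python
from math import prod
from typing import Mapping


def max_series_count(distinct_values_per_label: Mapping[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(distinct_values_per_label.values())


# A bounded label set stays manageable.
print(max_series_count({"region": 5, "customer_ID": 200}))   # 1000

# Swapping in an unbounded label such as an email address explodes the count.
print(max_series_count({"region": 5, "user_email": 50_000}))  # 250000
```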

                Groupings

                Groupings are hierarchical organizations of labels, allowing users to organize their infrastructure views on the Explore tab in a logical hierarchy. An example grouping is shown below:

                The example above groups the infrastructure into four levels. This results in a tree view in the Explore module with four levels, with rows for each infrastructure object applicable to each level.

                As each label is selected, Sysdig Monitor automatically filters out labels for the next selection that no longer fit the hierarchy, to ensure that only logical groupings are created.
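
                Conceptually, a grouping turns a flat list of labeled objects into a tree with one level per label. The sketch below is a simplified model of that idea, not Sysdig's implementation; the four-label grouping and the entity data are hypothetical, and every entity is assumed to carry all four labels:

```python
def build_tree(entities, grouping):
    """Nest entities into a dict tree, one level per label in the grouping."""
    tree = {}
    for entity in entities:
        node = tree
        for label in grouping:
            # Descend into (or create) the child node for this label value.
            node = node.setdefault(entity[label], {})
    return tree


# Hypothetical four-level grouping.
grouping = [
    "kubernetes.cluster.name",
    "kubernetes.namespace.name",
    "kubernetes.deployment.name",
    "kubernetes.pod.name",
]
entities = [
    {"kubernetes.cluster.name": "prod", "kubernetes.namespace.name": "web",
     "kubernetes.deployment.name": "frontend", "kubernetes.pod.name": "frontend-1"},
    {"kubernetes.cluster.name": "prod", "kubernetes.namespace.name": "web",
     "kubernetes.deployment.name": "frontend", "kubernetes.pod.name": "frontend-2"},
]
print(build_tree(entities, grouping))
```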

                The example below shows the logical hierarchy structure for Kubernetes:

                • Clusters: Cluster > Namespace > Replicaset > Pod

                • Namespace: Cluster > Namespace > HorizontalPodAutoscaler > Deployment > Pod

                • Daemonsets: Cluster > Namespace > Daemonsets > Pod

                • Services: Cluster > Namespace > Service > StatefulSet > Pod

                • Job: Cluster > Namespace > Job > Pod

                • ReplicationController: Cluster > Namespace > ReplicationController > Pod

                The default groupings are immutable: They cannot be modified or deleted. However, you can make a copy of them that you can modify.

                Unified Workload Labels

                Sysdig provides the following labels to help improve your infrastructure organization and make troubleshooting easier.

                • kubernetes.workload.name: Displays all the Kubernetes workloads and indicates what type and name of workload resource (deployment, daemonSet, replicaSet, and so on) it is.

                • kubernetes.workload.type: Indicates what type of workload resource (deployment, daemonSet, replicaSet, and so on) it is.

                The availability of these labels also simplifies Groupings. You do not need a different grouping for each type of workload resource. Instead, you can use a single grouping for workloads.

                The labels allow you to segment metrics, such as cpu.used.percent, by kubernetes.workload.name to see CPU usage for all the workloads, instead of having a separate query for segmenting by kubernetes.deployment.name, kubernetes.replicaSet.name, and so on.


                Scopes

                A scope is a collection of labels that are used to filter out or define the boundaries of a group of data points when creating dashboards, dashboard panels, alerts, and teams. An example scope is shown below:

                In the example above, the scope is defined by two labels with operators and values defined. The table below defines each of the available operators.

                Operator | Description
                is | The value matches the defined label value exactly.
                is not | The value does not match the defined label value exactly.
                in | The value is among the comma-separated values entered.
                not in | The value is not among the comma-separated values entered.
                contains | The label value contains the defined value.
                does not contain | The label value does not contain the defined value.
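
                The operator semantics in the table can be modeled in a few lines. This is an illustrative sketch rather than the scope editor's actual logic. It assumes that all filters in a scope must match (logical AND) and that an entity lacking a label never matches a filter on it; in_scope and the sample labels are hypothetical:

```python
def _split(operand):
    """Turn a comma-separated operand into a list of trimmed values."""
    return [item.strip() for item in operand.split(",")]


OPERATORS = {
    "is": lambda value, operand: value == operand,
    "is not": lambda value, operand: value != operand,
    "in": lambda value, operand: value in _split(operand),
    "not in": lambda value, operand: value not in _split(operand),
    "contains": lambda value, operand: operand in value,
    "does not contain": lambda value, operand: operand not in value,
}


def in_scope(labels, scope):
    """True when every (label, operator, operand) filter matches the labels."""
    return all(
        label in labels and OPERATORS[operator](labels[label], operand)
        for label, operator, operand in scope
    )


labels = {
    "kubernetes.namespace.name": "prod",
    "kubernetes.deployment.name": "web-frontend",
}
scope = [
    ("kubernetes.namespace.name", "in", "prod, staging"),
    ("kubernetes.deployment.name", "contains", "web"),
]
print(in_scope(labels, scope))  # True
```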

                The scope editor provides dynamic filtering capabilities. It restricts the scope of the selection for subsequent filters by rendering valid values that are specific to the previously selected label. Expand the list to view unfiltered suggestions. At run time, users can also supply custom values to achieve more granular filtering. The custom values are preserved. Note that changing a label higher up in the hierarchy might render the subsequent labels incompatible. For example, changing the kubernetes.namespace.name > kubernetes.deployment.name hierarchy to swarm.service.name > kubernetes.deployment.name is invalid as these entities belong to different orchestrators and cannot be logically grouped.

                Dashboards and Panels

                Dashboard scopes define the criteria for what metric data will be included in the dashboard’s panels. The current dashboard’s scope can be seen at the top of the dashboard:

                By default, all dashboard panels abide by the scope of the overall dashboard. However, an individual panel scope can be configured for a different scope than the rest of the dashboard.

                For more information on Dashboards and Panels, refer to the Dashboards documentation.

                Alerts

                Alert scopes are defined during the creation process, and specify what areas within the infrastructure the alert is applicable for. In the example alerts below, the first alert has a scope defined, whereas the second alert does not have a custom scope defined. If no scope is defined, the alert is applicable to the entire infrastructure.

                For more information on Alerts, refer to the Alerts documentation.

                Teams

                A team’s scope determines the highest level of data that team members have visibility for:

                • If a team’s scope is set to Host, team members can see all host-level and container-level information.

                • If a team’s scope is set to Container, team members can only see container-level information.

                A team’s scope only applies to that team. Users that are members of multiple teams may have different visibility depending on which team is active.

                For more information on teams and configuring team scope, refer to the Manage Teams and Roles documentation.

                Segments

                Aggregated data can be split into smaller sections by segmenting the data with labels. This allows for the creation of multi-series comparisons and multiple alerts. In the first image, the metric is not segmented:

                In the second image, the same metric has been segmented by container.id:

                Line and Area panels can display up to five different segments for any given metric. The example image below displays the net.byte.in metric segmented by both container.id and net.http.url:

                For more information regarding segmentation in dashboard panels, refer to the Configure Panels documentation. For more information regarding configuring alerts, refer to the Alerts documentation.

                The Meaning of n/a

                Sysdig Monitor imports data related to entities such as hosts, containers, processes, and so on, and reports them in tables or panels on the Explore and Dashboards UI, as well as in events, so across the UI you see varieties of data. The term n/a can appear anywhere on the UI where some form of data is displayed.

                n/a is a term that indicates data that is not available or that does not apply to a particular instance. In Sysdig parlance, the term signifies one or more entities defined by a particular label, such as hostname or Kubernetes service, for which the label is invalid. In other words, n/a collectively represents entities whose metadata is not relevant to the aggregation and filtering techniques: Grouping, Scoping, and Segmenting. For instance, a list of Kubernetes services might display all the services as well as an n/a entry that includes all the containers without the metadata describing a Kubernetes service.

                You might encounter n/a sporadically in Explore UI as well as in drill-down panels or dashboards, events, and likely elsewhere on the Sysdig Monitor UI when no relevant metadata is available for that particular display. How n/a should be treated depends on the nature of your deployment. The deployment will not be affected by the entities marked n/a.

                The following are some of the cases that yield n/a on the UI:

                • Labels are partially available or not available. For example, a host has entities that are not associated with a monitored Kubernetes deployment, or a monitored host has an unmonitored Kubernetes deployment running.

                • Labels that do not apply to the grouping criteria or at the hierarchy level. For example:

                  • Containers that are not managed by Kubernetes. The containers managed by Kubernetes are identified with their container.name labels.

                  • In certain groupings by DaemonSet, Deployments render n/a and vice versa. Not all containers belong to both DaemonSet and Deployment objects concurrently. Likewise, a Kubernetes ReplicaSet grouping with the kubernetes.replicaset.name label will not show StatefulSets.

                  • In a kubernetes.cluster.name > kubernetes.namespace.name > kubernetes.deployment.name grouping, the entities without the kubernetes.cluster.name label yield n/a.

                • Entities are incorrectly labeled in the infrastructure.

                • Kubernetes features that are yet to be in sync with Sysdig Monitor.

                • The format is not applicable to a particular record in the database.
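
                The Kubernetes service example above can be modeled as a simple group-by in which entities lacking the label fall into a shared n/a bucket. This is a conceptual sketch only; the container names and the group_by_label helper are hypothetical:

```python
from collections import defaultdict


def group_by_label(entities, label):
    """Group entity names by a label value; entities without it go to 'n/a'."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["labels"].get(label, "n/a")].append(entity["name"])
    return dict(groups)


containers = [
    {"name": "api-1", "labels": {"kubernetes.service.name": "api"}},
    {"name": "api-2", "labels": {"kubernetes.service.name": "api"}},
    {"name": "cron-job", "labels": {}},  # not part of any Kubernetes service
]
print(group_by_label(containers, "kubernetes.service.name"))
```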

                2 -

                Understanding Default, Custom, and Missing Metrics

                Default Metrics

                Default metrics include various kinds of metadata which Sysdig Monitor automatically knows how to label, segment, and display.

                For example:

                • System metrics for hosts, containers, and processes (CPU used, etc.)

                • Orchestrator metrics (collected from Kubernetes, Mesos, etc.)

                • Network metrics (e.g. network traffic)

                • HTTP

                • Platform metrics (in some cases)

                Default metrics are collected mainly from two sources: syscalls and Kubernetes.

                Custom Metrics

                About Custom Metrics

                Custom metrics generally refer to any metrics that the Sysdig Agent collects from some third-party integration. The type of infrastructure and applications integrated determine the custom metrics that the Agent collects and reports to Sysdig Monitor. The supported custom metrics are:

                Each metric comes with a set of custom labels, and additional labels can be user-created. Sysdig Monitor simply collects and reports them with minimal or no internal processing. The limit currently enforced is 3000 metrics per host. Use the metrics_filter option in the dragent.yaml file to remove unwanted metrics or to choose the metrics to report when hosts exceed this limit. For more information on editing the dragent.yaml file, see Understanding the Agent Config Files.

                Unit for Custom Metrics

                Sysdig Monitor automatically detects the default unit of a custom metric from the delimiter suffix in the metric name. For example, custom_expvar_time_seconds results in a base unit set to seconds. The supported base units are byte, percent, and time. A custom metric name should carry one of the following delimiter suffixes for Sysdig Monitor to identify and configure the correct unit type.

                • second

                • seconds

                • byte

                • bytes

                • total (represents accumulating count)

                • percent

                Unless this naming convention is followed, the unit of a custom metric will not be auto-detected and will be incorrect. For instance, custom_byte_expvar will not yield the correct unit (MiB), because byte is not the suffix of the name.
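
                The naming convention can be illustrated with a small suffix lookup. This sketch is not Sysdig's detection code: it assumes the delimiter is an underscore (as in the examples above), and the "count" unit name for the total suffix is a label chosen here for illustration:

```python
# Suffixes from the list above, mapped to the base unit they imply.
SUFFIX_TO_UNIT = {
    "second": "time",
    "seconds": "time",
    "byte": "byte",
    "bytes": "byte",
    "percent": "percent",
    "total": "count",  # accumulating count; unit name is an assumption
}


def detect_unit(metric_name, delimiter="_"):
    """Return the base unit implied by the last name segment, or None."""
    suffix = metric_name.rsplit(delimiter, 1)[-1]
    return SUFFIX_TO_UNIT.get(suffix)


print(detect_unit("custom_expvar_time_seconds"))  # time
print(detect_unit("custom_byte_expvar"))          # None: 'byte' is not the suffix
```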

                Editing the Unit Scale

                You can change the unit scale either by editing the panel on the Dashboard or in Explore.

                Explore

                From the Search Metrics and Dashboard drop-down, select the custom metrics you want to edit the unit selection for, then click More Options. Select the desired unit scale from the Metric Format drop-down and click Save.

                Dashboard

                Select the Dashboard Panel associated with the custom metrics you want to modify. Select the desired unit scale from the Metrics drop-down and click Save.

                Display Missing Data

                Data can be missing for a few different reasons:

                • Problems such as faulty network connectivity in the communication channel between your infrastructure and Sysdig metrics store.

                • Metrics or StatsD batch jobs are submitted sporadically.

                Sysdig Monitor allows you to configure the behavior of missing data in Dashboards. Though metric type determines the default behavior, you can configure how to visualize missing data and define it at the per-query level. Use the No Data Display drop-down in the Options menu in the panel configuration. See Create a New Panel for more information.

                Consider the following guidelines:

                • The No Data Display drop-down has only two options for the Stacked Area timechart: gap and show as zero.

                • For the Number panel, the No Data Display option allows entering a custom no data text.

                • For form-based timechart panels, the default option for a metrics selection that does not contain a StatsD metric is gap.

                • Adding a StatsD metric to a query in a form-based timechart panel changes the selected No Data Display type to show as zero, which is the default option for form-based StatsD metrics. You can change this selection to any other type.

                • The default display option is gap for PromQL Timechart panels.

                The options for No Data Display are:

                • gap: The default option for form-based timechart panels where the query metrics selection does not contain a StatsD metric. gap is the best visualization type for most use cases because a visible gap makes a problem easy to spot.

                • show as zero: The best option for StatsD metrics which are only submitted sporadically. For example, batch jobs and count of errors. This is the default display option for StatsD metrics in form-based panels.

                  We do not recommend this option as setting zero could be misleading. For example, this setting will report the value for free disk space as 0% when the disk or host disappears, but in reality, the value is unknown.

                  Prometheus best practices recommend avoiding missing metrics.

                • connect - solid: Use for measuring the value of a metric, typically a gauge, where you want to visualize the missing samples flattened.

                  The leftmost and rightmost visible data points can be connected as Sysdig does not perform the interpolation.

                • connect - dotted: Use it for measuring the value of a metric, typically a gauge, where you want to visualize the missing samples flattened.

                  The leftmost and rightmost visible data points can be connected as Sysdig does not perform the interpolation.
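
                The difference between gap and show as zero can be sketched as two ways of filling timestamps that have no sample. This is a conceptual model of the two display options, not the panel's rendering code; the connect options are not modeled, and apply_no_data is a hypothetical helper:

```python
def apply_no_data(points, expected_timestamps, mode="gap"):
    """Fill missing timestamps with None (gap) or 0 (show as zero)."""
    fill = {"gap": None, "show as zero": 0}[mode]
    by_timestamp = dict(points)
    return [(ts, by_timestamp.get(ts, fill)) for ts in expected_timestamps]


points = [(0, 5), (20, 7)]   # the sample at t=10 is missing
timeline = [0, 10, 20]

print(apply_no_data(points, timeline, "gap"))           # [(0, 5), (10, None), (20, 7)]
print(apply_no_data(points, timeline, "show as zero"))  # [(0, 5), (10, 0), (20, 7)]
```

                With show as zero, the missing sample is indistinguishable from a real zero, which is why the option can mislead for metrics such as free disk space.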

                3 -

                Prometheus Metrics Types

                Sysdig Monitor transforms Prometheus metrics into usable, actionable entries in two ways:

                Calculated Metrics

                The Prometheus metrics that are scraped by the Sysdig agent and transformed into the traditional StatsD model are called calculated metrics. In calculated metrics, the delta from the previous value is stored. This delta is what Sysdig uses on the classic backend for metric analysis and visualization. While generating the calculated metrics, the gauge metrics are kept as they are, but the counter metrics are transformed.

                Prometheus calculated metrics cannot be used in PromQL.

                The Histogram and Summary metrics are transformed into a different format called Prometheus histogram and summary metrics respectively. The transformations include:

                • All of the quantiles are transformed into a different metric, with the quantile added as a suffix.

                • The count and sum of these summary metrics are exposed as different metrics with slightly changed names: the underscore (_) in the name is replaced with a period (.). For more information, see Mapping Between Classic Metrics and PromQL Metrics.

                Prometheus calculated metrics (legacy metrics) are scheduled to be deprecated in the coming months.

                Raw Metrics

                In Sysdig parlance, the Prometheus metrics that are scraped (by the Sysdig agent), collected, sent, stored, visualized, and presented exactly as Prometheus exposes them are called raw metrics. Raw metrics are used with PromQL.

                A Sysdig counter is a StatsD-type counter, which keeps the difference in value rather than the raw value of the counter, whereas Prometheus raw counters are always monotonically increasing. A rate function needs to be applied to Prometheus raw metrics to make sense of them.
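
                The contrast between the two counter models can be shown with a few numbers. The sketch below derives StatsD-style deltas from a monotonically increasing raw counter; it illustrates the idea only, ignores counter resets, and is not the agent's actual transformation:

```python
def to_deltas(raw_counter):
    """Differences between consecutive samples of an increasing counter."""
    return [later - earlier for earlier, later in zip(raw_counter, raw_counter[1:])]


raw = [100, 130, 130, 190]   # raw values: only ever increase
print(to_deltas(raw))        # [30, 0, 60]: the change per interval
```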

                Time Aggregations Over Prometheus Metrics

                The following time aggregations are supported for both the metric types:

                • Average: Returns an average of a set of data points, keeping all the labels.

                • Maximum and Minimum: Returns a maximal or minimal value, keeping all the labels.

                • Sum: Returns a sum of the values of data points, keeping all the labels.

                • Rate (timeAvg): Returns a sum of changes to the counter across data points in a given time period and divides by time, keeping all the labels as they are. For Prometheus raw metrics, timeAvg is calculated by taking the difference and dividing it by time.

                Prometheus Calculated Metrics

                Prometheus calculated metrics are treated as gauges by Sysdig, and therefore the following time aggregations are available:

                • Average

                • Sum

                • Minimum

                • Maximum

                Rate (timeAvg) is not available because it is not applicable to gauge metrics.

                Prometheus Raw Metrics

                For the gauge type, the following types are available:

                • Average

                • Minimum

                • Maximum

                For the counter type, the following types are available:

                • Rate: Calculates the first derivative of the counter (change over time).

                • Sum: Calculates a complete change of the counter over a period of time.
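
                For a raw counter, the two aggregations reduce to simple arithmetic on the first and last samples of the period. The following sketch assumes (timestamp in seconds, value) pairs and no counter resets; the function names are hypothetical:

```python
def counter_sum(samples):
    """Total change of the counter over the period."""
    return samples[-1][1] - samples[0][1]


def counter_rate(samples):
    """Change per second: the total change divided by the elapsed time."""
    elapsed = samples[-1][0] - samples[0][0]
    return counter_sum(samples) / elapsed


samples = [(0, 100), (10, 130), (20, 130), (30, 190)]
print(counter_sum(samples))   # 90
print(counter_rate(samples))  # 3.0
```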

                4 -

                Heuristic and Deprecated Metrics

                Heuristic Metrics

                Various network-related metrics reported by Sysdig, including response times, are calculated at the kernel level by measuring latency between system calls. To ensure that Sysdig remains the trusted source of infrastructure insights, some network-related metrics are now labeled as heuristic and tagged with a symbol in the application.

                Existing alerts using these metrics will not be modified or disabled. However, these alerts will not be able to be updated.

                Additional heuristic metric details are listed below:

                Metric | Set New Alerts
                net.http.request.time | Yes
                net.http.request.count | Yes
                net.http.error.count | Yes
                net.sql.request.time | Yes
                net.sql.request.count | Yes
                net.sql.error.count | Yes
                net.mongodb.request.time | Yes
                net.mongodb.request.count | Yes
                net.mongodb.error.count | Yes
                net.request.time.file.percent | Yes
                net.request.time.local.percent | Yes
                net.request.time.net.percent | Yes
                net.request.time.nextTiers.percent | Yes
                net.request.time.processing.percent | Yes
                net.request.time | No
                net.request.time.in | No
                net.request.time.out | No
                net.request.time.worst.in | No
                net.request.time.worst.out | No
                net.request.count | No
                net.request.count.in | No

                Deprecated Metrics

                Based on low usage patterns, Sysdig has decided to deprecate the following metrics on August 1, 2018. Users will continue to have the ability to collect similar data using Prometheus, or another method of code instrumentation (for example, StatsD, or JMX for Java applications).

                The table below shows the current metrics and options for similar functionality.

                Current Metric | Alternative Starting August 1, 2018
                capacity.estimated.request.stolen.count | Create your application metrics using Prometheus, StatsD or JMX for Java applications.
                capacity.estimated.request.total.count |
                capacity.stolen.percent |
                capacity.total.percent |
                capacity.used.percent |
                net.request.time.file |
                net.request.time.local |
                net.request.time.net |
                net.request.time.nextTiers |
                net.request.time.processing |
                net.sql.request.time.worst | Max aggregation (net.sql.request.time)
                net.mongodb.request.time.worst | Max aggregation (net.mongodb.request.time)
                net.http.request.time.worst | Max aggregation (net.http.request.time)

                5 -

                Manage Metric Scale

                Sysdig provides several knobs for managing metric scale.

                There are three primary ways in which you could include/exclude metrics, should you encounter unwanted metrics limits.

                1. Include/exclude custom metrics by name filters.

                  See Include/Exclude Custom Metrics.

                2. Include/exclude metrics emitted by certain containers, Kubernetes annotations, or any other container label at collection time.

                  See Prioritize/Include/Exclude Designated Containers.

                3. Exclude metrics from unwanted ports.

                  See Blacklist Ports.

                6 -

                Data Aggregation

                Sysdig Monitor allows users to adjust the aggregation settings when graphing or creating alerts for a metric, informing how Sysdig rolls up the available data samples in order to create the chart or evaluate the alert. There are two forms of aggregation used for metrics in Sysdig: time aggregation and group aggregation.

                Time aggregation is always performed before group aggregation.

                Time Aggregation

                Time aggregation comes into effect in two overlapping situations:

                • Charts can only render a limited number of data points. To look at a wide range of data, Sysdig Monitor may need to aggregate granular data into larger samples for visualization.

                • Sysdig Monitor rolls up historical data over time.

                  Sysdig retains rollups based on each aggregation type, to allow users to choose which data points to utilize when evaluating older data.

                Sysdig agents collect 1-second samples and report data at 10-second resolution. The data is stored and reported every 10 seconds with the available aggregations (average, rate, min, max, sum) to make them available via the Sysdig Monitor UI and the API. For time series charts covering five minutes or less, data points are drawn at this 10-second resolution, and any time aggregation selections will have no effect. When an amount of time greater than five minutes is displayed, data points are drawn as an aggregate for an appropriate time interval. For example, for a chart covering one hour, each data point would reflect a one-minute interval.

                At time intervals of one minute and above, charts can be configured to display different aggregates for the 10-second metrics used to calculate each datapoint.

                Aggregation Type | Description
                average | The average of the retrieved metric values across the time period.
                rate | The average value of the metric across the time period evaluated.
                maximum | The highest value during the time period evaluated.
                minimum | The lowest value during the time period evaluated.
                sum | The combined sum of the metric across the time period evaluated.
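
                As a rough model of how 10-second samples are rolled up into larger intervals, the sketch below buckets (timestamp, value) pairs into fixed windows and reduces each bucket with the chosen aggregation. It is a simplified illustration, not Sysdig's rollup code, and it leaves out rate; roll_up and the sample data are hypothetical:

```python
from statistics import mean

AGGREGATIONS = {"average": mean, "maximum": max, "minimum": min, "sum": sum}


def roll_up(samples, interval_seconds, aggregation):
    """Aggregate (timestamp, value) samples into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in samples:
        # Align each sample to the start of its interval.
        start = timestamp - timestamp % interval_seconds
        buckets.setdefault(start, []).append(value)
    reduce = AGGREGATIONS[aggregation]
    return {start: reduce(values) for start, values in sorted(buckets.items())}


# Two minutes of 10-second samples: values 1..6, then six samples of 10.
samples = [(t, v) for t, v in zip(range(0, 120, 10), [1, 2, 3, 4, 5, 6] + [10] * 6)]
print(roll_up(samples, 60, "average"))  # {0: 3.5, 60: 10}
print(roll_up(samples, 60, "sum"))      # {0: 21, 60: 60}
```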

                In the example images below, the kubernetes.deployment.replicas.available metric first uses the average for time aggregation:

                Then uses the sum for time aggregation:

                • Rate and average are very similar and often provide the same result. However, the calculation of each is different.

                  • If time aggregation is set to one minute, the agent is supposed to retrieve six samples (one every 10 seconds).

                  • In some cases, samples may not be there, due to disconnections or other circumstances. For this example, four samples are available. If this was the case, the average would be calculated by dividing by four, while the rate would be calculated by dividing by six.

                • Most metrics are sampled once for each time interval, resulting in average and rate returning the same value. However, there will be a distinction for any metrics not reported at every time interval. For example, some custom statsd metrics.

                • Rate is currently referred to as timeAvg in the Sysdig Monitor API and advanced alerting language.

                • By default, average is used when displaying data points for a time interval.
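
                The four-of-six example above works out as follows. In this sketch, average divides by the number of samples actually received, while rate divides by the number expected for the interval; the sample values are made up:

```python
def average(samples):
    """Divide by the number of samples that actually arrived."""
    return sum(samples) / len(samples)


def rate(samples, expected_count):
    """Divide by the number of samples expected for the interval."""
    return sum(samples) / expected_count


received = [30, 30, 30, 30]   # four of the six expected 10-second samples
print(average(received))      # 30.0
print(rate(received, 6))      # 20.0
```

                When all six samples arrive, both functions return the same value, which matches the note that most metrics show no difference between the two.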

                Group Aggregation

                Metrics applied to a group of items (for example, several containers, hosts, or nodes) are averaged between the members of the group by default. For example, three hosts report different CPU usage for one sample interval. The three values will be averaged, and reported on the chart as a single datapoint for that metric.

                There are several different types of group aggregation:

                Aggregation Type | Description
                average | The average value of the interval’s samples.
                maximum | The maximum value of the interval’s samples.
                minimum | The minimum value of the interval’s samples.
                sum | The combined value of all of the interval’s samples.
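
                The three-host example can be written out as a small calculation. The host names and CPU values below are hypothetical; the sketch simply applies each group aggregation type to one sample interval:

```python
from statistics import mean

GROUP_AGGREGATIONS = {"average": mean, "maximum": max, "minimum": min, "sum": sum}


def aggregate_group(values_by_member, aggregation="average"):
    """Reduce one interval's values across the group to a single data point."""
    return GROUP_AGGREGATIONS[aggregation](values_by_member.values())


cpu_percent = {"host-a": 20.0, "host-b": 40.0, "host-c": 90.0}
print(aggregate_group(cpu_percent))             # 50.0 (default: average)
print(aggregate_group(cpu_percent, "maximum"))  # 90.0
print(aggregate_group(cpu_percent, "sum"))      # 150.0
```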

                If a chart or alert is segmented, the group aggregation settings will be utilized for both aggregations across the whole group, and aggregation within each individual segmentation.

                For example, the image below shows a chart for CPU% across the infrastructure:

                When segmented by proc.name, the chart shows one CPU% line for each process:

                Each line provides the average value for every process with the same name. To see the difference, change the group aggregation type to sum:

                The metric aggregation value shown beside the metric name is for the time aggregation. While the screenshot shows AVG, the group aggregation is set to SUM.

                Aggregation Examples

                The tables below provide an example of how each type of aggregation works. The first table provides the metric data, while the second displays the resulting value for each type of aggregation.

                In the example below, the CPU% metric is applied to a group of servers called webserver. The first chart shows metrics using average aggregation for both time and group. The second chart shows the metrics using maximum aggregation for both time and group.

                For each one-minute interval, the second chart renders the highest CPU usage value found from the servers in the webserver group and from all of the samples reported during the one-minute interval. This view can be useful when searching for transient spikes in metrics over long periods of time that would otherwise be missed with average aggregation.
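
                The webserver example can be approximated numerically. The sketch below applies time aggregation per server first and group aggregation second, in the order described in this section; the server names and CPU values are invented for illustration:

```python
from statistics import mean


def time_then_group(samples_by_member, time_aggregation, group_aggregation):
    """Aggregate each member over time, then aggregate across the group."""
    per_member = [time_aggregation(samples) for samples in samples_by_member.values()]
    return group_aggregation(per_member)


# One minute of 10-second CPU% samples for a hypothetical webserver group.
webserver = {
    "web-1": [10, 20, 90, 20, 10, 30],  # contains a transient spike
    "web-2": [40, 40, 40, 40, 40, 40],
}
print(time_then_group(webserver, mean, mean))  # 35: the spike is averaged away
print(time_then_group(webserver, max, max))    # 90: the spike is preserved
```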

                The group aggregation type is dependent on the segmentation. For a view showing metrics for a group of items, the current group aggregation setting will revert to the default setting, if the Segment By selection is changed.

                7 -

                Metric Limits

                Sysdig ensures that you see the metric information most relevant to your monitored environment. To achieve this, limits are enforced on the number of metrics that the datastore can store. Different limits apply to different metric types and agent versions.

                Enterprise

                The metric limits are automatically set by the Sysdig backend components based on your plan, agent version, and backend configuration. The default limits are provided below:

                Metric Type | Metric Limit | Description
                Prometheus | 8000 | Set other custom metric limits to zero to increase the Prometheus metrics limit to 10,000.
                StatsD | 1000 |
                JMX | 500 |
                AppChecks | 500 |
                Total | 10,000 | The total number of custom metrics across all metric types should not exceed 10,000.

                The custom metrics limit of 10,000 does not include the agent metrics that are provided out-of-the-box, such as host, container, and Kube State Metrics.
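
                A quick way to reason about the default limits is to compare per-type counts against the table. The sketch below is an illustrative check only: it does not model how the backend enforces limits, the counts are hypothetical, and exceeded_limits is a name introduced here:

```python
# Default limits from the table above.
DEFAULT_LIMITS = {"Prometheus": 8000, "StatsD": 1000, "JMX": 500, "AppChecks": 500}
TOTAL_LIMIT = 10_000


def exceeded_limits(counts):
    """Return the metric types (and 'Total') whose counts exceed the defaults."""
    exceeded = [kind for kind, count in counts.items() if count > DEFAULT_LIMITS[kind]]
    if sum(counts.values()) > TOTAL_LIMIT:
        exceeded.append("Total")
    return exceeded


counts = {"Prometheus": 7500, "StatsD": 1200, "JMX": 100, "AppChecks": 0}
print(exceeded_limits(counts))  # ['StatsD']
```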

                View Metric Limits

                Use the Sysdig Agent Health & Status dashboard under Host Infrastructure templates to view current usage per host for each metric type.

                The metric limits are exposed to the UI through the following agent metrics.

                Metric | Description
                dragent.metricCount.limit.appCheck | The maximum number of unique appCheck timeseries that are allowed in an individual sample from the agent per node.
                dragent.metricCount.limit.statsd | The maximum number of unique statsd timeseries that are allowed in an individual sample from the agent per node.
                dragent.metricCount.limit.jmx | The maximum number of unique JMX timeseries that are allowed in an individual sample from the agent per node.
                dragent.metricCount.limit.prometheus | The maximum number of unique Prometheus timeseries that are allowed in an individual sample from the agent per node.
