Sysdig Documentation

Enforcing Limit on Prometheus Metric Collection

Sysdig enforces a limit on the number of Prometheus metrics that are processed and stored. As a result, not all time series data appears in the Sysdig Monitor UI: time series that exceed the limit are discarded after the Agent scrapes them.

Imposing a limit helps reduce resource usage, including disk space and the time required for metrics aggregation.

Note

This document uses "metric" and "time series" interchangeably. The descriptions of configuration parameters refer to "metrics", but in strict Prometheus terms they mean time series. That is, a limit of 100 metrics is a limit of 100 time series, and those time series do not necessarily share the same metric name.
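To illustrate the distinction, the following minimal sketch counts time series and distinct metric names in a scrape. The scrape text is a made-up example in the Prometheus text exposition format, and the parser is deliberately simplistic (it assumes no timestamps and no spaces inside label values). It is not how the Agent parses metrics.

```python
# Hypothetical scrape output in the Prometheus text exposition format.
SCRAPE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="200"} 3
http_requests_total{method="get",code="500"} 4
process_open_fds 12
"""


def series_ids(text):
    """Return one identifier (metric name plus label set) per sample line."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Everything before the last space is the series identifier.
        # Assumes no trailing timestamp and no spaces inside label values.
        ids.append(line.rsplit(" ", 1)[0])
    return ids


series = series_ids(SCRAPE)
metric_names = {s.split("{", 1)[0] for s in series}

print(len(series))        # time series: 4
print(len(metric_names))  # distinct metric names: 2
```

In this example, a limit expressed as "4 metrics" would be consumed entirely by these four time series, even though only two metric names are involved.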

Configuring the Agent

The Agent imposes a limit on the number of metrics that it reads from a Prometheus metric endpoint and transmits to the backend components. The limit is controlled in dragent.yaml. The relevant settings and their defaults are:

prometheus:
  max_tags_per_metric: 20
  max_metrics_per_process: 1000
  max_metrics: 1000

The max_metrics and max_metrics_per_process limits can be pushed from the backend components to the Agent via auto-configuration. See Configuring the Backend Components.

Tweaking Agent ingestion limits may require a slight change in backend write parameters to preserve optimal read performance.

Configuring the Backend Components

The backend limits are applied during protobuf ingestion on collectors and during protobuf aggregation on workers. The limits are controlled by the plan settings configuration (managed via SDC admin or POST API requests), which has the following schema. The values shown are the defaults for the "Pro" plan:

{
	"plan": {
		"metricsSettings": {
			"enforce": false,
			"limits": {
				"prometheus": 500,
				"prometheusPerProcess": 500,
				"progAggregationCount": 12,
				"appCheckAggregationCount": 12,
				"promMetricsWeight": 0.0,
				...
			},
			...
		},
		...
	}
}

Depending on the state of the enforce flag, the enforced defaults are as follows (as of Sep 2019):

Enforce is Enabled

Table 2. Enforce=True

Plan                        prometheus  prometheusPerProcess  progAggregationCount  appCheckAggregationCount  promMetricsWeight
Basic                       200         200                   12                    12                        0.0
Pro and On-Prem             500         500                   12                    12                        0.0
ProCustom, OnpremiseCustom  1000        1000                  12                    12                        0.0

When enforce is set to true, the following limits are pushed to the Agent, taking precedence over any manually configured Agent limits:

  • prometheus to prometheus.max_metrics

  • prometheusPerProcess to prometheus.max_metrics_per_process

Enforce is Disabled

  • prometheus: 20000

  • prometheusPerProcess: 20000

  • progAggregationCount: 12

  • appCheckAggregationCount: 12

  • promMetricsWeight: 0.0
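The defaults listed above can be summarized as a small lookup. This is an illustrative sketch only: the function names and the plan keys are hypothetical, the values are copied from this document, and the sketch assumes that nothing is pushed to the Agent when enforce is false, which the text implies but does not state.

```python
# Values copied from this document (enforced defaults as of Sep 2019).
COMMON = {
    "progAggregationCount": 12,
    "appCheckAggregationCount": 12,
    "promMetricsWeight": 0.0,
}

# Hypothetical plan keys; the real plan identifiers may differ.
ENFORCED_PROMETHEUS_LIMIT = {
    "Basic": 200,
    "Pro": 500,
    "On-Prem": 500,
    "ProCustom": 1000,
    "OnpremiseCustom": 1000,
}
NOT_ENFORCED_PROMETHEUS_LIMIT = 20000


def effective_limits(plan, enforce):
    """Return the default backend limits for a plan and enforce state."""
    n = ENFORCED_PROMETHEUS_LIMIT[plan] if enforce else NOT_ENFORCED_PROMETHEUS_LIMIT
    return {"prometheus": n, "prometheusPerProcess": n, **COMMON}


def agent_overrides(limits, enforce):
    """Return the dragent.yaml settings that the backend pushes to the Agent.

    Assumption: nothing is pushed when enforce is false.
    """
    if not enforce:
        return {}
    return {
        "prometheus.max_metrics": limits["prometheus"],
        "prometheus.max_metrics_per_process": limits["prometheusPerProcess"],
    }


limits = effective_limits("Basic", enforce=True)
print(agent_overrides(limits, enforce=True))
```

With enforce enabled on the Basic plan, both Agent settings resolve to 200, regardless of the values configured manually in dragent.yaml.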

Techniques to Enforce Backend Limit

The backend limits are applied via two separate techniques:

  • Limit the number of metrics inside one program or process

  • Limit the total number of programs in the protobuf

Limiting the Number of Metrics Inside a Process

During aggregation, the maximum number of metrics recorded for a specific process is defined by the limits.prometheus configuration parameter.

The limiting operation preserves the first N metrics, ordered from the highest sum value to the lowest. Metrics that have the same sum value are ordered by a lexicographic comparison of their names.

  • If this limit is exceeded, metrics remain available at lower samplings but are partially or completely unavailable at larger samplings.

  • Exceeding this limit will not generate any logs.
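The ordering rule described above can be sketched as follows. This is a minimal illustration, not the backend implementation: the function name and sample data are hypothetical, and the sketch assumes that the lexicographic tie-break is ascending.

```python
def limit_metrics(metric_sums, limit):
    """Keep the first `limit` metric names.

    Metrics are ordered by sum value, highest first. Ties are broken by
    metric name (assumed ascending lexicographic order).
    """
    ordered = sorted(metric_sums.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:limit]]


# Hypothetical per-process metric sums.
sums = {
    "queue_depth": 40.0,
    "http_requests": 950.0,
    "cache_hits": 40.0,
    "errors": 2.0,
}

kept = limit_metrics(sums, 3)
print(kept)  # ['http_requests', 'cache_hits', 'queue_depth']
```

Here cache_hits and queue_depth share the same sum, so the name comparison decides their order, and errors is the metric that is dropped.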

Limiting the Number of Programs

During aggregation, storing metrics for all programs at larger sampling sizes might not be possible. It is more useful, however, to store metrics for active programs with many custom metrics than for less active programs with few custom metrics.

An ordering and limiting technique is applied to accomplish this, where a predefined ordered list of "limiters" is used. Each limiter applies a different type of measurement. The limiters are:

  • CPU usage: Takes the top limits.progAggregationCount programs with the highest sum of CPU usage.

  • Memory usage: Takes the top limits.progAggregationCount programs with the highest sum of memory usage.

  • File IO transport: Takes the top limits.progAggregationCount programs with the highest sum of file IO transport.

  • Network IO transport: Takes the top limits.progAggregationCount programs with the highest sum of network IO transport.

  • AppCheck/Prometheus metric count: Takes the top limits.appCheckAggregationCount programs with the highest count of appCheck and Prometheus metrics.

Each limiter sorts the list of programs and preserves the top programs for aggregation. The remaining programs are passed to the next limiter. This process continues until the program list is exhausted, or all the limiters are applied. Any remaining programs and their metrics won't have their data recorded in the aggregation.
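The chain of limiters can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, measurement keys, and sample programs are hypothetical, each limiter keeps only one program to keep the example short, and ties are broken by program name, which the text does not specify.

```python
def apply_limiters(programs, limiters):
    """Run an ordered list of limiters over the programs.

    programs: dict mapping program name to a dict of measurements.
    limiters: ordered list of (measurement_key, count) pairs.
    Returns (kept, dropped) lists of program names.
    """
    remaining = dict(programs)
    kept = []
    for key, count in limiters:
        if not remaining:
            break  # program list exhausted
        # Highest measurement first; tie-break by name is an assumption.
        ranked = sorted(remaining, key=lambda name: (-remaining[name][key], name))
        for name in ranked[:count]:
            kept.append(name)
            del remaining[name]  # the rest is passed to the next limiter
    return kept, sorted(remaining)


# Hypothetical programs and measurements.
programs = {
    "nginx":    {"cpu": 50, "mem": 10, "file_io": 1,  "net_io": 90, "metric_count": 5},
    "postgres": {"cpu": 30, "mem": 80, "file_io": 70, "net_io": 20, "metric_count": 40},
    "redis":    {"cpu": 10, "mem": 60, "file_io": 5,  "net_io": 30, "metric_count": 12},
    "cron":     {"cpu": 1,  "mem": 2,  "file_io": 3,  "net_io": 0,  "metric_count": 0},
    "sshd":     {"cpu": 2,  "mem": 1,  "file_io": 0,  "net_io": 1,  "metric_count": 0},
    "ntpd":     {"cpu": 0,  "mem": 1,  "file_io": 0,  "net_io": 0,  "metric_count": 0},
}

# Limiters in the order listed above, each keeping one program.
limiters = [("cpu", 1), ("mem", 1), ("file_io", 1), ("net_io", 1), ("metric_count", 1)]

kept, dropped = apply_limiters(programs, limiters)
print(kept)     # ['nginx', 'postgres', 'redis', 'sshd', 'cron']
print(dropped)  # ['ntpd']
```

Note that postgres has the highest file IO, but it was already preserved by the memory limiter, so the file IO limiter selects from the programs that are still remaining and keeps redis instead.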

The AppCheck/Prometheus metric count limiter preserves the first N programs ordered by the combined number of appCheck and Prometheus metrics. If you have both of these in your environment, you can determine which one is more important for retention. The relative order between the two types can be tweaked via the limits.promMetricsWeight parameter:

  • promMetricsWeight = 0.0: Prometheus metric count will not have any impact on ordering logic.

  • promMetricsWeight = 1.0: The Prometheus metric count is included in the ordering logic in the same way as the appCheck metric count.

  • promMetricsWeight > 1.0: The Prometheus metric count is multiplied by the configured weight, so it is valued more than the appCheck metric count in the ordering logic.
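One scoring formula that is consistent with the three cases above is the appCheck metric count plus the weight times the Prometheus metric count. The following sketch uses that formula as an assumption; the document does not state the exact expression used by the backend, and the function names and sample data are hypothetical.

```python
def metric_count_score(app_check_count, prometheus_count, prom_metrics_weight):
    """Assumed score: appCheck count plus weighted Prometheus count."""
    return app_check_count + prom_metrics_weight * prometheus_count


def rank_programs(programs, prom_metrics_weight):
    """Order program names by score, highest first (ties by name, assumed)."""
    return sorted(
        programs,
        key=lambda name: (-metric_count_score(*programs[name], prom_metrics_weight), name),
    )


# Hypothetical programs: (appCheck metric count, Prometheus metric count).
programs = {
    "app_a": (10, 0),
    "app_b": (2, 30),
}

print(rank_programs(programs, 0.0))  # ['app_a', 'app_b']
print(rank_programs(programs, 1.0))  # ['app_b', 'app_a']
```

With a weight of 0.0, only appCheck metrics count, so app_a ranks first. With a weight of 1.0, the 30 Prometheus metrics of app_b count fully and it moves ahead.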

Note

  • If this limit is exceeded, metric values remain available at lower samplings but are partially or completely unavailable at larger samplings.

  • Exceeding this limit will increase the sysdigcloud-backend.limiting_aggregation_programs statsd metric, which records the number of limited programs per account.

  • Exceeding this limit will not generate any logs.