Enable Prometheus Native Service Discovery

Prometheus service discovery is a standard method of finding endpoints to scrape for metrics. You configure prometheus.yaml and its custom jobs to scrape endpoints in the same way you would for native Prometheus.

A lightweight Prometheus server named promscrape is embedded directly in the Sysdig agent to collect metrics. Promscrape supports filtering and relabeling targets, instances, and jobs, and identifies them by using the custom jobs configured in the prometheus.yaml file. By default, recent versions of the Sysdig agent (above v12.0.0) identify the processes that expose Prometheus metric endpoints on their own host and send the metrics to the Sysdig collector for storage and further processing. On older versions of the Sysdig agent, you enable these features by configuring dragent.yaml.

Working with Promscrape

Promscrape is a lightweight Prometheus server embedded in the Sysdig agent. Promscrape scrapes metrics from Prometheus endpoints and forwards them to the agent, which sends them to the Sysdig collector for storage and processing.

Promscrape has two versions: Promscrape V1 and Promscrape V2.

  • Promscrape V2

    Promscrape itself discovers targets by using the standard Prometheus configuration (native Prometheus service discovery), which lets you use relabel_configs to find or modify targets. An instance of promscrape runs on every node that runs a Sysdig agent and collects metrics from the local and remote targets specified in the prometheus.yaml file. The prometheus.yaml file you create is shared across all such nodes.

    Promscrape V2 is enabled by default on Sysdig agent v12.5.0 and above. On older versions of Sysdig agent, you need to manually enable Promscrape V2, which allows for native Prometheus service discovery, by setting the prom_service_discovery parameter to true in dragent.yaml.

  • Promscrape V1

    Sysdig agent discovers scrape targets through the Sysdig process_filter rules. For more information, see Process Filter.

About Promscrape V2

Supported Features

Promscrape V2 supports the following native Prometheus capabilities:

  • Relabeling: Promscrape V2 supports the native Prometheus relabel_configs and metric_relabel_configs settings. Relabel configuration enables you to:

    • Drop unnecessary metrics or unwanted labels from metrics

    • Edit the labels of a target before the target is scraped

  • Sample format: In addition to the regular sample format (metric name, labels, and metric value), Promscrape V2 attaches the metric type (counter, gauge, histogram, summary) to every sample sent to the agent.

  • Scraping configuration: Promscrape V2 supports all types of scraping configuration, such as federation, blackbox-exporter, and so on.

  • Label mapping: Metrics can be mapped to their source (pod, process) by using source labels, which map certain Prometheus label names to known agent tags.
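The sample format point can be illustrated with the Prometheus text exposition format, where a # TYPE line declares the type of a metric family. The following Python sketch attaches that type to each sample, including the _bucket, _sum, and _count series of a histogram or summary. It is a simplified parser (no escaped label values, timestamps, or exemplars), the function name typed_samples is hypothetical, and it does not reproduce promscrape's internal implementation.

```python
import re

# Metric name, optional {label set}, then the sample value.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")
# Series suffixes that belong to a histogram or summary family.
FAMILY_SUFFIXES = ("_bucket", "_sum", "_count")


def typed_samples(text):
    """Yield (name, labels, value, type) for each sample; type defaults to "untyped"."""
    types = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# TYPE "):
            _, _, family, kind = line.split(maxsplit=3)
            types[family] = kind
            continue
        if not line or line.startswith("#"):
            continue  # blank lines, # HELP lines, and other comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name = match.group(1)
        family = name
        if name not in types:
            # Map e.g. latency_seconds_bucket back to its latency_seconds family.
            for suffix in FAMILY_SUFFIXES:
                base = name[: -len(suffix)]
                if name.endswith(suffix) and base in types:
                    family = base
                    break
        yield name, match.group(2) or "", float(match.group(3)), types.get(family, "untyped")


exposition = """\
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
# TYPE latency_seconds histogram
latency_seconds_bucket{le="0.1"} 3
latency_seconds_sum 0.42
latency_seconds_count 5
temperature_celsius 21.5
"""

for sample in typed_samples(exposition):
    print(sample)
```

The last sample has no # TYPE line, so the sketch reports it as untyped, which is the Prometheus convention for metrics without type metadata.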

Unsupported Features

  • Promscrape V2 does not support calculated metrics.

  • Promscrape V2 does not support cluster-wide features such as recording rules and alert management.

  • Service discovery configurations in Promscrape V1 (process_filter) and Promscrape V2 (prometheus.yaml) are incompatible and non-translatable.

  • Because an instance of promscrape runs on every node and the same prometheus.yaml file is shared across all nodes, every promscrape instance scrapes each remote target specified in that file. Do not configure promscrape to scrape remote targets; otherwise, the metrics are duplicated.

  • Promscrape V2 does not have a cluster-wide view. Therefore, it ignores the recording rule and alerting configurations used for cluster-wide metrics collection; these Prometheus configurations are not supported.

  • Sysdig uses __HOSTNAME__, which is not a standard Prometheus keyword.

Enable Promscrape V2 on Older Versions of Sysdig Agent

To enable Prometheus native service discovery on agent versions prior to 11.2:

  1. Open the dragent.yaml file.

  2. Set the following Prometheus Service Discovery parameter to true:

    prometheus:
      prom_service_discovery: true
    

    If set to true, Promscrape V2 is used. Otherwise, Promscrape V1 is used to scrape the targets.

  3. Restart the agent.

Create Custom Jobs

Prerequisites

Ensure the following features are enabled:

  • Monitoring Integration
  • Promscrape V2

If you are using Sysdig agent v12.0.0 or above, these features are enabled by default.

Prepare Custom Job

You set up custom jobs in the Prometheus configuration file to identify endpoints that expose Prometheus metrics. Sysdig agent uses these custom jobs to scrape endpoints by using promscrape, the lightweight Prometheus server embedded in it.

Guidelines

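  • Ensure that each target is scraped only by the agent running on the same node as the target. To do this, add the host selection relabeling rule, which keeps only the targets whose __meta_kubernetes_pod_host_ip matches __HOSTIPS__.

  • Use the Sysdig-specific relabeling rules, which copy the pod UID and container name into the sysdig_k8s_pod_uid and sysdig_k8s_pod_container_name labels, so that the correct workload labels are applied automatically.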
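The first guideline matters because every agent reads the same prometheus.yaml, and Kubernetes pod discovery returns pods from the whole cluster by default. The following Python sketch models a hypothetical three-node cluster and counts how many agents would scrape each pod with and without the host selection rule. The node IPs and pod names are made up for illustration.

```python
from collections import Counter

# Hypothetical cluster: one Sysdig agent per node, identified by the node IP.
NODE_IPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
# Hypothetical pods, mapped to the host IP of the node that runs them
# (the value of __meta_kubernetes_pod_host_ip).
POD_HOST_IPS = {"api-1": "10.0.0.1", "api-2": "10.0.0.2", "db-1": "10.0.0.3"}


def scrapes_per_pod(host_selection: bool) -> Counter:
    """Count how many agents scrape each pod when all agents share one prometheus.yaml."""
    counts = Counter()
    for agent_ip in NODE_IPS:
        # Every agent discovers every pod in the cluster.
        for pod, host_ip in POD_HOST_IPS.items():
            # Models the keep rule that matches the pod host IP against __HOSTIPS__.
            if host_selection and host_ip != agent_ip:
                continue
            counts[pod] += 1
    return counts


print(scrapes_per_pod(host_selection=False))  # each pod is scraped 3 times
print(scrapes_per_pod(host_selection=True))   # each pod is scraped once
```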

Example Prometheus Configuration file

The prometheus.yaml file comes with a default configuration for scraping the pods running on the local node. This configuration also includes the rules to preserve pod UID and container name labels for further correlation with Kubernetes State Metrics or Sysdig native metrics.

Here is an example prometheus.yaml file that you can use to set up custom jobs.

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'my_pod_job'
  sample_limit: 40000
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Look for pod name starting with "my_pod_prefix" in namespace "my_namespace"
  - action: keep
    source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_name]
    separator: /
    regex: my_namespace/my_pod_prefix.+

    # In those pods try to scrape from port 9876
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9876

    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__

    # Target only running pods
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running

    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
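To see what these rules do to a single discovered target, the following Python sketch applies a simplified version of Prometheus relabeling (only the keep, drop, and replace actions, with fully anchored regexes and ";" as the default separator) to a hypothetical pod. It makes two assumptions: __HOSTIPS__ is stood in for by a regex matching the single IP 10.0.0.5, and the port rewrite uses the pattern ([^:]+)(?::\d+)?, which strips any existing port before 9876 is appended. The relabel function is illustrative and is not promscrape's implementation.

```python
import re


def relabel(labels, rules):
    """Apply keep/drop/replace relabel rules; return the new labels, or None if dropped."""
    labels = dict(labels)
    for rule in rules:
        separator = rule.get("separator", ";")
        value = separator.join(labels.get(n, "") for n in rule.get("source_labels", []))
        # Prometheus anchors relabel regexes at both ends, like re.fullmatch.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep" and match is None:
            return None
        if action == "drop" and match is not None:
            return None
        if action == "replace" and match is not None:
            # Translate Prometheus "$1" / "${1}" references into Python's "\g<1>".
            template = re.sub(r"\$\{?(\w+)\}?", r"\\g<\1>", rule.get("replacement", "$1"))
            result = match.expand(template)
            if result:
                labels[rule["target_label"]] = result
            else:
                labels.pop(rule["target_label"], None)  # an empty result removes the label
    return labels


HOSTIPS = r"10\.0\.0\.5"  # assumed expansion of __HOSTIPS__ on this node

MY_POD_JOB = [
    {"action": "keep", "source_labels": ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"],
     "separator": "/", "regex": r"my_namespace/my_pod_prefix.+"},
    {"action": "replace", "source_labels": ["__address__"], "target_label": "__address__",
     "regex": r"([^:]+)(?::\d+)?", "replacement": "$1:9876"},
    {"action": "keep", "source_labels": ["__meta_kubernetes_pod_host_ip"], "regex": HOSTIPS},
    {"action": "keep", "source_labels": ["__meta_kubernetes_pod_phase"], "regex": "Running"},
    {"action": "replace", "source_labels": ["__meta_kubernetes_pod_uid"],
     "target_label": "sysdig_k8s_pod_uid"},
    {"action": "replace", "source_labels": ["__meta_kubernetes_pod_container_name"],
     "target_label": "sysdig_k8s_pod_container_name"},
]

# Hypothetical target as produced by kubernetes_sd_configs with role: pod.
target = {
    "__address__": "10.244.1.7:8080",
    "__meta_kubernetes_namespace": "my_namespace",
    "__meta_kubernetes_pod_name": "my_pod_prefix-7d9f",
    "__meta_kubernetes_pod_host_ip": "10.0.0.5",
    "__meta_kubernetes_pod_phase": "Running",
    "__meta_kubernetes_pod_uid": "0b1c2d3e",
    "__meta_kubernetes_pod_container_name": "app",
}

result = relabel(target, MY_POD_JOB)
print(result["__address__"])  # 10.244.1.7:9876
print(relabel({**target, "__meta_kubernetes_pod_host_ip": "10.0.0.6"}, MY_POD_JOB))  # None
```

The second call returns None because the pod runs on a different host, so the host selection keep rule drops it before it is scraped.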

Default Scrape Job

If Monitoring Integration is not enabled for you and you still want to automatically collect metrics from pods with the Prometheus annotations set (prometheus.io/scrape=true), add the following default scrape job to your prometheus.yaml file:

- job_name: 'k8s-pods'
  sample_limit: 40000
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    target_label: __metrics_path__
    regex: (.+)
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__

    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
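The scrape, scheme, path, and address rules above turn the prometheus.io/* pod annotations into the URL that is scraped. The following Python sketch reproduces only those rules (not the host IP or pod phase rules) for a hypothetical pod. It assumes the standard Prometheus defaults of http for __scheme__ and /metrics for __metrics_path__ when the job does not override them; scrape_url is a hypothetical helper name.

```python
import re

ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_"  # prefix for prometheus.io/* annotations


def scrape_url(labels):
    """Return the URL implied by the annotation rules, or None if the pod is not kept."""
    if not re.fullmatch(r"true", labels.get(ANNOTATION + "scrape", "")):
        return None  # keep rule: prometheus.io/scrape must be "true"

    scheme = "http"  # Prometheus default __scheme__
    m = re.fullmatch(r"(https?)", labels.get(ANNOTATION + "scheme", ""))
    if m:
        scheme = m.group(1)

    path = "/metrics"  # Prometheus default __metrics_path__
    m = re.fullmatch(r"(.+)", labels.get(ANNOTATION + "path", ""))
    if m:
        path = m.group(1)

    # __address__ and the port annotation are joined with ";" before matching.
    address = labels["__address__"]
    m = re.fullmatch(r"([^:]+)(?::\d+)?;(\d+)", address + ";" + labels.get(ANNOTATION + "port", ""))
    if m:
        address = f"{m.group(1)}:{m.group(2)}"

    return f"{scheme}://{address}{path}"


pod = {
    "__address__": "10.244.1.7:8080",
    ANNOTATION + "scrape": "true",
    ANNOTATION + "port": "9102",
}
print(scrape_url(pod))  # http://10.244.1.7:9102/metrics
secure = {**pod, ANNOTATION + "scheme": "https", ANNOTATION + "path": "/stats"}
print(scrape_url(secure))  # https://10.244.1.7:9102/stats
```

If a pod has no prometheus.io/port annotation, the joined value ends with ";" and the address regex does not match, so the discovered address and port are left unchanged.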

Default Prometheus Configuration File

Here is the default prometheus.yaml file.

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'k8s-pods'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  # Drop for nginx-ingress
  - action: drop
    source_labels:
    - __meta_kubernetes_pod_container_name
    regex: 'controller|nginx-ingress-controller'
  # Drop for ceph
  - action: drop
    source_labels: [__meta_kubernetes_pod_container_name, __meta_kubernetes_pod_annotation_prometheus_io_port]
    separator: ;
    regex: (chown-container-data-dir|log-collector|mgr|watch-active);9283
  # Drop for consul
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: '20200|8500'
  # Drop for istio
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotationpresent_sidecar_istio_io_status]
    target_label: __tmp_pod_has_istio
    regex: 'true'
    replacement: 'true'
  - action: drop
    source_labels: [__tmp_pod_has_istio, __meta_kubernetes_pod_container_name]
    regex: 'true;(discovery)|(.{0}$)'
  # Drop for opa
  - action: drop
    source_labels:        
    - __meta_kubernetes_pod_container_name
    regex: 'manager.*'
  # Drop for kafka
  - action: drop
    source_labels:
    - __meta_kubernetes_pod_container_name
    regex: 'kafka-jmx-exporter|kafka-exporter'
  # Drop for keda
  - action: drop
    source_labels:
    - __meta_kubernetes_pod_container_name
    regex: 'keda-operator-metrics-apiserver'
  # Drop for ntp
  - action: drop
    source_labels:
    - __meta_kubernetes_pod_container_name
    regex: 'ntp-exporter'
  # Drop for openshift-state-metrics
  - action: drop
    source_labels: [__meta_kubernetes_pod_container_name]
    regex: 'openshift-state-metrics'
  # Drop for portworx
  - action: drop
    source_labels: [__meta_kubernetes_pod_container_name]
    regex: 'portworx'
  # Drop for rabbitmq
  - action: drop
    source_labels: [__meta_kubernetes_pod_container_name]
    regex: '(rabbitmq|prepare-plugins-dir)'
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: '15691'
  # Drop for Sysdig Admission Controller
  - action: drop
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    regex: admission-controller;(8080|5000)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    target_label: __metrics_path__
    regex: (.+)
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
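Most of the application-specific rules in this file are drop rules, and they rely on two Prometheus relabeling details: the regex must match the whole joined value, and multiple source labels are joined with the separator (";" by default) before matching. The following Python sketch evaluates a subset of those drop rules, copied from the configuration above, against hypothetical pods; dropped_by_default is a hypothetical helper name.

```python
import re

# A subset of the drop rules from the default prometheus.yaml, as
# (source_labels, regex) pairs; the default ";" separator is used throughout.
CONTAINER = "__meta_kubernetes_pod_container_name"
PORT = "__meta_kubernetes_pod_annotation_prometheus_io_port"
DROP_RULES = [
    ([CONTAINER], r"controller|nginx-ingress-controller"),  # nginx-ingress
    ([CONTAINER, PORT], r"(chown-container-data-dir|log-collector|mgr|watch-active);9283"),  # ceph
    ([PORT], r"20200|8500"),  # consul
    ([CONTAINER, PORT], r"admission-controller;(8080|5000)"),  # Sysdig Admission Controller
]


def dropped_by_default(labels):
    """Return True if any of the listed drop rules matches the target."""
    for source_labels, regex in DROP_RULES:
        value = ";".join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):  # fully anchored, as in Prometheus
            return True
    return False


print(dropped_by_default({CONTAINER: "mgr", PORT: "9283"}))                 # True (ceph)
print(dropped_by_default({CONTAINER: "mgr", PORT: "9090"}))                 # False
print(dropped_by_default({CONTAINER: "ingress-controller", PORT: "10254"}))  # False
```

The last call shows why anchoring matters: an unanchored re.search would find "controller" inside "ingress-controller", but the anchored match drops only containers named exactly controller or nginx-ingress-controller.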

Understand the Prometheus Settings

Scrape Interval

The default scrape interval is 10 seconds, and you can override it for each scraping job. The scrape interval configured in prometheus.yaml is independent of the agent configuration.

Promscrape V2 reads prometheus.yaml and initiates scraping jobs.

The metrics from each target are collected at that target's scrape interval and immediately forwarded to the agent. The agent sends metrics to the Sysdig collector every 10 seconds, and each transmission includes only the metrics received since the previous transmission. Therefore, if a job's scrape interval is longer than 10 seconds, some agent transmissions might not include metrics from that job.
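The effect of a longer scrape interval can be seen with a simplified timing model in Python. It assumes that scrapes start at t=0 and complete instantly, and that each agent transmission at time t carries the samples received during the preceding 10 seconds. Real scrape and transmission times are not aligned like this, so treat it only as an illustration; flushes_with_samples is a hypothetical name.

```python
FLUSH_INTERVAL = 10  # seconds between agent transmissions to the collector


def flushes_with_samples(scrape_interval, duration):
    """Return the transmission times (s) that include at least one sample from the job.

    Simplified model: scrapes happen at 0, scrape_interval, 2 * scrape_interval, ...
    and the transmission at time t carries samples received in [t - FLUSH_INTERVAL, t).
    """
    scrapes = range(0, duration, scrape_interval)
    flushes = range(FLUSH_INTERVAL, duration + FLUSH_INTERVAL, FLUSH_INTERVAL)
    return [t for t in flushes if any(t - FLUSH_INTERVAL <= s < t for s in scrapes)]


print(flushes_with_samples(10, 60))  # [10, 20, 30, 40, 50, 60]
print(flushes_with_samples(30, 60))  # [10, 40]
```

With a 30-second scrape interval, only one transmission in three carries new samples from the job in this model.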

Hostname Selection

__HOSTIPS__ is replaced with a regex list of the IP addresses of all the active network interfaces on the host. Selecting targets by host IP address is preferred because it is the more reliable method.
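A minimal Python sketch of how an expanded __HOSTIPS__ value could select local targets. The exact regex that promscrape generates is not described here, so this assumes an alternation of escaped IP addresses; the interface addresses and the helper names hostips_regex and is_local are hypothetical.

```python
import re


def hostips_regex(ips):
    """Hypothetical model of the __HOSTIPS__ expansion: an alternation of the host's IPs.

    re.escape keeps the dots literal, so 10.0.0.5 cannot match 10x0x0x5.
    """
    return "|".join(re.escape(ip) for ip in ips)


# Assumed addresses of the active network interfaces on this host.
HOST_REGEX = hostips_regex(["10.0.0.5", "172.17.0.1", "127.0.0.1"])


def is_local(pod_host_ip):
    """Mimic the keep rule on __meta_kubernetes_pod_host_ip (fully anchored match)."""
    return re.fullmatch(HOST_REGEX, pod_host_ip) is not None


print(is_local("10.0.0.5"))   # True
print(is_local("10.0.0.50"))  # False: anchoring rules out a prefix match
```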

__HOSTNAME__ is replaced with the actual hostname before promscrape starts scraping the targets. This allows promscrape to ignore targets running on other hosts.
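The hostname substitution can be sketched in Python as well, under assumptions this page does not spell out: that __HOSTNAME__ is used in a keep rule on __meta_kubernetes_pod_node_name with a trailing .* wildcard (useful where, as on AWS, the node name is the FQDN while the hostname is the short name), and that the substituted hostname is escaped. The helper names are hypothetical. The last call shows one possible weakness of name-based matching, which illustrates why selection by IP address can be the more reliable choice.

```python
import re


def expand_hostname(regex, hostname):
    """Hypothetical substitution of the __HOSTNAME__ placeholder in a relabel regex."""
    return regex.replace("__HOSTNAME__", re.escape(hostname))


# Assumed keep rule regex applied to __meta_kubernetes_pod_node_name.
NODE_REGEX = expand_hostname("__HOSTNAME__.*", "ip-10-0-0-5")


def on_this_node(node_name):
    """Fully anchored match of a pod's node name against the expanded regex."""
    return re.fullmatch(NODE_REGEX, node_name) is not None


print(on_this_node("ip-10-0-0-5.ec2.internal"))   # True: FQDN node name matches the short hostname
print(on_this_node("ip-10-0-0-6.ec2.internal"))   # False
print(on_this_node("ip-10-0-0-50.ec2.internal"))  # True: a different node also matches the prefix
```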

Relabeling Configuration

The default Prometheus configuration file contains the following two relabeling configurations:

- action: replace
  source_labels: [__meta_kubernetes_pod_uid]
  target_label: sysdig_k8s_pod_uid
- action: replace
  source_labels: [__meta_kubernetes_pod_container_name]
  target_label: sysdig_k8s_pod_container_name

These rules add two labels, sysdig_k8s_pod_uid and sysdig_k8s_pod_container_name, to every metric gathered from the local targets; they contain the pod UID and the container name, respectively. These labels are dropped from the metrics before the metrics are sent to the Sysdig collector for further processing.
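A minimal Python sketch of the flow this paragraph describes: the two labels travel with each sample so that it can be associated with its pod and container, and they are removed from the label set that is forwarded. The helper name split_association_labels and the sample values are hypothetical; this models the described behavior, not the agent's code.

```python
SYSDIG_LABELS = ("sysdig_k8s_pod_uid", "sysdig_k8s_pod_container_name")


def split_association_labels(labels):
    """Split a sample's labels into (association labels, labels that are sent on)."""
    association = {k: v for k, v in labels.items() if k in SYSDIG_LABELS}
    remaining = {k: v for k, v in labels.items() if k not in SYSDIG_LABELS}
    return association, remaining


sample_labels = {
    "__name__": "http_requests_total",
    "code": "200",
    "sysdig_k8s_pod_uid": "0b1c2d3e",
    "sysdig_k8s_pod_container_name": "app",
}
association, remaining = split_association_labels(sample_labels)
print(association)  # {'sysdig_k8s_pod_uid': '0b1c2d3e', 'sysdig_k8s_pod_container_name': 'app'}
print(remaining)    # {'__name__': 'http_requests_total', 'code': '200'}
```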

Configure Prometheus Configuration File Using the Agent Configmap

Here is an example for setting up the prometheus.yaml file using the agent configmap:

apiVersion: v1
data:
  dragent.yaml: |
    new_k8s: true
    k8s_cluster_name: your-cluster-name
    metrics_excess_log: true
    10s_flush_enable: true
    app_checks_enabled: false
    use_promscrape: true
    promscrape_fastproto: true
    prometheus:
      enabled: true
      prom_service_discovery: true
      log_errors: true
      max_metrics: 200000
      max_metrics_per_process: 200000
      max_tags_per_metric: 100
      ingest_raw: true
      ingest_calculated: false
    snaplen: 512
    tags: role:cluster    
  prometheus.yaml: |
    global:
      scrape_interval: 10s
    scrape_configs:
    - job_name: 'haproxy-router'
      basic_auth:
        username: USER
        password: PASSWORD
      tls_config:
        insecure_skip_verify: true
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
        # Trying to ensure we only scrape local targets
        # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
        # of all the active network interfaces on the host
      - action: keep
        source_labels: [__meta_kubernetes_pod_host_ip]
        regex: __HOSTIPS__
      - source_labels: [__meta_kubernetes_pod_phase]
        action: keep
        regex: Running
      - action: keep
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_pod_name
        separator: '/'
        regex: 'default/router-1-.+'
        # Holding on to pod-id and container name so we can associate the metrics
        # with the container (and cluster hierarchy)
      - action: replace
        source_labels: [__meta_kubernetes_pod_uid]
        target_label: sysdig_k8s_pod_uid
      - action: replace
        source_labels: [__meta_kubernetes_pod_container_name]
        target_label: sysdig_k8s_pod_container_name    

kind: ConfigMap
metadata:
    labels:
      app: sysdig-agent
    name: sysdig-agent
    namespace: sysdig-agent