Monitoring Integrations

Integrations for Sysdig Monitor cover a number of platforms, orchestrators, and a wide range of applications, and are designed to extend Monitor capabilities and collect metrics from these systems. Sysdig collects metrics from Prometheus, JMX, StatsD, Kubernetes, and a number of applications to provide a 360-degree view of your infrastructure. Many metrics are collected out of the box; you can also extend the integrations or create custom metrics to receive curated insights into your infrastructure stack.

Key Benefits

  • Collects the richest data set for cloud-native visibility and security.

  • Polls data and auto-discovers context to provide operational and security insights.

  • Simplifies deploying your monitoring integrations by providing guided configuration, a curated list of enterprise-grade images, and integration with CI/CD workflows.

  • Extends the power of Prometheus metrics with additional insight from other metric types and your infrastructure stack.

  • Employs Prometheus alerts and events and provides ready-to-use dashboards for Kubernetes monitoring needs.

  • Exposes application metrics using Java JMX and MBeans monitoring.

Key Integrations

Inbound

  • Monitoring Integrations

    Describes how to configure Monitoring Integration in your infrastructure and receive deeper insight into the health and performance of your services across platforms and the cloud.

  • Prometheus Metrics

    Describes how the Sysdig agent automatically collects metrics from services that expose native Prometheus metrics as well as from applications with Prometheus exporters, how to set up your environment, and how to scrape Prometheus metrics seamlessly.

  • Agent Installation

    Learn how to install Sysdig agents on supported platforms.

  • AWS CloudWatch

    Illustrates how to configure Sysdig to collect various types of CloudWatch metrics.

  • Java Management Extensions (JMX) Metrics

    Describes how to configure your Java virtual machines so Sysdig Agent can collect JMX metrics using the JMX protocol.

  • StatsD Metrics

    Describes how the Sysdig agent collects custom StatsD metrics with an embedded StatsD server.

  • Node.js Metrics

    Illustrates how Sysdig can monitor Node.js applications by linking a library to the Node.js codebase.

  • Monitor Log Files

    Learn how to search for a string by using the chisel script called logwatcher.

  • (legacy) Integrate Applications

    Describes the monitoring capabilities of Sysdig agent with application check scripts or ‘app checks’.

Outbound

  • Notification Channels

    Learn how to add, edit, or delete a variety of notification channel types, and how to disable or delete notifications when they are not needed, for example, during scheduled downtime.

  • S3 Capture Storage

    Learn how to configure Sysdig to use an AWS S3 bucket or custom S3 storage for storing Capture files.

Platform Metrics (IBM)

For Sysdig instances deployed on IBM Cloud Monitoring with Sysdig, an additional form of metrics collection is offered: Platform metrics. When enabled, Platform metrics are reported to Sysdig directly by the IBM Cloud infrastructure rather than being collected by the Sysdig agent.

Platform metrics are metrics that are exposed by enabled services across the IBM Cloud platform. These services make metrics and pre-defined dashboards available by publishing metrics associated with the customer’s space or account. Customers can view these platform metrics alongside the metrics from their applications and other services within IBM Cloud monitoring.

Enable this feature by logging in to the IBM Cloud console and selecting “Enable” for IBM Platform metrics under the Configure your resource section when creating a new IBM Cloud Monitoring with Sysdig instance, as described here.

1 - Configure Monitoring Integrations

Monitoring Integration provides an at-a-glance summary of workloads running in your infrastructure and deeper insight into the health and performance of your services across platforms and the cloud. You can easily identify the workloads in your team scope and the services discovered within each workload (such as etcd), and configure the Prometheus exporter integration to collect and visualize time series metrics. Monitoring Integration also powers the Alerts Library.

The following statuses indicate the state of each service integration:

  • Reporting Metrics: The integration is configured correctly and is reporting metrics.

  • Needs Attention: An integration has stopped working and is no longer reporting metrics or requires some other type of attention.

  • Pending Metrics: An integration has recently been configured and is waiting to receive metrics.

  • Configure Integration: The integration needs to be configured, and therefore no metrics are reported.

Ensure that you meet the prerequisites given in Guidelines for Monitoring Integrations to make the best use of this feature.

Access Monitoring Integrations

  1. Log in to Sysdig Monitor.

  2. Select Integration > Monitoring Integration in the management section of the left-hand sidebar.

    The Integrations page is displayed. Continue with configuring an integration.

Configure an Integration

  1. Locate the service that you want to configure an integration for. To do so, identify the workload and drill down to the grouping where the service is running.

    To locate the service, you can use one of the following:

    • Text search
    • Type filtering
    • Left navigation to filter the workload and then use text search or type filtering
    • Use the Configure Integration option at the top, and locate the service using text search or type filtering
  2. Click Configure Integration.

    1. Click Start Installation.

    2. Review the prerequisites.
    3. Do one of the following:
      • Dry Run: Use the kubectl command to install the service. Follow the on-screen instructions to complete the tasks successfully.
      • Patch: Install directly on your workload. Follow the on-screen instructions to complete the tasks successfully.
      • Manual: Use an exporter and install the service manually. Click Documentation to learn more about the service exporter and how to integrate it with Sysdig Monitor.
  3. Click Validate to validate the installation.

  4. Make sure that the wizard shows the Installation Complete screen.

  5. Click Close to close the window.

Show Unidentified Workloads

The services that Sysdig Monitor cannot discover can technically still be monitored through the Unidentified Workloads option. You can view the workloads with these unidentified services or applications and see their status. To do so, use the Unidentified Workloads slider at the top right corner of the Integration page.

1.1 - Guidelines for Monitoring Integrations

If you are directed to this page from the Sysdig Monitor app, your agent deployment might include a configuration that does either of the following:

  • Prohibits the use of Monitoring Integrations
  • Affects the metrics you are already collecting

Ensure that you meet the prerequisites to successfully use Monitoring Integrations. For technical assistance, contact Sysdig Support.

Prerequisites

  • Upgrade the Sysdig agent to v12.0.0

  • If you have clusters with more than 50 nodes and you don’t have the prom_service_discovery option enabled:

    • Enabling the latest Prometheus features might create an additional connection to the Kubernetes API server from each Sysdig agent in your environment. The surge in agent connections can increase the CPU and memory load in your API servers. Therefore, ensure that your API servers are suitably sized to handle the increased load in large clusters.
    • If you encounter any problems, contact Sysdig Support.
  • Remove the following manual configurations from the dragent.yaml file because they might interfere with those provided by Sysdig (see the sketch after this list):

    • use_promscrape
    • promscrape_fastproto
    • prom_service_discovery
    • prometheus.max_metrics
    • prometheus.ingest_raw
    • prometheus.ingest_calculated
  • The sysdig_sd_configs configuration is no longer supported. Remove the existing prometheus.yaml if it includes the sysdig_sd_configs configuration.
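For reference, this is a minimal sketch of how the settings listed above typically appear in dragent.yaml. The values shown are illustrative only, and the nesting simply follows the dotted names in the list; these are the lines to remove so that the Sysdig-provided defaults apply:

use_promscrape: true          # illustrative value
promscrape_fastproto: true    # illustrative value
prom_service_discovery: true  # illustrative value
prometheus:
  max_metrics: 3000           # illustrative value
  ingest_raw: true            # illustrative value
  ingest_calculated: false    # illustrative value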

If you are not currently using Prometheus metrics in Sysdig Monitor, you can skip the following steps:

  • If you are using a custom Prometheus process_filter in dragent.yaml to trigger scraping, see Migrating from Promscrape V1 to V2.

  • If you are using service annotations or container labels to find scrape targets, you may need to create new scrape_configs in prometheus.yaml, preferably based on Kubernetes pods service discovery (see the sketch below). This configuration can be complicated in certain environments, so we recommend that you contact Sysdig Support for help.
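As a rough sketch of what a pods-based scrape_configs entry can look like, the following job keeps only pods carrying the standard prometheus.io annotations and rewrites the scrape address from them. The job name is hypothetical, and your environment may need different relabeling:

- job_name: k8s-pods-annotated   # hypothetical job name
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true"
    - action: keep
      source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      regex: true
    # Honor a custom metrics path if the pod declares one
    - action: replace
      source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      regex: (.+)
      target_label: __metrics_path__
    # Scrape the port declared in prometheus.io/port
    - action: replace
      source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__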

1.2 - Configure Default Integrations

Each Monitoring Integration holds a specific job that scrapes its metrics and sends them to Sysdig Monitor. To optimize metrics scraping for building dashboards and alerts in Sysdig Monitor, Sysdig offers default jobs for these integrations. Periodically, the Sysdig agent connects to Sysdig Monitor, retrieves the default jobs, and makes the Monitoring Integrations available for use. See the list of the available integrations and corresponding jobs.

You can find all the jobs in the /opt/draios/etc/promscrape.yaml file in the sysdig-agent container in your cluster.

Supported Monitoring Integrations

The following list shows the supported integrations and the corresponding job names in the config file. Whether an integration is enabled by default is noted in its section below.

  • Apache: apache-exporter-default, apache-grok-default
  • Calico: calico-node-default, calico-controller-default
  • Cassandra: cassandra-default
  • Ceph: ceph-default
  • Consul: consul-server-default, consul-envoy-default
  • Elasticsearch: elasticsearch-default
  • Fluentd: fluentd-default
  • HAProxy Ingress: haproxy-default
  • HAProxy Ingress OpenShift: haproxy-router
  • Harbor: harbor-exporter-default, harbor-core-default, harbor-registry-default, harbor-jobservice-default
  • Istio: istiod
  • Kubernetes API server: kubernetes-apiservers-default
  • Kubernetes controller manager: kube-controller-manager-default
  • Kubernetes CoreDNS: kube-dns-default
  • Kubernetes etcd: etcd-default
  • Kubernetes kubelet: k8s-kubelet-default
  • Kubernetes kube-proxy: kubernetes-kube-proxy-default
  • Kubernetes PVC: k8s-pvc-default
  • Kubernetes Scheduler: kube-scheduler-default
  • Kubernetes storage: k8s-storage-default
  • Kafka: kafka-exporter-default, kafka-jmx-default
  • KEDA: keda-default
  • Memcached: memcached-default
  • MongoDB: mongodb-default
  • MySQL: mysql-default
  • NGINX: nginx-default
  • NGINX Ingress: nginx-ingress-default
  • NTP: ntp-default
  • OPA: opa-default
  • OpenShift API-Server: openshift-apiserver-default
  • OpenShift CoreDNS: openshift-dns-default
  • OpenShift Etcd: openshift-etcd-default
  • OpenShift State Metrics: openshift-state-metrics
  • PHP-FPM: php-fpm-default
  • Portworx: portworx-default, portworx-openshift-default
  • PostgreSQL: postgres-default
  • Prometheus Default Job: k8s-pods
  • RabbitMQ: rabbitmq-default
  • Redis: redis-default
  • Sysdig Admission Controller: sysdig-admission-controller-default

Enable and Disable Integrations

Some integrations are disabled by default due to the potential high cardinality of their metrics. To enable them, contact Sysdig Support. The same applies if you want to disable integrations by default in all your clusters.

Customize a Default Job

The default jobs offered by Sysdig for integrations are optimized to scrape the metrics for building dashboards and alerts in Sysdig Monitor. Instead of processing all the metrics available, you can determine which metrics to include or exclude for your requirements. To do so, you can overwrite the default configuration in the prometheus.yaml file. The prometheus.yaml file is located in the sysdig-agent ConfigMap in the sysdig-agent namespace.

You can overwrite the default job for a specific integration by adding a new job to the prometheus.yaml file with the same name as the default job that you want to replace. For example, if you want to create a new job for the Apache integration, create a new job with the name apache-default. The jobs defined by the user take precedence over the default ones.
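As a minimal sketch, an override in prometheus.yaml could look like the following. The target selection and the dropped metric are illustrative only; in practice you would start from the corresponding default job in /opt/draios/etc/promscrape.yaml and adjust it to your needs:

- job_name: apache-default        # same name as the default job being replaced
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Illustrative target selection; copy the relabeling used by the real default job
    - action: keep
      source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type]
      regex: apache
  metric_relabel_configs:
    # Illustrative: exclude a metric that is not needed
    - action: drop
      source_labels: [__name__]
      regex: apache_scoreboard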

See Supported Monitoring Integrations for the complete list of integrations and corresponding job names.

Use Sysdig Annotations in Exporters

Sysdig provides a set of Helm charts that help you configure the exporters for the integrations. For more information on installing Monitoring Integrations, see the Monitoring Integrations option in Sysdig Monitor. Additionally, the Helm charts are publicly available in the Sysdig Helm repository.

If exporters are already installed in your cluster, you can use the standard Prometheus annotations and the Sysdig agent will automatically scrape them.

For example, if you use the annotation given below, the incoming metrics will have the information about the pod that generates the metrics.

spec:
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: '9100'
        prometheus.io/scrape: 'true'

If you use an exporter, the incoming metrics will be associated with the exporter pod, not the application pod. To change this behavior, you can use the Sysdig-provided annotations and configure the exporter on the agent.

Annotate the Exporter

Use the following annotations to configure the exporter:

spec:
  template:
    metadata:
      annotations:
        promcat.sysdig.com/port: '9187'
        promcat.sysdig.com/target_ns: my-namespace
        promcat.sysdig.com/target_workload_type: deployment
        promcat.sysdig.com/target_workload_name: my-workload
        promcat.sysdig.com/integration_type: my-integration
  • port: The port to scrape for metrics on the exporter.
  • target_ns: The namespace of the workload corresponding to the application (not the exporter).
  • target_workload_type: The type of the workload of the application (not the exporter). The possible values are deployment, statefulset, and daemonset.
  • target_workload_name: The name of the workload corresponding to the application (not the exporter).
  • integration_type: The type of the integration. The job created in the Sysdig agent uses this value to find the exporter.

Configure a New Job

Edit the prometheus.yaml file to configure a new job in Sysdig agent. The file is located in the sysdig-agent ConfigMap in the sysdig-agent namespace.

You can use the following example template:

- job_name: my-integration
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - action: keep
      source_labels: [__meta_kubernetes_pod_host_ip]
      regex: __HOSTIPS__
    - action: drop
      source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
      regex: true
    - action: keep
      source_labels:
        - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
      regex: 'my-integration' # Use here the integration type that you defined in your annotations
    - action: replace
      source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
      target_label: kube_namespace_name
    - action: replace
      source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
      target_label: kube_workload_type
    - action: replace
      source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
      target_label: kube_workload_name
    - action: replace
      replacement: true
      target_label: sysdig_omit_source
    - action: replace
      source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    - action: replace
      source_labels: [__meta_kubernetes_pod_uid]
      target_label: sysdig_k8s_pod_uid
    - action: replace
      source_labels: [__meta_kubernetes_pod_container_name]
      target_label: sysdig_k8s_pod_container_name

Exclude a Deployment from Being Scraped

If you want the agent to exclude a deployment from being scraped, use the following annotation:

spec:
  template:
    metadata:
      annotations:
        promcat.sysdig.com/omit: 'true'

2.1 - Apache

This integration is enabled by default.

List of Alerts:

  • [Apache] No Instance Up: No instances up (Prometheus)
  • [Apache] Up Time Less Than One Hour: Instance with UpTime less than one hour (Prometheus)
  • [Apache] Time Since Last OK Request More Than One Hour: Time since last OK request higher than one hour (Prometheus)
  • [Apache] High Error Rate: High error rate (Prometheus)
  • [Apache] High Rate Of Busy Workers In Instance: Low workers in open_slot state (Prometheus)

List of Dashboards:

  • Apache App Overview

List of Metrics:

  • apache_accesses_total
  • apache_connections
  • apache_cpuload
  • apache_duration_ms_total
  • apache_http_last_request_seconds
  • apache_http_response_codes_total
  • apache_scoreboard
  • apache_sent_kilobytes_total
  • apache_up
  • apache_uptime_seconds_total
  • apache_workers

2.2 - Calico

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

  • [Calico-Node] Dataplane Updates Are Failing and Retrying: The update actions for the dataplane are failing and retrying several times (Prometheus)
  • [Calico-Node] IP Set Command Failures: Encountered a number of ipset command failures (Prometheus)
  • [Calico-Node] IP Tables Restore Failures: Encountered a number of iptables restore failures (Prometheus)
  • [Calico-Node] IP Tables Save Failures: Encountered a number of iptables save failures (Prometheus)
  • [Calico-Node] Errors While Logging: Encountered a number of errors while logging (Prometheus)
  • [Calico-Node] Latency Increase in Datastore OnUpdate Call: The duration of datastore OnUpdate calls is increasing (Prometheus)
  • [Calico-Node] Latency Increase in Dataplane Update: Increased response time for dataplane updates (Prometheus)
  • [Calico-Node] Latency Increase in Acquire Iptables Lock: Increased response time for acquiring the iptables lock (Prometheus)
  • [Calico-Node] Latency Increase While Listing All the Interfaces during a Resync: Increased response time for interface listing during a resync (Prometheus)
  • [Calico-Node] Latency Increase in Interface Resync: Increased response time for interface resync (Prometheus)
  • [Calico-Node] Fork/Exec Child Processes Results in High Latency: Increased response time for Fork/Exec child processes (Prometheus)

List of Dashboards:

  • Calico

List of Metrics:

  • felix_calc_graph_update_time_seconds
  • felix_cluster_num_hosts
  • felix_cluster_num_policies
  • felix_cluster_num_profiles
  • felix_exec_time_micros
  • felix_int_dataplane_addr_msg_batch_size
  • felix_int_dataplane_apply_time_seconds
  • felix_int_dataplane_failures
  • felix_int_dataplane_iface_msg_batch_size
  • felix_int_dataplane_msg_batch_size
  • felix_ipset_calls
  • felix_ipset_errors
  • felix_ipset_lines_executed
  • felix_iptables_lines_executed
  • felix_iptables_lock_acquire_secs
  • felix_iptables_restore_calls
  • felix_iptables_restore_errors
  • felix_iptables_save_calls
  • felix_iptables_save_errors
  • felix_log_errors
  • felix_route_table_list_seconds
  • felix_route_table_per_iface_sync_seconds
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds

Monitoring and Troubleshooting Calico

Here are some interesting metrics and queries to monitor and troubleshoot Calico.

About the Calico User

Hosts

A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.

  • A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.

  • A host endpoint for one specific real interface is configured by interfaceName: , for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.

Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.
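As an illustration of the schema described above (the name, node, IP, labels, and profile below are hypothetical), a host endpoint for a specific interface looks roughly like this:

apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-eth0              # hypothetical host endpoint name
  labels:
    environment: production     # labels that Calico policy selectors can match
spec:
  node: node1                   # hypothetical node name
  interfaceName: eth0           # or leave empty and list one of the interface IPs in expectedIPs
  expectedIPs:
    - 192.0.2.10                # illustrative IP from the documentation range
  profiles:
    - my-host-profile           # hypothetical profile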

Profiles

Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.

Each Calico endpoint or host endpoint can be assigned to zero or more profiles.

Policies

If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.

Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift left DevOps environment.
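For instance, a minimal Kubernetes NetworkPolicy (the names and labels here are hypothetical) that only allows ingress to a backend from pods labeled as its frontend looks like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend          # hypothetical policy name
  namespace: my-app             # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: backend              # pods the policy applies to
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080            # illustrative application port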

Errors

Dataplane Updates Failures and Retries

The dataplane is the foundation of Calico's work. Calico offers three dataplane types (Linux eBPF, Standard Linux, and Windows HNS). The dataplane is responsible for Calico's most important capabilities: base networking, network policy, and IP address management. Being aware of possible dataplane errors is therefore a keystone of Calico monitoring.

rate(felix_int_dataplane_failures[5m])

Ipset Command Failures

IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.

For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.

For that reason, we need to monitor failures of this kind of command in Calico.

rate(felix_ipset_errors[5m])

Iptables Save Failures and Iptables Restore Failures

The actual iptables rules are created and customized on the command line with the command iptables for IPv4 and ip6tables for IPv6.

These can be saved in a file with the command iptables-save for IPv4.

Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables

These files can be loaded again with the command iptables-restore for IPv4.

Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables

This is basically the main purpose of Calico, so monitoring failures of these features is very important.

rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])

Latency

The most useful way to report on latency is to alert on quantiles.

Calico metrics do not provide histogram buckets; instead, they summarize that information with specific labels. For latency metrics, Calico provides the quantile labels 0.5, 0.9, and 0.99.

Latency in Datastore OnUpdate Call

# Latency of datastore OnUpdate calls (calculation graph update time)
felix_calc_graph_update_time_seconds{quantile="0.99"}

# Latency applying dataplane updates
felix_int_dataplane_apply_time_seconds{quantile="0.99"}

# Latency acquiring the iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}

Saturation

The way to monitor saturation in Calico is through batch sizes. Here we can analyze three kinds of batches, also broken down by quantiles.

# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}

# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}

# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}

Traffic

Traffic is one of the four golden signals we have to monitor. In the case of Calico, we need to monitor the most core network requests: ipset and iptables commands are the lowest-level interactions in Calico, and Calico generates that traffic whenever it creates, destroys, or updates a network policy.

# Number of ipset commands executed.
rate(felix_ipset_calls[5m])

# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])

# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])

# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])

# Number of iptables-save calls.
rate(felix_iptables_save_calls[$__interval])

2.3 - Cassandra

This integration is enabled by default.

List of Alerts:

  • [Cassandra] Compaction Task Pending: There are many Cassandra compaction tasks pending (Prometheus)
  • [Cassandra] Commitlog Pending Tasks: There are many Cassandra Commitlog tasks pending (Prometheus)
  • [Cassandra] Compaction Executor Blocked Tasks: There are many Cassandra compaction executor blocked tasks (Prometheus)
  • [Cassandra] Flush Writer Blocked Tasks: There are many Cassandra flush writer blocked tasks (Prometheus)
  • [Cassandra] Storage Exceptions: There are storage exceptions in the Cassandra node (Prometheus)
  • [Cassandra] High Tombstones Scanned: There is a high number of tombstones scanned (Prometheus)
  • [Cassandra] JVM Heap Memory: High JVM Heap Memory (Prometheus)

List of Dashboards:

  • Cassandra

List of Metrics:

  • cassandra_bufferpool_misses_total
  • cassandra_bufferpool_size_total
  • cassandra_client_connected_clients
  • cassandra_client_request_read_latency
  • cassandra_client_request_read_timeouts
  • cassandra_client_request_read_unavailables
  • cassandra_client_request_write_latency
  • cassandra_client_request_write_timeouts
  • cassandra_client_request_write_unavailables
  • cassandra_commitlog_completed_tasks
  • cassandra_commitlog_pending_tasks
  • cassandra_commitlog_total_size
  • cassandra_compaction_compacted_bytes_total
  • cassandra_compaction_completed_tasks
  • cassandra_compaction_pending_tasks
  • cassandra_cql_prepared_statements_executed_total
  • cassandra_cql_regular_statements_executed_total
  • cassandra_dropped_messages_mutation
  • cassandra_dropped_messages_read
  • cassandra_jvm_gc_collection_count
  • cassandra_jvm_gc_duration_seconds
  • cassandra_jvm_memory_usage_max_bytes
  • cassandra_jvm_memory_usage_used_bytes
  • cassandra_storage_internal_exceptions_total
  • cassandra_storage_load_bytes_total
  • cassandra_table_read_requests_per_second
  • cassandra_table_tombstoned_scanned
  • cassandra_table_total_disk_space_used
  • cassandra_table_write_requests_per_second
  • cassandra_threadpool_blocked_tasks_total

Monitoring and troubleshooting Cassandra

Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.

General stats

Node Down:

Let’s get the expected number of nodes and the actual number of nodes up and running. If the numbers are not the same, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0

Dropped Messages

Dropped messages mutation

If there are dropped mutation messages then we probably have write/read failures due to timeouts.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)

Dropped messages read

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)

Buffer Pool

Buffer Pool size

This buffer is allocated off-heap, in addition to the memory allocated for the heap. Memory is allocated when needed. Check whether the miss rate is high.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)

Buffer pool misses

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)

CQL Statements

CQL Prepared statements

Use prepared statements (queries with bound variables), as they are more secure and can be cached.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))

CQL Regular statements

This value should be as low as possible if you are looking for good performance.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))

Connected clients

The number of current client connections in each node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)

Client Request Latency

Write Latency

95th percentile client request write latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})

Read Latency

95th percentile client request read latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})

Unavailable Exceptions

Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.

Read unavailable exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))

Write unavailable exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))

Client Request timeouts

Write / read request timeouts in Cassandra nodes. If there are timeouts, check for:

  1. The read_request_timeout_in_ms value in cassandra.yaml, in case it is too low.
  2. Tombstones, which can degrade performance. You can find the tombstones query below:

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)

Client request read timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))

Client request write timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))

Threadpool blocked tasks

Compaction blocked tasks

Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))

Flush writer blocked tasks

The flush writer defines the number of parallel flush writes to disk. The number of blocked tasks should be near 0. Check that your memtable_flush_writers value matches your number of cores if you are using SSD disks.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))

Compactions

Pending Compactions

Compactions that are queued. This value should be as low as possible. If it reaches more than 50 you can start having CPU and Memory pressure.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)

Total Size compacted

Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))

Commit Log

Commit Log pending tasks

This value should be under 15-20 for performance purposes.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)

Storage

Storage Exceptions

Look carefully at this value as any storage error over 0 is critical for Cassandra.

sum  by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)

JVM and GC

JVM Heap Usage

If you want to tune your Heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})

If you want to know the maximum heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})

JVM NonHeap usage

Use this query for NonHeap memory.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})

GC Info

If there is memory pressure the max GC duration will start increasing.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)

Keyspaces and Tables

Keyspace Size

This query gives you information of all keyspaces.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_table_total_disk_space_used)

Table Size

This query gives you information of all tables.
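A sketch of such a query, consistent with the keyspace query above but additionally grouped by keyspace and table, would be:

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_total_disk_space_used)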

Table highest increase size

Very useful to know what tables are growing too fast.

topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))

Tombstones scanned

Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.

A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.

sum  by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)

2.4 - Ceph

This integration is enabled by default.

List of Alerts:

  • [Ceph] Ceph Manager is absent: Ceph Manager has disappeared from Prometheus target discovery (Prometheus)
  • [Ceph] Ceph Manager is missing replicas: Ceph Manager is missing replicas (Prometheus)
  • [Ceph] Ceph quorum at risk: Storage cluster quorum is low. Contact Support. (Prometheus)
  • [Ceph] High number of leader changes: Ceph Monitor has seen a lot of leader changes per minute recently (Prometheus)

List of Dashboards:

  • Ceph

List of Metrics:

  • ceph_cluster_total_bytes
  • ceph_cluster_total_used_bytes
  • ceph_health_status
  • ceph_mgr_status
  • ceph_mon_metadata
  • ceph_mon_num_elections
  • ceph_mon_quorum_status
  • ceph_osd_apply_latency_ms
  • ceph_osd_commit_latency_ms
  • ceph_osd_in
  • ceph_osd_metadata
  • ceph_osd_numpg
  • ceph_osd_op_r
  • ceph_osd_op_r_latency_count
  • ceph_osd_op_r_latency_sum
  • ceph_osd_op_r_out_bytes
  • ceph_osd_op_w
  • ceph_osd_op_w_in_bytes
  • ceph_osd_op_w_latency_count
  • ceph_osd_op_w_latency_sum
  • ceph_osd_recovery_bytes
  • ceph_osd_recovery_ops
  • ceph_osd_up
  • ceph_pool_max_avail

2.5 - Consul

This integration is enabled by default.

List of Alerts:

  • [Consul] KV Store update time anomaly: KV Store update time anomaly (Prometheus)
  • [Consul] Transaction time anomaly: Transaction time anomaly (Prometheus)
  • [Consul] Raft transactions count anomaly: Raft transactions count anomaly (Prometheus)
  • [Consul] Raft commit time anomaly: Raft commit time anomaly (Prometheus)
  • [Consul] Leader time to contact followers too high: Leader time to contact followers too high (Prometheus)
  • [Consul] Flapping leadership: Flapping leadership (Prometheus)
  • [Consul] Too many elections: Too many elections (Prometheus)
  • [Consul] Server cluster unhealthy: Server cluster unhealthy (Prometheus)
  • [Consul] Zero failure tolerance: Zero failure tolerance (Prometheus)
  • [Consul] Client RPC requests anomaly: Consul client RPC requests anomaly (Prometheus)
  • [Consul] Client RPC requests rate limit exceeded: Consul client RPC requests rate limit exceeded (Prometheus)
  • [Consul] Client RPC requests failed: Consul client RPC requests failed (Prometheus)
  • [Consul] License Expiry: Consul License Expiry (Prometheus)
  • [Consul] Garbage Collection pause high: Consul Garbage Collection pause high (Prometheus)
  • [Consul] Garbage Collection pause too high: Consul Garbage Collection pause too high (Prometheus)
  • [Consul] Raft restore duration too high: Consul Raft restore duration too high (Prometheus)
  • [Consul] RPC requests error rate is high: Consul RPC requests error rate is high (Prometheus)
  • [Consul] Cache hit rate is low: Consul Cache hit rate is low (Prometheus)
  • [Consul] High 4xx RequestError Rate: High 4xx RequestError Rate (Prometheus)
  • [Consul] High Request Latency: Envoy High Request Latency (Prometheus)
  • [Consul] High Response Latency: Envoy High Response Latency (Prometheus)
  • [Consul] Certificate close to expire: Certificate close to expire (Prometheus)

List of Dashboards:

  • Consul
  • Consul Envoy

List of Metrics:

  • consul_autopilot_failure_tolerance
  • consul_autopilot_healthy
  • consul_client_rpc
  • consul_client_rpc_exceeded
  • consul_client_rpc_failed
  • consul_consul_cache_bypass
  • consul_consul_cache_entries_count
  • consul_consul_cache_evict_expired
  • consul_consul_cache_fetch_error
  • consul_consul_cache_fetch_success
  • consul_kvs_apply_sum
  • consul_raft_apply
  • consul_raft_commitTime_sum
  • consul_raft_fsm_lastRestoreDuration
  • consul_raft_leader_lastContact
  • consul_raft_leader_oldestLogAge
  • consul_raft_rpc_installSnapshot
  • consul_raft_state_candidate
  • consul_raft_state_leader
  • consul_rpc_cross_dc
  • consul_rpc_queries_blocking
  • consul_rpc_query
  • consul_rpc_request
  • consul_rpc_request_error
  • consul_runtime_gc_pause_ns
  • consul_runtime_gc_pause_ns_sum
  • consul_system_licenseExpiration
  • consul_txn_apply_sum
  • envoy_cluster_membership_change
  • envoy_cluster_membership_healthy
  • envoy_cluster_membership_total
  • envoy_cluster_upstream_cx_active
  • envoy_cluster_upstream_cx_connect_ms_bucket
  • envoy_cluster_upstream_rq_active
  • envoy_cluster_upstream_rq_pending_active
  • envoy_cluster_upstream_rq_time_bucket
  • envoy_cluster_upstream_rq_xx
  • envoy_server_days_until_first_cert_expiring
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds

2.6 - Elasticsearch

This integration is enabled by default.

List of Alerts:

  • [Elasticsearch] Heap Usage Too High: The heap usage is over 90% (Prometheus)
  • [Elasticsearch] Heap Usage Warning: The heap usage is over 80% (Prometheus)
  • [Elasticsearch] Disk Space Low: Disk available less than 20% (Prometheus)
  • [Elasticsearch] Disk Out Of Space: Disk available less than 10% (Prometheus)
  • [Elasticsearch] Cluster Red: Cluster in Red status (Prometheus)
  • [Elasticsearch] Cluster Yellow: Cluster in Yellow status (Prometheus)
  • [Elasticsearch] Relocation Shards: Relocating shards for too long (Prometheus)
  • [Elasticsearch] Initializing Shards: Initializing shards takes too long (Prometheus)
  • [Elasticsearch] Unassigned Shards: Unassigned shards for long time (Prometheus)
  • [Elasticsearch] Pending Tasks: Elasticsearch has a high number of pending tasks (Prometheus)
  • [Elasticsearch] No New Documents: Elasticsearch has no new documents for a period of time (Prometheus)

List of Dashboards:

  • ElasticSearch Cluster
  • ElasticSearch Infra

List of Metrics:

  • elasticsearch_cluster_health_active_primary_shards
  • elasticsearch_cluster_health_active_shards
  • elasticsearch_cluster_health_initializing_shards
  • elasticsearch_cluster_health_number_of_data_nodes
  • elasticsearch_cluster_health_number_of_nodes
  • elasticsearch_cluster_health_number_of_pending_tasks
  • elasticsearch_cluster_health_relocating_shards
  • elasticsearch_cluster_health_status
  • elasticsearch_cluster_health_unassigned_shards
  • elasticsearch_filesystem_data_available_bytes
  • elasticsearch_filesystem_data_size_bytes
  • elasticsearch_indices_docs
  • elasticsearch_indices_indexing_index_time_seconds_total
  • elasticsearch_indices_indexing_index_total
  • elasticsearch_indices_merges_total_time_seconds_total
  • elasticsearch_indices_search_query_time_seconds
  • elasticsearch_indices_store_throttle_time_seconds_total
  • elasticsearch_jvm_gc_collection_seconds_count
  • elasticsearch_jvm_gc_collection_seconds_sum
  • elasticsearch_jvm_memory_committed_bytes
  • elasticsearch_jvm_memory_max_bytes
  • elasticsearch_jvm_memory_pool_peak_used_bytes
  • elasticsearch_jvm_memory_used_bytes
  • elasticsearch_os_load1
  • elasticsearch_os_load15
  • elasticsearch_os_load5
  • elasticsearch_process_cpu_percent
  • elasticsearch_transport_rx_size_bytes_total
  • elasticsearch_transport_tx_size_bytes_total

2.7 - Fluentd

This integration is enabled by default.

List of Alerts:

  • [Fluentd] No Input From Container: No Input From Container (Prometheus)
  • [Fluentd] High Error Ratio: High Error Ratio (Prometheus)
  • [Fluentd] High Retry Ratio: High Retry Ratio (Prometheus)
  • [Fluentd] High Retry Wait: High Retry Wait (Prometheus)
  • [Fluentd] Low Buffer Available Space: Low Buffer Available Space (Prometheus)
  • [Fluentd] Buffer Queue Length Increasing: Buffer Queue Length Increasing (Prometheus)
  • [Fluentd] Buffer Total Bytes Increasing: Buffer Total Bytes Increasing (Prometheus)
  • [Fluentd] High Slow Flush Ratio: High Slow Flush Ratio (Prometheus)
  • [Fluentd] No Output Records From Plugin: No Output Records From Plugin (Prometheus)

List of Dashboards:

  • Fluentd

List of Metrics:

  • fluentd_input_status_num_records_total
  • fluentd_output_status_buffer_available_space_ratio
  • fluentd_output_status_buffer_queue_length
  • fluentd_output_status_buffer_total_bytes
  • fluentd_output_status_emit_count
  • fluentd_output_status_emit_records
  • fluentd_output_status_flush_time_count
  • fluentd_output_status_num_errors
  • fluentd_output_status_retry_count
  • fluentd_output_status_retry_wait
  • fluentd_output_status_rollback_count
  • fluentd_output_status_slow_flush_count

2.8 - Go

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

  • [Go] Slow Garbage Collector: Garbage collector took too long (Prometheus)
  • [Go] Few Free File Descriptors: Few free file descriptors (Prometheus)

List of Dashboards:

  • Go Internals

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds

2.9 - HAProxy Ingress

This integration is enabled by default.

List of Alerts:

  • [Haproxy-Ingress] Uptime less than 1 hour: This alert detects when all of the instances of the ingress controller have an uptime of less than 1 hour (Prometheus)
  • [Haproxy-Ingress] Frontend Down: This alert detects when a frontend has all of its instances down for more than 10 minutes (Prometheus)
  • [Haproxy-Ingress] Backend Down: This alert detects when a backend has all of its instances down for more than 10 minutes (Prometheus)
  • [Haproxy-Ingress] High Sessions Usage: This alert triggers when the backend sessions exceed 85% of the session capacity for 10 minutes (Prometheus)
  • [Haproxy-Ingress] High Error Rate: This alert triggers when there is an error rate over 15% for over 10 minutes in a proxy (Prometheus)
  • [Haproxy-Ingress] High Request Denied Rate: This alert detects when there is a denied rate of requests over 10% for over 10 minutes in a proxy (Prometheus)
  • [Haproxy-Ingress] High Response Denied Rate: This alert detects when there is a denied rate of responses over 10% for over 10 minutes in a proxy (Prometheus)
  • [Haproxy-Ingress] High Response Rate: This alert triggers when a proxy has a mean response time higher than 250ms for over 10 minutes (Prometheus)

List of Dashboards:

  • HAProxy Ingress Overview
  • HAProxy Ingress Service Details

List of Metrics:

  • haproxy_backend_bytes_in_total
  • haproxy_backend_bytes_out_total
  • haproxy_backend_client_aborts_total
  • haproxy_backend_connect_time_average_seconds
  • haproxy_backend_current_queue
  • haproxy_backend_http_requests_total
  • haproxy_backend_http_responses_total
  • haproxy_backend_limit_sessions
  • haproxy_backend_queue_time_average_seconds
  • haproxy_backend_requests_denied_total
  • haproxy_backend_response_time_average_seconds
  • haproxy_backend_responses_denied_total
  • haproxy_backend_sessions_total
  • haproxy_backend_status
  • haproxy_frontend_bytes_in_total
  • haproxy_frontend_bytes_out_total
  • haproxy_frontend_connections_total
  • haproxy_frontend_denied_connections_total
  • haproxy_frontend_denied_sessions_total
  • haproxy_frontend_request_errors_total
  • haproxy_frontend_requests_denied_total
  • haproxy_frontend_responses_denied_total
  • haproxy_frontend_status
  • haproxy_process_active_peers
  • haproxy_process_current_connection_rate
  • haproxy_process_current_run_queue
  • haproxy_process_current_session_rate
  • haproxy_process_current_tasks
  • haproxy_process_jobs
  • haproxy_process_ssl_connections_total
  • haproxy_process_start_time_seconds

2.10 - HAProxy Ingress OpenShift

This integration is enabled by default.

List of Alerts:

  • [OpenShift-HAProxy-Router] Router Down: Router HAProxy is down; no instances running (Prometheus)
  • [OpenShift-HAProxy-Router] Percentage of routers low: Less than 75% of Routers are up (Prometheus)
  • [OpenShift-HAProxy-Router] Route Down: This alert detects if all servers are down in a route (Prometheus)
  • [OpenShift-HAProxy-Router] High Latency: This alert detects high latency in at least one server of the route (Prometheus)
  • [OpenShift-HAProxy-Router] Pod Health Check Failure: This alert triggers when there is a recurrent pod health check failure (Prometheus)
  • [OpenShift-HAProxy-Router] Queue not empty in route: This alert triggers when a queue is not empty in a route (Prometheus)
  • [OpenShift-HAProxy-Router] High error rate in route: This alert triggers when the error rate in a route is higher than 15% (Prometheus)
  • [OpenShift-HAProxy-Router] Connection errors in route: This alert triggers when there are recurring connection errors in a route (Prometheus)

List of Dashboards:

  • OpenShift HAProxy Ingress Overview
  • OpenShift HAProxy Ingress Service Details

List of Metrics:

  • haproxy_backend_http_average_connect_latency_milliseconds
  • haproxy_backend_http_average_queue_latency_milliseconds
  • haproxy_backend_http_average_response_latency_milliseconds
  • haproxy_backend_up
  • haproxy_frontend_bytes_in_total
  • haproxy_frontend_bytes_out_total
  • haproxy_frontend_connections_total
  • haproxy_frontend_current_session_rate
  • haproxy_frontend_http_responses_total
  • haproxy_process_cpu_seconds_total
  • haproxy_process_max_fds
  • haproxy_process_resident_memory_bytes
  • haproxy_process_start_time_seconds
  • haproxy_process_virtual_memory_bytes
  • haproxy_server_bytes_in_total
  • haproxy_server_bytes_out_total
  • haproxy_server_check_failures_total
  • haproxy_server_connection_errors_total
  • haproxy_server_connections_total
  • haproxy_server_current_queue
  • haproxy_server_current_sessions
  • haproxy_server_downtime_seconds_total
  • haproxy_server_http_average_response_latency_milliseconds
  • haproxy_server_http_responses_total
  • haproxy_server_up
  • kube_workload_status_desired

2.11 - Harbor

This integration is enabled by default.

List of Alerts:

  • [Harbor] Harbor Core Is Down: Harbor Core Is Down (Prometheus)
  • [Harbor] Harbor Database Is Down: Harbor Database Is Down (Prometheus)
  • [Harbor] Harbor Registry Is Down: Harbor Registry Is Down (Prometheus)
  • [Harbor] Harbor Redis Is Down: Harbor Redis Is Down (Prometheus)
  • [Harbor] Harbor Trivy Is Down: Harbor Trivy Is Down (Prometheus)
  • [Harbor] Harbor JobService Is Down: Harbor JobService Is Down (Prometheus)
  • [Harbor] Project Quota Is Raising The Limit: Project Quota Is Raising The Limit (Prometheus)
  • [Harbor] Harbor p99 latency is higher than 10 seconds: Harbor p99 latency is higher than 10 seconds (Prometheus)
  • [Harbor] Harbor Error Rate is High: Harbor Error Rate is High (Prometheus)

List of Dashboards:

  • Harbor

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • harbor_artifact_pulled
  • harbor_core_http_request_duration_seconds
  • harbor_jobservice_task_process_time_seconds
  • harbor_project_member_total
  • harbor_project_quota_byte
  • harbor_project_quota_usage_byte
  • harbor_project_repo_total
  • harbor_project_total
  • harbor_quotas_size_bytes
  • harbor_task_concurrency
  • harbor_task_queue_latency
  • harbor_task_queue_size
  • harbor_up
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • registry_http_request_duration_seconds_bucket
  • registry_http_request_size_bytes_bucket
  • registry_http_requests_total
  • registry_http_response_size_bytes_bucket
  • registry_storage_action_seconds_bucket

2.12 - Istio

This integration is enabled by default.

List of Alerts:

  • [Istio-Citadel] CSR without success: Some of the Certificate Signing Requests (CSR) were not correctly requested (Prometheus)
  • [Istio-Pilot] Inbound listener rules conflicts: There are some conflicts with inbound listener rules (Prometheus)
  • [Istio-Pilot] Endpoint found in unready state: Endpoint found in unready state (Prometheus)
  • [Istio] Unstable requests for sidecar injections: Sidecar injection requests are failing (Prometheus)

List of Dashboards:

  • Istio v1.14 Control Plane

List of Metrics:

  • citadel_server_csr_count
  • citadel_server_success_cert_issuance_count
  • galley_validation_failed
  • galley_validation_passed
  • pilot_conflict_inbound_listener
  • pilot_conflict_outbound_listener_http_over_current_tcp
  • pilot_conflict_outbound_listener_tcp_over_current_http
  • pilot_conflict_outbound_listener_tcp_over_current_tcp
  • pilot_endpoint_not_ready
  • pilot_services
  • pilot_total_xds_internal_errors
  • pilot_total_xds_rejects
  • pilot_virt_services
  • pilot_xds
  • pilot_xds_cds_reject
  • pilot_xds_config_size_bytes_bucket
  • pilot_xds_eds_reject
  • pilot_xds_lds_reject
  • pilot_xds_push_context_errors
  • pilot_xds_push_time_bucket
  • pilot_xds_pushes
  • pilot_xds_rds_reject
  • pilot_xds_send_time_bucket
  • pilot_xds_write_timeout
  • sidecar_injection_failure_total
  • sidecar_injection_requests_total
  • sidecar_injection_success_total

2.13 - Istio Envoy

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

  • [Istio-Envoy] High 4xx RequestError Rate: 4xx RequestError Rate is higher than 5% (Prometheus)
  • [Istio-Envoy] High 5xx RequestError Rate: 5xx RequestError Rate is higher than 5% (Prometheus)
  • [Istio-Envoy] High Request Latency: Envoy Request Latency is higher than 100ms (Prometheus)

List of Dashboards:

  • Istio v1.14 Workload
  • Istio v1.14 Service

List of Metrics:

  • envoy_cluster_membership_change
  • envoy_cluster_membership_healthy
  • envoy_cluster_membership_total
  • envoy_cluster_upstream_cx_active
  • envoy_cluster_upstream_cx_connect_ms_bucket
  • envoy_cluster_upstream_rq_active
  • envoy_cluster_upstream_rq_pending_active
  • envoy_server_days_until_first_cert_expiring
  • istio_request_bytes_bucket
  • istio_request_duration_milliseconds_bucket
  • istio_requests_total
  • istio_response_bytes_bucket
  • istio_tcp_received_bytes_total
  • istio_tcp_sent_bytes_total

2.14 - Kafka

This integration is enabled by default.

List of Alerts:

  • [Kafka] Broker Down: There are fewer Kafka brokers up than expected. The ‘workload’ label of the Kafka Deployment/StatefulSet must be specified. (Prometheus)
  • [Kafka] No Leader: There is no ActiveController or ‘leader’ in the Kafka cluster. (Prometheus)
  • [Kafka] Too Many Leaders: There is more than one ActiveController or ‘leader’ in the Kafka cluster. (Prometheus)
  • [Kafka] Offline Partitions: There are one or more Offline Partitions. These partitions don’t have an active leader and are hence not writable or readable. (Prometheus)
  • [Kafka] Under Replicated Partitions: There are one or more Under Replicated Partitions. (Prometheus)
  • [Kafka] Under In-Sync Replicated Partitions: There are one or more Under In-Sync Replicated Partitions. These partitions will be unavailable to producers who use ‘acks=all’. (Prometheus)
  • [Kafka] ConsumerGroup Lag Not Decreasing: The ConsumerGroup lag is not decreasing. The Consumers might be down, failing to process the messages and continuously retrying, or their consumption rate is lower than the production rate of messages. (Prometheus)
  • [Kafka] ConsumerGroup Without Members: The ConsumerGroup doesn’t have any members. (Prometheus)
  • [Kafka] Producer High ThrottleTime By Client-Id: The Producer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. (Prometheus)
  • [Kafka] Producer High ThrottleTime By User: The Producer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. (Prometheus)
  • [Kafka] Producer High ThrottleTime By User And Client-Id: The Producer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. (Prometheus)
  • [Kafka] Consumer High ThrottleTime By Client-Id: The Consumer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. (Prometheus)
  • [Kafka] Consumer High ThrottleTime By User: The Consumer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. (Prometheus)
  • [Kafka] Consumer High ThrottleTime By User And Client-Id: The Consumer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. (Prometheus)

List of Dashboards:

  • Kafka

List of Metrics:

  • kafka_brokers
  • kafka_consumergroup_current_offset
  • kafka_consumergroup_lag
  • kafka_consumergroup_members
  • kafka_controller_active_controller
  • kafka_controller_offline_partitions
  • kafka_log_size
  • kafka_network_consumer_request_time_milliseconds
  • kafka_network_fetch_follower_time_milliseconds
  • kafka_network_producer_request_time_milliseconds
  • kafka_server_bytes_in
  • kafka_server_bytes_out
  • kafka_server_consumer_client_byterate
  • kafka_server_consumer_client_throttle_time
  • kafka_server_consumer_user_byterate
  • kafka_server_consumer_user_client_byterate
  • kafka_server_consumer_user_client_throttle_time
  • kafka_server_consumer_user_throttle_time
  • kafka_server_messages_in
  • kafka_server_partition_leader_count
  • kafka_server_producer_client_byterate
  • kafka_server_producer_client_throttle_time
  • kafka_server_producer_user_byterate
  • kafka_server_producer_user_client_byterate
  • kafka_server_producer_user_client_throttle_time
  • kafka_server_producer_user_throttle_time
  • kafka_server_under_isr_partitions
  • kafka_server_under_replicated_partitions
  • kafka_server_zookeeper_auth_failures
  • kafka_server_zookeeper_disconnections
  • kafka_server_zookeeper_expired_sessions
  • kafka_server_zookeeper_read_only_connections
  • kafka_server_zookeeper_sasl_authentications
  • kafka_server_zookeeper_sync_connections
  • kafka_topic_partition_current_offset
  • kafka_topic_partition_oldest_offset
  • kube_workload_status_desired

Monitoring and troubleshooting Kafka

Here are some interesting metrics and queries to monitor and troubleshoot Kafka.

Brokers

Broker Down:

Let’s get the number of expected Brokers and the actual number of Brokers up and running. If the numbers don’t match, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kafka_brokers)
> 0

Leadership

Let’s get the number of Kafka leaders. There should always be one leader. If not, a Kafka misconfiguration or a networking issue might be the problem.

sum(kafka_controller_active_controller) < 1

If there is more than one leader, it might be a temporary situation while leadership is changing. If this doesn’t fix itself over time, a split-brain situation might be happening.

sum(kafka_controller_active_controller) > 1

Offline, Under Replicated and In-Sync Under Replicated Partitions:

When a Broker goes down, the other Brokers in the cluster will take leadership of the partitions it was leading. If several brokers go down, or just a few but the topic had a low replication factor, there will be Offline partitions. These partitions don’t have an active leader and are hence not writable or readable, which is most likely dangerous for the business.

Let’s check if there are offline partitions:

sum(kafka_controller_offline_partitions) > 0

If other Brokers had replicas of those partitions, one of them will take leadership and the service won’t be down. In this situation there will be Under Replicated partitions. If there are enough Brokers where these partitions can be replicated, the situation will be fixed by itself over time. If there aren’t enough Brokers, the situation will only be fixed once the Brokers which went down come up again.

The following expression returns the Under Replicated partitions:

sum(kafka_server_under_replicated_partitions) > 0

But there is a situation where having no Offline partitions but having Under Replicated partitions might pose a real problem. That’s the case of topics with ‘Minimum In-Sync Replicas’ and Kafka Producers with the configuration ‘acks=all’.

If one of these topics has any partition with fewer replicas than its ‘Minimum In-Sync Replicas’ configuration, and there is a Producer with ‘acks=all’, that Producer won’t be able to produce messages into that partition, since ‘acks=all’ means that it waits for the produced messages to be replicated to the minimum number of in-sync replicas in the Kafka cluster.

If the Producers have any configuration different than ‘acks=all’, then there won’t be any problem.

This is how Under In-Sync Replicated partitions can be checked:

sum(kafka_server_under_isr_partitions) > 0

Network

Broker Bytes In:

Let’s get the amount of bytes produced into each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_in)

Broker Bytes Out:

Now the same, but for bytes consumed from each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_out)

Broker Messages In:

And similar, but for number of messages produced into each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_messages_in)

Topics

Topic Size:

This query returns the size of a topic in the whole Kafka cluster. It also includes the size of all replicas, so increasing the replication factor of a topic will increase the overall size across the Kafka cluster.

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size)

If you need the size of a topic in each Broker, use the following query:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_log_size)

In a situation where the Broker disk space is running low, the retention of the topics can be decreased to free up some space. Let’s get the top 10 biggest topics:

topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size))

If this “low disk space” situation happened out of the blue, there might be a problem in a topic with a Producer filling it with unwanted messages. The following query helps find which topics increased their size the most in the past few hours, which makes it easier to find the source of the sudden increase of messages. It wouldn’t be the first time an exhausted developer meant to run a stress test against a topic in a Staging environment, but accidentally did it in Production.

topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(delta(kafka_log_size[$__interval])))

Topic Messages:

Calculating the number of messages inside a topic is as easy as subtracting the offset of the oldest message from the offset of the newest message:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_current_offset) - sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_oldest_offset)

But it’s very important to acknowledge that this is only true for topics with ‘compaction’ disabled, since compacted topics might have deleted messages in the middle. To get the number of messages in a compacted topic, a new Consumer must consume all the messages in that topic to count them.

It’s also quite easy to calculate the rate per second of messages being produced into a topic:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(rate(kafka_topic_partition_current_offset[$__interval]))

ConsumerGroup

ConsumerGroup Lag:

Let’s check the ConsumerGroup lag of a Consumer in each partition of a topic:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(kafka_consumergroup_lag)

If the lag of a ConsumerGroup is constantly increasing and never decreases, it might have different causes. The Consumers of the ConsumerGroups might be down, one of them might be failing to process the messages and continuously retrying, or their consumption rate might be lower than the production rate of messages.

A non-stop increasing lag can be detected using the following expression:

(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(kafka_consumergroup_lag) > 0)
and
(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(delta(kafka_consumergroup_lag[2m])) >= 0)

ConsumerGroup Consumption Rate:

It might be useful to get the consumption speed of the Consumers of a ConsumerGroup, to detect any issues while processing messages, like internal issues related to the messages, or external issues related to the business. For example, the Consumers might want to send the processed messages to another microservice or another database, but there might be networking issues, or the database performance might be degraded so it slows down the Consumer.

Here you can check the consumption rate:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(rate(kafka_consumergroup_current_offset[$__interval]))

ConsumerGroup Members:

It might also help to know the number of Consumers in a ConsumerGroup, in case there are fewer than expected:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup)(kafka_consumergroup_members)
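As a complementary sketch reusing the same metric, a ConsumerGroup left without any members (the situation the ‘[Kafka] ConsumerGroup Without Members’ alert looks for) can be detected with:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup)(kafka_consumergroup_members) == 0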

Quotas

Kafka has the option to enforce quotas on requests to control the Broker resources used by clients (Producers and Consumers).

Quotas can be applied to user, client-id or both groups at the same time.

Each client can utilize this quota per Broker before it gets throttled. Throttling means that the client will need to wait some time before being able to produce or consume messages again.

Production/Consumption Rate:

Depending on whether the client is a Consumer or a Producer, and whether the quota is applied at the client-id level, the user level, or both at the same time, a different metric is used:

  • kafka_server_producer_client_byterate
  • kafka_server_producer_user_byterate
  • kafka_server_producer_user_client_byterate
  • kafka_server_consumer_client_byterate
  • kafka_server_consumer_user_byterate
  • kafka_server_consumer_user_client_byterate

Let’s check for example the production rate of a Producer using both user and client-id:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name, user, client_id)(kafka_server_producer_user_client_byterate)

Production/Consumption Throttle Time:

Similar to the rate, there are throttle time metrics for the same combinations of clients and quota groups:

  • kafka_server_producer_client_throttle_time
  • kafka_server_producer_user_throttle_time
  • kafka_server_producer_user_client_throttle_time
  • kafka_server_consumer_client_throttle_time
  • kafka_server_consumer_user_throttle_time
  • kafka_server_consumer_user_client_throttle_time

Let’s check in this case whether the throttle time of a Consumer using both user and client-id is higher than one second in at least one Broker:

max by(kube_cluster_name, kube_namespace_name, kube_workload_name, user, client_id)(kafka_server_consumer_user_client_throttle_time) > 1000

2.15 - KEDA


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Keda] Errors in Scaled Object | Errors detected in scaled object | Prometheus

List of Dashboards:

  • Keda

List of Metrics:

  • keda_metrics_adapter_scaled_object_errors
  • keda_metrics_adapter_scaler_metrics_value
  • kubernetes.hpa.replicas.current
  • kubernetes.hpa.replicas.desired
  • kubernetes.hpa.replicas.max
  • kubernetes.hpa.replicas.min

2.16 - Kube State Metrics


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Dashboards:

  • KSM Pod Status & Performance
  • KSM Workload Status & Performance
  • KSM Container Resource Usage & Troubleshooting
  • KSM Cluster / Namespace Available Resources

List of Metrics:

  • ksm_container_cpu_cores_used
  • ksm_container_cpu_quota_used_percent
  • ksm_container_info
  • ksm_container_memory_limit_used_percent
  • ksm_container_memory_used_bytes
  • ksm_kube_node_status_allocatable
  • ksm_kube_node_status_capacity
  • ksm_kube_pod_container_status_restarts_total
  • ksm_kube_pod_container_status_terminated_reason
  • ksm_kube_pod_container_status_waiting_reason
  • ksm_kube_pod_status_ready
  • ksm_kube_pod_status_reason
  • ksm_kube_resourcequota
  • ksm_workload_status_desired
  • ksm_workload_status_ready
  • kube_pod_container_cpu_request
  • kube_pod_container_memory_request
  • kube_pod_container_resource_limits_cpu_cores
  • kube_pod_container_resource_limits_memory_bytes
  • kube_pod_status_ready

2.17 - Kubernetes


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[Kubernetes] Container Waiting | Container in waiting status for long time (CrashLoopBackOff, ImagePullErr…) | Prometheus
[Kubernetes] Container Restarting | Container restarting | Prometheus
[Kubernetes] Pod Not Ready | Pod in not ready status | Prometheus
[Kubernetes] Init Container Waiting For a Long Time | Init container in waiting state (CrashLoopBackOff, ImagePullErr…) | Prometheus
[Kubernetes] Pod Container Creating For a Long Time | Pod is stuck in ContainerCreating state | Prometheus
[Kubernetes] Pod Container Terminated With Error | Pod Container Terminated With Error (OOMKilled, Error…) | Prometheus
[Kubernetes] Init Container Terminated With Error | Init Container Terminated With Error (OOMKilled, Error…) | Prometheus
[Kubernetes] Workload with Pods not Ready | Workload with Pods not Ready (Evicted, NodeLost, UnexpectedAdmissionError) | Prometheus
[Kubernetes] Workload Replicas Mismatch | There are pods in the workload that could not start | Prometheus
[Kubernetes] Pod Not Scheduled For DaemonSet | Pods cannot be scheduled for DaemonSet | Prometheus
[Kubernetes] Pods In DaemonSet Incorrectly Scheduled | There are pods from a DaemonSet that should not be running | Prometheus
[Kubernetes] CPU Overcommit | CPU OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus
[Kubernetes] Memory Overcommit | Memory OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus
[Kubernetes] CPU OverUsage | CPU OverUsage in cluster. If one node fails, the cluster will not have enough CPU to run all the current pods. | Prometheus
[Kubernetes] Memory OverUsage | Memory OverUsage in cluster. If one node fails, the cluster will not have enough memory to run all the current pods. | Prometheus
[Kubernetes] Container CPU Throttling | Container CPU usage next to limit. Possible CPU Throttling. | Prometheus
[Kubernetes] Container Memory Next To Limit | Container memory usage next to limit. Risk of Out Of Memory Kill. | Prometheus
[Kubernetes] Container CPU Unused | Container unused CPU higher than 85% of request for 8 hours. | Prometheus
[Kubernetes] Container Memory Unused | Container unused Memory higher than 85% of request for 8 hours. | Prometheus
[Kubernetes] Node Not Ready | Node in Not-Ready condition | Prometheus
[Kubernetes] Too Many Pods In Node | Node close to its limits of pods. | Prometheus
[Kubernetes] Node Readiness Flapping | Node availability is unstable. | Prometheus
[Kubernetes] Nodes Disappeared | Fewer nodes in cluster than 30 minutes before. | Prometheus
[Kubernetes] All Nodes Gone In Cluster | All Nodes Gone In Cluster. | Prometheus
[Kubernetes] Node CPU High Usage | High usage of CPU in node. | Prometheus
[Kubernetes] Node Memory High Usage | High usage of memory in node. Risk of pod eviction. | Prometheus
[Kubernetes] Node Root File System Almost Full | Root file system in node almost full. To include other file systems, change the value of the device label from ‘.root.’ to your device name | Prometheus
[Kubernetes] Max Schedulable Pod Less Than 1 CPU Core | The maximum schedulable CPU request in a pod is less than 1 core. | Prometheus
[Kubernetes] Max Schedulable Pod Less Than 512Mb Memory | The maximum schedulable memory request in a pod is less than 512Mb. | Prometheus
[Kubernetes] HPA Desired Scale Up Replicas Unreached | HPA could not reach the desired scaled up replicas for long time. | Prometheus
[Kubernetes] HPA Desired Scale Down Replicas Unreached | HPA could not reach the desired scaled down replicas for long time. | Prometheus
[Kubernetes] Job failed to complete | Job failed to complete | Prometheus

List of Dashboards:

  • Workload Status & Performance
  • Pod Status & Performance
  • Cluster / Namespace Available Resources
  • Cluster Capacity Planning
  • Container Resource Usage & Troubleshooting
  • Node Status & Performance
  • Pod Rightsizing & Workload Capacity Optimization
  • Pod Scheduling Troubleshooting
  • Horizontal Pod Autoscaler
  • Kubernetes Jobs

List of Metrics:

  • container.image
  • container.image.tag
  • cpu.cores.used
  • kube_cronjob_next_schedule_time
  • kube_cronjob_status_active
  • kube_cronjob_status_last_schedule_time
  • kube_daemonset_status_current_number_scheduled
  • kube_daemonset_status_desired_number_scheduled
  • kube_daemonset_status_number_misscheduled
  • kube_daemonset_status_number_ready
  • kube_hpa_status_current_replicas
  • kube_hpa_status_desired_replicas
  • kube_job_complete
  • kube_job_failed
  • kube_job_spec_completions
  • kube_job_status_active
  • kube_namespace_labels
  • kube_node_info
  • kube_node_status_allocatable
  • kube_node_status_allocatable_cpu_cores
  • kube_node_status_allocatable_memory_bytes
  • kube_node_status_capacity
  • kube_node_status_capacity_cpu_cores
  • kube_node_status_capacity_memory_bytes
  • kube_node_status_capacity_pods
  • kube_node_status_condition
  • kube_node_sysdig_host
  • kube_pod_container_info
  • kube_pod_container_resource_limits
  • kube_pod_container_resource_requests
  • kube_pod_container_status_restarts_total
  • kube_pod_container_status_terminated_reason
  • kube_pod_container_status_waiting_reason
  • kube_pod_info
  • kube_pod_init_container_status_terminated_reason
  • kube_pod_init_container_status_waiting_reason
  • kube_pod_status_ready
  • kube_resourcequota
  • kube_workload_pods_status_reason
  • kube_workload_status_desired
  • kube_workload_status_ready
  • kubernetes.hpa.replicas.current
  • kubernetes.hpa.replicas.desired
  • kubernetes.hpa.replicas.max
  • kubernetes.hpa.replicas.min
  • memory.bytes.used
  • net.bytes.in
  • net.bytes.out
  • net.bytes.total
  • net.connection.count.total
  • net.error.count
  • net.http.error.count
  • net.http.request.time
  • net.request.count
  • net.request.time
  • sysdig_container_cpu_cores_used
  • sysdig_container_cpu_quota_used_percent
  • sysdig_container_info
  • sysdig_container_memory_limit_used_percent
  • sysdig_container_memory_used_bytes
  • sysdig_container_net_connection_in_count
  • sysdig_container_net_connection_out_count
  • sysdig_container_net_error_count
  • sysdig_container_net_http_error_count
  • sysdig_container_net_http_request_time
  • sysdig_container_net_http_statuscode_request_count
  • sysdig_container_net_in_bytes
  • sysdig_container_net_out_bytes
  • sysdig_container_net_request_count
  • sysdig_container_net_request_time
  • sysdig_fs_free_bytes
  • sysdig_fs_inodes_used_percent
  • sysdig_fs_total_bytes
  • sysdig_fs_used_bytes
  • sysdig_fs_used_percent
  • sysdig_program_cpu_used_percent
  • sysdig_program_memory_used_bytes

2.18 - Kubernetes API server


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[Kubernetes API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus
[Kubernetes API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus
[Kubernetes API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus
[Kubernetes API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus
[Kubernetes API Server] High 4xx RequestError Rate | API-Server High 4xx Request Error Rate | Prometheus
[Kubernetes API Server] High 5xx RequestError Rate | API-Server High 5xx Request Error Rate | Prometheus
[Kubernetes API Server] High Request Latency | API-Server High Request Latency | Prometheus

List of Dashboards:

  • Kubernetes API Server

List of Metrics:

  • apiserver_admission_controller_admission_duration_seconds_count
  • apiserver_admission_controller_admission_duration_seconds_sum
  • apiserver_admission_webhook_admission_duration_seconds_count
  • apiserver_admission_webhook_admission_duration_seconds_sum
  • apiserver_client_certificate_expiration_seconds_bucket
  • apiserver_client_certificate_expiration_seconds_count
  • apiserver_request_duration_seconds_count
  • apiserver_request_duration_seconds_sum
  • apiserver_request_total
  • apiserver_requested_deprecated_apis
  • apiserver_response_sizes_count
  • apiserver_response_sizes_sum
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • process_resident_memory_bytes
  • workqueue_adds_total
  • workqueue_depth

2.19 - Kubernetes controller manager


This integration is enabled by default.

List of Dashboards:

  • Kubernetes Controller Manager

List of Metrics:

  • cloudprovider_aws_api_request_duration_seconds_count
  • cloudprovider_aws_api_request_duration_seconds_sum
  • cloudprovider_aws_api_request_errors
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • rest_client_request_duration_seconds_count
  • rest_client_request_duration_seconds_sum
  • rest_client_requests_total
  • sysdig_container_cpu_cores_used
  • sysdig_container_memory_used_bytes
  • workqueue_adds_total
  • workqueue_depth
  • workqueue_queue_duration_seconds_count
  • workqueue_queue_duration_seconds_sum
  • workqueue_retries_total
  • workqueue_unfinished_work_seconds
  • workqueue_work_duration_seconds_count
  • workqueue_work_duration_seconds_sum

2.20 - Kubernetes CoreDNS


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[CoreDNS] Error High | High Request Duration | Prometheus
[CoreDNS] Latency High | Latency High | Prometheus

List of Dashboards:

  • Kubernetes CoreDNS

List of Metrics:

  • coredns_cache_hits_total
  • coredns_cache_misses_total
  • coredns_dns_request_duration_seconds_bucket
  • coredns_dns_request_size_bytes_bucket
  • coredns_dns_requests_total
  • coredns_dns_response_size_bytes_bucket
  • coredns_dns_responses_total
  • coredns_forward_request_duration_seconds_bucket
  • coredns_panics_total
  • coredns_plugin_enabled
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • process_resident_memory_bytes

2.21 - Kubernetes etcd


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Etcd] Etcd Members Down | There are members down. | Prometheus
[Etcd] Etcd Insufficient Members | Etcd cluster has insufficient members | Prometheus
[Etcd] Etcd No Leader | Member has no leader. | Prometheus
[Etcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus
[Etcd] Etcd High Number Of Failed GRPC Requests | High number of failed grpc requests | Prometheus
[Etcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time | Prometheus
[Etcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance | Prometheus
[Etcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high | Prometheus
[Etcd] Etcd High Commit Durations | 99th percentile commit durations are too high | Prometheus
[Etcd] Etcd HighNumber Of Failed HTTP Requests | High number of failed http requests | Prometheus
[Etcd] Etcd HTTP Requests Slow | HTTP requests are slow | Prometheus

List of Dashboards:

  • Kubernetes Etcd

List of Metrics:

  • etcd_debugging_mvcc_db_total_size_in_bytes
  • etcd_disk_backend_commit_duration_seconds_bucket
  • etcd_disk_wal_fsync_duration_seconds_bucket
  • etcd_grpc_proxy_cache_hits_total
  • etcd_grpc_proxy_cache_misses_total
  • etcd_http_failed_total
  • etcd_http_received_total
  • etcd_http_successful_duration_seconds_bucket
  • etcd_mvcc_db_total_size_in_bytes
  • etcd_network_client_grpc_received_bytes_total
  • etcd_network_client_grpc_sent_bytes_total
  • etcd_network_peer_received_bytes_total
  • etcd_network_peer_received_failures_total
  • etcd_network_peer_round_trip_time_seconds_bucket
  • etcd_network_peer_sent_bytes_total
  • etcd_network_peer_sent_failures_total
  • etcd_server_has_leader
  • etcd_server_id
  • etcd_server_leader_changes_seen_total
  • etcd_server_proposals_applied_total
  • etcd_server_proposals_committed_total
  • etcd_server_proposals_failed_total
  • etcd_server_proposals_pending
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • grpc_server_handled_total
  • grpc_server_handling_seconds_bucket
  • grpc_server_started_total
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • sysdig_container_cpu_cores_used
  • sysdig_container_memory_used_bytes

2.22 - Kubernetes kube-proxy


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[KubeProxy] Kube Proxy Down | KubeProxy detected down | Prometheus
[KubeProxy] High Rest Client Latency | High Rest Client Latency detected | Prometheus
[KubeProxy] High Rule Sync Latency | High Rule Sync Latency detected | Prometheus
[KubeProxy] Too Many 500 Code | Too Many 500 Code detected | Prometheus

List of Dashboards:

  • Kubernetes Proxy

List of Metrics:

  • go_goroutines
  • kube_node_info
  • kubeproxy_network_programming_duration_seconds_bucket
  • kubeproxy_network_programming_duration_seconds_count
  • kubeproxy_sync_proxy_rules_duration_seconds_bucket
  • kubeproxy_sync_proxy_rules_duration_seconds_count
  • process_cpu_seconds_total
  • process_resident_memory_bytes
  • rest_client_request_duration_seconds_bucket
  • rest_client_requests_total

2.23 - Kubernetes kubelet


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[k8s-kubelet] Kubelet Too Many Pods | Kubelet Too Many Pods | Prometheus
[k8s-kubelet] Kubelet Pod Lifecycle Event Generator Duration High | Kubelet Pod Lifecycle Event Generator Duration High | Prometheus
[k8s-kubelet] Kubelet Pod StartUp Latency High | Kubelet Pod StartUp Latency High | Prometheus
[k8s-kubelet] Kubelet Down | Kubelet Down | Prometheus

List of Dashboards:

  • Kubernetes Kubelet

List of Metrics:

  • go_goroutines
  • kube_node_status_capacity_pods
  • kube_node_status_condition
  • kubelet_cgroup_manager_duration_seconds_bucket
  • kubelet_cgroup_manager_duration_seconds_count
  • kubelet_node_config_error
  • kubelet_pleg_relist_duration_seconds_bucket
  • kubelet_pleg_relist_interval_seconds_bucket
  • kubelet_pod_start_duration_seconds_bucket
  • kubelet_pod_start_duration_seconds_count
  • kubelet_pod_worker_duration_seconds_bucket
  • kubelet_pod_worker_duration_seconds_count
  • kubelet_running_containers
  • kubelet_running_pod_count
  • kubelet_running_pods
  • kubelet_runtime_operations_duration_seconds_bucket
  • kubelet_runtime_operations_errors_total
  • kubelet_runtime_operations_total
  • process_cpu_seconds_total
  • process_resident_memory_bytes
  • rest_client_request_duration_seconds_bucket
  • rest_client_requests_total
  • storage_operation_duration_seconds_bucket
  • storage_operation_duration_seconds_count
  • storage_operation_errors_total
  • storage_operation_status_count
  • volume_manager_total_volumes

2.24 - Kubernetes PVC


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[k8s-pvc] PV Not Available | Persistent Volume not available | Prometheus
[k8s-pvc] PVC Pending For a Long Time | Persistent Volume Claim not available | Prometheus
[k8s-pvc] PVC Lost | Persistent Volume Claim lost | Prometheus
[k8s-pvc] PVC Storage Usage Is Reaching The Limit | Persistent Volume Claim storage at 95% | Prometheus
[k8s-pvc] PVC Inodes Usage Is Reaching The Limit | PVC inodes Usage Is Reaching The Limit | Prometheus
[k8s-pvc] PV Full In Four Days | Persistent Volume Full In Four Days | Prometheus

List of Dashboards:

  • PVC and Storage

List of Metrics:

  • kube_persistentvolume_status_phase
  • kube_persistentvolumeclaim_status_phase
  • kubelet_volume_stats_available_bytes
  • kubelet_volume_stats_capacity_bytes
  • kubelet_volume_stats_inodes
  • kubelet_volume_stats_inodes_used
  • kubelet_volume_stats_used_bytes
  • storage_operation_duration_seconds_bucket
  • storage_operation_errors_total
  • storage_operation_status_count

2.25 - Kubernetes Scheduler


This integration is enabled by default.

List of Dashboards:

  • Kubernetes Scheduler

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • rest_client_request_duration_seconds_count
  • rest_client_request_duration_seconds_sum
  • rest_client_requests_total
  • scheduler_e2e_scheduling_duration_seconds_count
  • scheduler_e2e_scheduling_duration_seconds_sum
  • scheduler_pending_pods
  • scheduler_pod_scheduling_attempts_count
  • scheduler_pod_scheduling_attempts_sum
  • scheduler_schedule_attempts_total
  • sysdig_container_cpu_cores_used
  • sysdig_container_memory_used_bytes
  • workqueue_adds_total
  • workqueue_depth
  • workqueue_queue_duration_seconds_count
  • workqueue_queue_duration_seconds_sum
  • workqueue_retries_total
  • workqueue_unfinished_work_seconds
  • workqueue_work_duration_seconds_count
  • workqueue_work_duration_seconds_sum

2.26 - Kubernetes storage


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[k8s-storage] High Storage Error Rate | High Storage Error Rate | Prometheus
[k8s-storage] High Storage Latency | High Storage Latency | Prometheus

List of Metrics:

  • kube_persistentvolume_status_phase
  • kube_persistentvolumeclaim_status_phase
  • kubelet_volume_stats_capacity_bytes
  • kubelet_volume_stats_inodes
  • kubelet_volume_stats_inodes_used
  • kubelet_volume_stats_used_bytes
  • storage_operation_duration_seconds_bucket
  • storage_operation_errors_total
  • storage_operation_status_count

2.27 - Memcached


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Memcached] Instance Down | Instance is not reachable | Prometheus
[Memcached] Low UpTime | Uptime of less than 1 hour in a Memcached instance | Prometheus
[Memcached] Connection Throttled | Connection throttled because max number of requests per event process reached | Prometheus
[Memcached] Connections Close To The Limit 85% | The number of connections is close to the limit | Prometheus
[Memcached] Connections Limit Reached | Reached the number of maximum connections and caused a connection error | Prometheus

List of Dashboards:

  • Memcached

List of Metrics:

  • memcached_commands_total
  • memcached_connections_listener_disabled_total
  • memcached_connections_yielded_total
  • memcached_current_bytes
  • memcached_current_connections
  • memcached_current_items
  • memcached_items_evicted_total
  • memcached_items_reclaimed_total
  • memcached_items_total
  • memcached_limit_bytes
  • memcached_max_connections
  • memcached_up
  • memcached_uptime_seconds

2.28 - MongoDB


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[MongoDB] Instance Down | Mongo server detected down by instance | Prometheus
[MongoDB] Uptime less than one hour | Mongo server detected down by instance | Prometheus
[MongoDB] Asserts detected | Mongo server detected down by instance | Prometheus
[MongoDB] High Latency | High latency in instance | Prometheus
[MongoDB] High Ticket Utilization | Ticket usage over 75% in instance | Prometheus
[MongoDB] Recurrent Cursor Timeout | Recurrent cursors timeout in instance | Prometheus
[MongoDB] Recurrent Memory Page Faults | Recurrent cursors timeout in instance | Prometheus

List of Dashboards:

  • MongoDB Instance Health
  • MongoDB Database Details

List of Metrics:

  • mongodb_asserts_total
  • mongodb_connections
  • mongodb_extra_info_page_faults_total
  • mongodb_instance_uptime_seconds
  • mongodb_memory
  • mongodb_mongod_db_collections_total
  • mongodb_mongod_db_data_size_bytes
  • mongodb_mongod_db_index_size_bytes
  • mongodb_mongod_db_indexes_total
  • mongodb_mongod_db_objects_total
  • mongodb_mongod_global_lock_client
  • mongodb_mongod_global_lock_current_queue
  • mongodb_mongod_global_lock_ratio
  • mongodb_mongod_metrics_cursor_open
  • mongodb_mongod_metrics_cursor_timed_out_total
  • mongodb_mongod_op_latencies_latency_total
  • mongodb_mongod_op_latencies_ops_total
  • mongodb_mongod_wiredtiger_cache_bytes
  • mongodb_mongod_wiredtiger_cache_bytes_total
  • mongodb_mongod_wiredtiger_cache_evicted_total
  • mongodb_mongod_wiredtiger_cache_pages
  • mongodb_mongod_wiredtiger_concurrent_transactions_out_tickets
  • mongodb_mongod_wiredtiger_concurrent_transactions_total_tickets
  • mongodb_network_bytes_total
  • mongodb_network_metrics_num_requests_total
  • mongodb_op_counters_total
  • mongodb_up
  • net.error.count

2.29 - MySQL


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[MySQL] Mysql Down | MySQL instance is down | Prometheus
[MySQL] Mysql Restarted | MySQL has just been restarted, less than one minute ago | Prometheus
[MySQL] Mysql Too many Connections (>80%) | More than 80% of MySQL connections are in use | Prometheus
[MySQL] Mysql High Threads Running | More than 60% of MySQL connections are in running state | Prometheus
[MySQL] Mysql HighOpen Files | More than 80% of MySQL files open | Prometheus
[MySQL] Mysql Slow Queries | MySQL server mysql has some new slow query | Prometheus
[MySQL] Mysql Innodb Log Waits | MySQL innodb log writes stalling | Prometheus
[MySQL] Mysql Slave Io Thread Not Running | MySQL Slave IO thread not running | Prometheus
[MySQL] Mysql Slave Sql Thread Not Running | MySQL Slave SQL thread not running | Prometheus
[MySQL] Mysql Slave Replication Lag | MySQL Slave replication lag | Prometheus

List of Dashboards:

  • MySQL

List of Metrics:

  • mysql_global_status_aborted_clients
  • mysql_global_status_aborted_connects
  • mysql_global_status_buffer_pool_pages
  • mysql_global_status_bytes_received
  • mysql_global_status_bytes_sent
  • mysql_global_status_commands_total
  • mysql_global_status_connection_errors_total
  • mysql_global_status_innodb_buffer_pool_read_requests
  • mysql_global_status_innodb_buffer_pool_reads
  • mysql_global_status_innodb_log_waits
  • mysql_global_status_innodb_mem_adaptive_hash
  • mysql_global_status_innodb_mem_dictionary
  • mysql_global_status_innodb_page_size
  • mysql_global_status_questions
  • mysql_global_status_select_full_join
  • mysql_global_status_select_full_range_join
  • mysql_global_status_select_range_check
  • mysql_global_status_select_scan
  • mysql_global_status_slow_queries
  • mysql_global_status_sort_merge_passes
  • mysql_global_status_sort_range
  • mysql_global_status_sort_rows
  • mysql_global_status_sort_scan
  • mysql_global_status_table_locks_immediate
  • mysql_global_status_table_locks_waited
  • mysql_global_status_table_open_cache_hits
  • mysql_global_status_table_open_cache_misses
  • mysql_global_status_threads_cached
  • mysql_global_status_threads_connected
  • mysql_global_status_threads_created
  • mysql_global_status_threads_running
  • mysql_global_status_uptime
  • mysql_global_variables_innodb_additional_mem_pool_size
  • mysql_global_variables_innodb_log_buffer_size
  • mysql_global_variables_innodb_open_files
  • mysql_global_variables_key_buffer_size
  • mysql_global_variables_max_connections
  • mysql_global_variables_open_files_limit
  • mysql_global_variables_query_cache_size
  • mysql_global_variables_thread_cache_size
  • mysql_global_variables_tokudb_cache_size
  • mysql_slave_status_master_server_id
  • mysql_slave_status_seconds_behind_master
  • mysql_slave_status_slave_io_running
  • mysql_slave_status_slave_sql_running
  • mysql_slave_status_sql_delay
  • mysql_up

2.30 - NGINX


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Nginx] No Intances Up | No Nginx instances Up | Prometheus

List of Dashboards:

  • Nginx

List of Metrics:

  • net.bytes.in
  • net.bytes.out
  • net.http.error.count
  • net.http.request.count
  • net.http.request.time
  • nginx_connections_accepted
  • nginx_connections_active
  • nginx_connections_handled
  • nginx_connections_reading
  • nginx_connections_waiting
  • nginx_connections_writing
  • nginx_up

2.31 - NGINX Ingress


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Nginx-Ingress] High Http 4xx Error Rate | Too many HTTP requests with status 4xx (> 5%) | Prometheus
[Nginx-Ingress] High Http 5xx Error Rate | Too many HTTP requests with status 5xx (> 5%) | Prometheus
[Nginx-Ingress] High Latency | Nginx p99 latency is higher than 10 seconds | Prometheus
[Nginx-Ingress] Ingress Certificate Expiry | Nginx Ingress Certificate will expire in less than 14 days | Prometheus

List of Dashboards:

  • Nginx Ingress

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • nginx_ingress_controller_config_last_reload_successful
  • nginx_ingress_controller_config_last_reload_successful_timestamp_seconds
  • nginx_ingress_controller_ingress_upstream_latency_seconds_count
  • nginx_ingress_controller_ingress_upstream_latency_seconds_sum
  • nginx_ingress_controller_nginx_process_connections
  • nginx_ingress_controller_nginx_process_cpu_seconds_total
  • nginx_ingress_controller_nginx_process_resident_memory_bytes
  • nginx_ingress_controller_request_duration_seconds_bucket
  • nginx_ingress_controller_request_duration_seconds_count
  • nginx_ingress_controller_request_duration_seconds_sum
  • nginx_ingress_controller_request_size_sum
  • nginx_ingress_controller_requests
  • nginx_ingress_controller_response_duration_seconds_count
  • nginx_ingress_controller_response_duration_seconds_sum
  • nginx_ingress_controller_response_size_sum
  • nginx_ingress_controller_ssl_expire_time_seconds
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds

2.32 - NTP


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Ntp] Drift is too high | Drift is too high | Prometheus

List of Dashboards:

  • NTP

List of Metrics:

  • ntp_drift_seconds

2.33 - OPA


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Opa gatekeeper] Too much time since the last audit | It has been more than 120 seconds since the last audit | Prometheus
[Opa gatekeeper] Spike of violations | There were more than 30 violations | Prometheus

List of Dashboards:

  • OPA Gatekeeper

List of Metrics:

  • gatekeeper_audit_duration_seconds_bucket
  • gatekeeper_audit_last_run_time
  • gatekeeper_constraint_template_ingestion_count
  • gatekeeper_constraint_template_ingestion_duration_seconds_bucket
  • gatekeeper_constraint_templates
  • gatekeeper_constraints
  • gatekeeper_request_count
  • gatekeeper_request_duration_seconds_bucket
  • gatekeeper_request_duration_seconds_count
  • gatekeeper_violations

2.34 - OpenShift API-Server


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[OpenShift API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus
[OpenShift API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus
[OpenShift API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus
[OpenShift API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus
[OpenShift API Server] High 4xx RequestError Rate | API-Server High 4xx Request Error Rate | Prometheus
[OpenShift API Server] High 5xx RequestError Rate | API-Server High 5xx Request Error Rate | Prometheus
[OpenShift API Server] High Request Latency | API-Server High Request Latency | Prometheus

List of Dashboards:

  • OpenShift v4 API Server

List of Metrics:

  • apiserver_admission_controller_admission_duration_seconds_count
  • apiserver_admission_controller_admission_duration_seconds_sum
  • apiserver_admission_webhook_admission_duration_seconds_count
  • apiserver_admission_webhook_admission_duration_seconds_sum
  • apiserver_client_certificate_expiration_seconds_bucket
  • apiserver_client_certificate_expiration_seconds_count
  • apiserver_request_duration_seconds_count
  • apiserver_request_duration_seconds_sum
  • apiserver_request_total
  • apiserver_requested_deprecated_apis

How to monitor OpenShift API Server with Sysdig agent

No further installation is needed, since OpenShift 4.X comes with both Prometheus and the API Server ready to use. OpenShift API Server metrics are exposed through the /federate endpoint.

Learning how to monitor the Kubernetes API server is of vital importance when running Kubernetes in production. Monitoring kube-apiserver lets you detect and troubleshoot latency and errors, and validate that the service performs as expected.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift API Server.

API Server deprecated APIs

To check whether deprecated API versions are being used, use the following query:

sum by (kube_cluster_name, resource, removed_release,version)(apiserver_requested_deprecated_apis)

Certificate expiration

Certificates are used to authenticate to the API server. You can use the following query to check whether a certificate expires within the next week:

apiserver_client_certificate_expiration_seconds_count > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 7*24*60*60

API Server Latency

Check for latency spikes in the last 10 minutes. This is typically a sign of overload in the API server: the cluster probably has a lot of load and the API server needs to be scaled out.

sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_sum{verb!="WATCH"}[10m]))/sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_count{verb!="WATCH"}[10m]))

Request Error Rate

A high request error rate means that the API is responding with 5xx errors; check the CPU and memory of your api-server pods.

sum by(kube_cluster_name)(rate(apiserver_request_total{code=~"5..",kube_cluster_name=~$cluster}[5m])) / sum by(kube_cluster_name)(rate(apiserver_request_total{kube_cluster_name=~$cluster}[5m])) > 0.05
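The same expression can be adapted, as a sketch, for the 4xx error rate covered by the High 4xx RequestError Rate alert by swapping the status code matcher:

sum by(kube_cluster_name)(rate(apiserver_request_total{code=~"4..",kube_cluster_name=~$cluster}[5m])) / sum by(kube_cluster_name)(rate(apiserver_request_total{kube_cluster_name=~$cluster}[5m])) > 0.05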

2.35 - OpenShift CoreDNS


This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[OpenShiftCoreDNS] Error High | High Request Duration | Prometheus
[OpenShiftCoreDNS] Latency High | Latency High | Prometheus

List of Metrics:

  • coredns_cache_hits_total
  • coredns_cache_misses_total
  • coredns_dns_request_duration_seconds_bucket
  • coredns_dns_request_size_bytes_bucket
  • coredns_dns_requests_total
  • coredns_dns_response_size_bytes_bucket
  • coredns_dns_responses_total
  • coredns_forward_request_duration_seconds_bucket
  • coredns_panics_total
  • coredns_plugin_enabled
  • go_goroutines
  • process_cpu_seconds_total
  • process_resident_memory_bytes

How to monitor OpenShift CoreDNS with Sysdig agent

No further installation is needed, since OpenShift 4.X comes with both Prometheus and CoreDNS ready to use. OpenShift CoreDNS metrics are exposed on SSL port 9154.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.

CoreDNS panics

Number of panics

Let’s check the number of CoreDNS panics. Check the CoreDNS pods’ logs if you see this number growing.

sum(coredns_panics_total)
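Since the interesting signal is growth rather than the absolute count, a variant of the same query (a sketch using the standard rate function) shows new panics over the selected interval:

sum(rate(coredns_panics_total[$__interval])) > 0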

DNS Requests

by type

To break down DNS requests by type, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval])) by (type,kube_cluster_name,kube_pod_name))

by protocol

To break down DNS requests by protocol, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval]) ) by (proto,kube_cluster_name,kube_pod_name))

by zone

To break down DNS requests by zone, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval]) ) by (zone,kube_cluster_name,kube_pod_name))

by Latency

This metric is important for detecting any degradation in the service. The following query returns the 99th percentile of the request duration, which you can compare against the average.

histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by(server, zone, le))
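To compare that 99th percentile against the average, a sketch like the following can be used, assuming the _sum and _count series of the same histogram are also available:

sum(rate(coredns_dns_request_duration_seconds_sum[5m])) by (server, zone) / sum(rate(coredns_dns_request_duration_seconds_count[5m])) by (server, zone)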

Error Rate

Watch this metric carefully, as you can filter depending on the status code (200, 404, 400 or 500).

sum by (server, status)(coredns_dns_https_responses_total)

Cache

Cache hit

To check the cache hit rate use the following query:

sum(rate(coredns_cache_hits_total[$__interval])) by (type,kube_cluster_name,kube_pod_name)

Cache miss

To check the cache miss rate use the following query:

sum(rate(coredns_cache_misses_total[$__interval])) by(server,kube_cluster_name,kube_pod_name)
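Combining both metrics gives a rough cache hit ratio; this is a sketch, and the grouping labels may need to be adjusted to your environment:

sum(rate(coredns_cache_hits_total[$__interval])) by (kube_cluster_name,kube_pod_name) / (sum(rate(coredns_cache_hits_total[$__interval])) by (kube_cluster_name,kube_pod_name) + sum(rate(coredns_cache_misses_total[$__interval])) by (kube_cluster_name,kube_pod_name))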

2.36 - OpenShift Etcd


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members | Prometheus
[OpenShiftEtcd] Etcd No Leader | Member has no leader. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests | High number of failed grpc requests | Prometheus
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance | Prometheus
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high | Prometheus
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high | Prometheus
[OpenShiftEtcd] Etcd HighNumber Of Failed HTTP Requests | High number of failed http requests | Prometheus
[OpenShiftEtcd] Etcd HTTP Requests Slow | HTTP requests are slow | Prometheus

List of Metrics:

  • etcd_debugging_mvcc_db_total_size_in_bytes
  • etcd_disk_backend_commit_duration_seconds_bucket
  • etcd_disk_wal_fsync_duration_seconds_bucket
  • etcd_grpc_proxy_cache_hits_total
  • etcd_grpc_proxy_cache_misses_total
  • etcd_http_failed_total
  • etcd_http_received_total
  • etcd_http_successful_duration_seconds_bucket
  • etcd_mvcc_db_total_size_in_bytes
  • etcd_network_client_grpc_received_bytes_total
  • etcd_network_client_grpc_sent_bytes_total
  • etcd_network_peer_received_bytes_total
  • etcd_network_peer_received_failures_total
  • etcd_network_peer_round_trip_time_seconds_bucket
  • etcd_network_peer_sent_bytes_total
  • etcd_network_peer_sent_failures_total
  • etcd_server_has_leader
  • etcd_server_id
  • etcd_server_leader_changes_seen_total
  • etcd_server_proposals_applied_total
  • etcd_server_proposals_committed_total
  • etcd_server_proposals_failed_total
  • etcd_server_proposals_pending
  • go_goroutines
  • grpc_server_handled_total
  • grpc_server_handling_seconds_bucket
  • grpc_server_started_total
  • process_max_fds
  • process_open_fds
  • sysdig_container_cpu_cores_used
  • sysdig_container_memory_used_bytes

How to monitor OpenShift Etcd with Sysdig agent

No further installation is needed, since OpenShift 4.X comes with both Prometheus and Etcd ready to use. OpenShift Etcd metrics are exposed using /federate endpoint.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Etcd.

Etcd Consensus & Leader

Problems in the leader and consensus of the etcd cluster can cause outages in the cluster.

Etcd leader

If a member does not have a leader, it is totally unavailable. If all the members in the cluster do not have any leader, the entire cluster is totally unavailable.

Check for a leader using this query; if the result is 1, etcd has a leader:

count(etcd_server_id) % 2
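As an alternative sketch, the etcd_server_has_leader metric listed above reports 1 for members that currently see a leader, so members without a leader can be spotted with:

min by (kube_cluster_name, kube_pod_name)(etcd_server_has_leader) == 0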

Leader changes

Rapid leadership changes impact the performance of etcd significantly. They can also mean that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.

Check for leader changes in the last hour:

max(increase(etcd_server_leader_changes_seen_total[60m]))

Failed proposals

Check for failed proposals. They are normally related to two issues:

  • Temporary failures related to a leader election
  • Longer downtime caused by a loss of quorum in the cluster
max(rate(etcd_server_proposals_failed_total[60m]))

Pending proposals

Rising pending proposals suggest there is a high client load or that the member cannot commit proposals:

sum(etcd_server_proposals_pending)

Total number of consensus proposals committed

The etcd server applies every committed proposal asynchronously.

Check that the difference between proposals committed and proposals applied is small (within a few thousand, even under high load):

  • If the difference between them continues to rise, it indicates that the etcd server is overloaded.
  • This might happen when applying expensive queries like heavy range queries or large txn operations.

Proposals committed

sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)

Proposals applied

sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)
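As a sketch of the committed-vs-applied gap described above, the two series can be subtracted directly; under normal conditions the result should stay within a few thousand:

sum by (kube_cluster_name)(etcd_server_proposals_committed_total - etcd_server_proposals_applied_total)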

gRPC

Error rate

Check the gRPC error rate; these errors are most likely related to networking issues.

sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name,kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name,kube_pod_name)

gRPC Traffic

Check for unusual spikes in the traffic; they could be related to networking issues.

rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])

Disk

Disk sync

Check that the fsync and commit latencies are below limits:

  • High disk operation latencies often indicate disk issues.
  • They may cause high request latency or make the cluster unstable.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))

DB Size

Check the DB size in case it keeps increasing. You should defragment etcd to decrease the DB size:

etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}

Networking between peers (only if multi-master)

Errors from / to peer

Check the total number of sent failures from peers

rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])

Check the total number of received failures from peers

rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])

2.37 - OpenShift State Metrics


This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[OpenShift-state-metrics] CPU Resource Request Quota Usage | Resource request CPU usage is over 90% resource quota. | Prometheus
[OpenShift-state-metrics] CPU Resource Limit Quota Usage | Resource limit CPU usage is over 90% resource limit quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Request Quota Usage | Resource request memory usage is over 90% resource quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Limit Quota Usage | Resource limit memory usage is over 90% resource limit quota. | Prometheus
[OpenShift-state-metrics] Routes with issues | A route status is in error and is having issues. | Prometheus
[OpenShift-state-metrics] Buid Processes with issues | A build process is in error or failed status. | Prometheus

List of Dashboards:

  • OpenShift v4 State Metrics

List of Metrics:

  • openshift_build_created_timestamp_seconds
  • openshift_build_status_phase_total
  • openshift_clusterresourcequota_usage
  • openshift_route_status

How to monitor OpenShift State Metrics with Sysdig agent

No further installation is needed, since OKD4 ships with both Prometheus and OpenShift State Metrics (OSM) ready to use.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.

Resource Quotas

Resource Quotas Requests:

% CPU used vs request quota

Let’s get the percentage of CPU used vs. the request quota.

sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0

% Memory used vs request quota

Now, the same but for memory.

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Please note that if your requests are near 100%, you can use the Pod Rightsizing & Workload Capacity Optimization dashboard to fix it, or talk to your cluster administrator to review your resource quota. If your requests are too low, the resource quota could be rightsized.

Resource Quotas Limits:

% CPU used vs limit quota

Let’s get the percentage of CPU used vs. the limit quota.

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="hard"}) > 0

% Memory used vs limit quota

Now, the same but for memory.

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Please note that quota limits are normally higher than quota requests. If your limits are too close to 100%, you might face scheduling issues; the Pod Scheduling Troubleshooting dashboard can help you troubleshoot this scenario. If limit usage is too low, the resource quota could be rightsized.

Routes

List the routes

Let’s get a list of all the routes present in the cluster, aggregated by host and namespace

sum by (route, host, namespace) (openshift_route_info)

Duplicated routes

Now, let’s find our duplicated routes:

sum by (host) (openshift_route_info) > 1

This query will return the duplicated hosts. If you want the full information for the duplicated routes, try this one:

openshift_route_info * on (host) group_left(host_name) label_replace((sum by (host) (openshift_route_info) > 1), "host_name", "$0", "host", ".+")

Why the label_replace? To get the full information, the openshift_route_info metric has to be joined with itself, but because both sides of the join have the same labels, there is no extra label to join by.

Performing a label_replace creates a new label, host_name, with the content of the host label, so the join works.

Routes with issues

Let’s find the routes with issues (that is, routes with a False status):

openshift_route_status{status="False"} > 0

Builds

New builds, by processing time

Let’s list the new builds by how long they have been processing. This query can be useful for detecting slow build processes.

time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)

Builds with errors

Use this query to get builds that are in failed or error state.

sum by (build, buildconfig, kube_namespace_name, kube_cluster_name) (openshift_build_status_phase_total{build_phase=~"failed|error"}) > 0

2.38 - PHP-FPM

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Php-Fpm] Percentage of instances low | Less than 75% of instances are up | Prometheus
[Php-Fpm] Recently reboot | Instances have been recently rebooted | Prometheus
[Php-Fpm] Limit of child proccess exceeded | The number of child processes has been exceeded | Prometheus
[Php-Fpm] Reaching limit of queue process | The buffer of queued requests is reaching its limit | Prometheus
[Php-Fpm] Too slow requests processing | Requests are taking too much time to be processed | Prometheus

List of Dashboards:

  • Php-fpm

List of Metrics:

  • kube_workload_status_desired
  • phpfpm_accepted_connections
  • phpfpm_active_processes
  • phpfpm_idle_processes
  • phpfpm_listen_queue
  • phpfpm_listen_queue_length
  • phpfpm_max_children_reached
  • phpfpm_process_requests
  • phpfpm_slow_requests
  • phpfpm_start_since
  • phpfpm_total_processes
  • phpfpm_up

2.39 - Portworx

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Portworx] No Quorum | Portworx No Quorum. | Prometheus
[Portworx] Node Status Not OK | Portworx Node Status Not OK. | Prometheus
[Portworx] Offline Nodes | Portworx Offline Nodes. | Prometheus
[Portworx] Nodes Storage Full or Down | Portworx Nodes Storage Full or Down. | Prometheus
[Portworx] Offline Storage Nodes | Portworx Offline Storage Nodes. | Prometheus
[Portworx] Unhealthy Node KVDB | Portworx Unhealthy Node KVDB. | Prometheus
[Portworx] Cache read hit rate is low | Portworx Cache read hit rate is low. | Prometheus
[Portworx] Cache write hit rate is low | Portworx Cache write hit rate is low. | Prometheus
[Portworx] High Read Latency In Disk | Portworx High Read Latency In Disk. | Prometheus
[Portworx] High Write Latency In Disk | Portworx High Write Latency In Disk. | Prometheus
[Portworx] Low Cluster Capacity | Portworx Low Cluster Capacity. | Prometheus
[Portworx] Disk Full In 48H | Portworx Disk Full In 48H. | Prometheus
[Portworx] Disk Full In 12H | Portworx Disk Full In 12H. | Prometheus
[Portworx] Pool Status Not Online | Portworx Node Status Not Online. | Prometheus
[Portworx] High Write Latency In Pool | Portworx High Write Latency In Pool. | Prometheus
[Portworx] Pool Full In 48H | Portworx Pool Full In 48H. | Prometheus
[Portworx] Pool Full In 12H | Portworx Pool Full In 12H. | Prometheus
[Portworx] High Write Latency In Volume | Portworx High Write Latency In Volume. | Prometheus
[Portworx] High Read Latency In Volume | Portworx High Read Latency In Volume. | Prometheus
[Portworx] License Expiry | Portworx License Expiry. | Prometheus

List of Dashboards:

  • Portworx Cluster
  • Portworx Volumes

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • px_cluster_disk_available_bytes
  • px_cluster_disk_total_bytes
  • px_cluster_status_nodes_offline
  • px_cluster_status_nodes_online
  • px_cluster_status_nodes_storage_down
  • px_cluster_status_quorum
  • px_cluster_status_size
  • px_cluster_status_storage_nodes_decommissioned
  • px_cluster_status_storage_nodes_offline
  • px_cluster_status_storage_nodes_online
  • px_disk_stats_num_reads_total
  • px_disk_stats_num_writes_total
  • px_disk_stats_read_bytes_total
  • px_disk_stats_read_latency_seconds
  • px_disk_stats_used_bytes
  • px_disk_stats_write_latency_seconds
  • px_disk_stats_written_bytes_total
  • px_kvdb_health_state_node_view
  • px_network_io_received_bytes_total
  • px_network_io_sent_bytes_total
  • px_node_status_license_expiry
  • px_node_status_node_status
  • px_pool_stats_available_bytes
  • px_pool_stats_flushed_bytes_total
  • px_pool_stats_num_flushes_total
  • px_pool_stats_num_writes
  • px_pool_stats_status
  • px_pool_stats_total_bytes
  • px_pool_stats_write_latency_seconds
  • px_pool_stats_written_bytes
  • px_px_cache_read_hits
  • px_px_cache_read_miss
  • px_px_cache_write_hits
  • px_px_cache_write_miss
  • px_volume_attached
  • px_volume_attached_state
  • px_volume_capacity_bytes
  • px_volume_currhalevel
  • px_volume_halevel
  • px_volume_read_bytes_total
  • px_volume_read_latency_seconds
  • px_volume_reads_total
  • px_volume_replication_status
  • px_volume_state
  • px_volume_status
  • px_volume_usage_bytes
  • px_volume_write_latency_seconds
  • px_volume_writes_total
  • px_volume_written_bytes_total

2.40 - PostgreSQL

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[PostgreSQL] Instance Down | PostgreSQL instance is unavailable | Prometheus
[PostgreSQL] Low UpTime | The PostgreSQL instance has an uptime of less than 1 hour | Prometheus
[PostgreSQL] Max Write Buffer Reached | The background writer stops because it reached the maximum write buffers | Prometheus
[PostgreSQL] High WAL Files Archive Error Rate | High error rate in the WAL files archiver | Prometheus
[PostgreSQL] Low Available Connections | Low available network connections | Prometheus
[PostgreSQL] High Response Time | High response time in at least one of the databases | Prometheus
[PostgreSQL] Low Cache Hit Rate | Low cache hit rate | Prometheus
[PostgreSQL] DeadLocks In Database | Deadlocks detected in the database | Prometheus

List of Dashboards:

  • PostgreSQL Instance Health
  • PostgreSQL Database Details

List of Metrics:

  • pg_database_size_bytes
  • pg_locks_count
  • pg_postmaster_start_time_seconds
  • pg_replication_lag
  • pg_settings_max_connections
  • pg_settings_superuser_reserved_connections
  • pg_stat_activity_count
  • pg_stat_activity_max_tx_duration
  • pg_stat_archiver_archived_count
  • pg_stat_archiver_failed_count
  • pg_stat_bgwriter_buffers_alloc
  • pg_stat_bgwriter_buffers_backend
  • pg_stat_bgwriter_buffers_checkpoint
  • pg_stat_bgwriter_buffers_clean
  • pg_stat_bgwriter_checkpoint_sync_time
  • pg_stat_bgwriter_checkpoint_write_time
  • pg_stat_bgwriter_checkpoints_req
  • pg_stat_bgwriter_checkpoints_timed
  • pg_stat_bgwriter_maxwritten_clean
  • pg_stat_database_blk_read_time
  • pg_stat_database_blks_hit
  • pg_stat_database_blks_read
  • pg_stat_database_conflicts_confl_deadlock
  • pg_stat_database_conflicts_confl_lock
  • pg_stat_database_deadlocks
  • pg_stat_database_numbackends
  • pg_stat_database_temp_bytes
  • pg_stat_database_tup_deleted
  • pg_stat_database_tup_fetched
  • pg_stat_database_tup_inserted
  • pg_stat_database_tup_returned
  • pg_stat_database_tup_updated
  • pg_stat_database_xact_commit
  • pg_stat_database_xact_rollback
  • pg_stat_user_tables_idx_scan
  • pg_stat_user_tables_n_tup_hot_upd
  • pg_stat_user_tables_seq_scan
  • pg_up

2.41 - RabbitMQ

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[RabbitMQ] Cluster Operator Unavailable Replicas | There are pods that are either running but not yet available, or pods that have not yet been created | Prometheus
[RabbitMQ] Insufficient Established Erlang Distribution Links | Insufficient established Erlang distribution links | Prometheus
[RabbitMQ] Low Disk Watermark Predicted | The predicted free disk space in 24 hours from now is low | Prometheus
[RabbitMQ] High Connection Churn | There is high connection churn | Prometheus
[RabbitMQ] No MajorityOfNodesReady | A majority of nodes are not ready | Prometheus
[RabbitMQ] Persistent Volume Missing | There is at least one PVC that is not bound | Prometheus
[RabbitMQ] Unroutable Messages | There were unroutable messages within the last 5 minutes in the RabbitMQ cluster | Prometheus
[RabbitMQ] File Descriptors Near Limit | The file descriptors are near the limit | Prometheus
[RabbitMQ] Container Restarts | A RabbitMQ container was restarted over the last 10 minutes | Prometheus
[RabbitMQ] TCP Sockets Near Limit | The TCP sockets are near the limit | Prometheus

List of Dashboards:

  • Rabbitmq Usage
  • Rabbitmq Overview

List of Metrics:

  • erlang_vm_dist_node_state
  • kube_deployment_status_replicas_unavailable
  • kube_kube_pod_name_container_status_restarts_total
  • kube_persistentvolumeclaim_status_phase
  • kube_statefulset_replicas
  • kube_statefulset_status_replicas_ready
  • rabbitmq_build_info
  • rabbitmq_channel_consumers
  • rabbitmq_channel_get_ack_total
  • rabbitmq_channel_get_empty_total
  • rabbitmq_channel_get_total
  • rabbitmq_channel_messages_acked_total
  • rabbitmq_channel_messages_confirmed_total
  • rabbitmq_channel_messages_delivered_ack_total
  • rabbitmq_channel_messages_delivered_total
  • rabbitmq_channel_messages_published_total
  • rabbitmq_channel_messages_redelivered_total
  • rabbitmq_channel_messages_unconfirmed
  • rabbitmq_channel_messages_unroutable_dropped_total
  • rabbitmq_channel_messages_unroutable_returned_total
  • rabbitmq_channels
  • rabbitmq_channels_closed_total
  • rabbitmq_channels_opened_total
  • rabbitmq_connections
  • rabbitmq_connections_closed_total
  • rabbitmq_connections_opened_total
  • rabbitmq_disk_space_available_bytes
  • rabbitmq_disk_space_available_limit_bytes
  • rabbitmq_process_max_fds
  • rabbitmq_process_max_tcp_sockets
  • rabbitmq_process_open_fds
  • rabbitmq_process_open_tcp_sockets
  • rabbitmq_process_resident_memory_bytes
  • rabbitmq_queue_messages_published_total
  • rabbitmq_queue_messages_ready
  • rabbitmq_queue_messages_unacked
  • rabbitmq_queues
  • rabbitmq_queues_created_total
  • rabbitmq_queues_declared_total
  • rabbitmq_queues_deleted_total
  • rabbitmq_resident_memory_limit_bytes

2.42 - Redis

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Redis] Low UpTime | Uptime of less than 1 hour in a redis instance | Prometheus
[Redis] High Memory Usage | High memory usage | Prometheus
[Redis] High Clients Usage | High client connections usage | Prometheus
[Redis] High Response Time | Response time over 250ms | Prometheus
[Redis] High Fragmentation Ratio | High fragmentation ratio | Prometheus
[Redis] High Keys Eviction Ratio | High keys eviction ratio | Prometheus
[Redis] Recurrent Rejected Connections | Recurrent rejected connections | Prometheus
[Redis] Low Hit Ratio | Low keyspace hit ratio | Prometheus

List of Dashboards:

  • Redis

List of Metrics:

  • redis_blocked_clients
  • redis_commands_duration_seconds_total
  • redis_commands_processed_total
  • redis_commands_total
  • redis_config_maxclients
  • redis_connected_clients
  • redis_connected_slaves
  • redis_connections_received_total
  • redis_cpu_sys_children_seconds_total
  • redis_cpu_sys_seconds_total
  • redis_cpu_user_children_seconds_total
  • redis_cpu_user_seconds_total
  • redis_db_avg_ttl_seconds
  • redis_db_keys
  • redis_evicted_keys_total
  • redis_expired_keys_total
  • redis_keyspace_hits_total
  • redis_keyspace_misses_total
  • redis_mem_fragmentation_ratio
  • redis_memory_max_bytes
  • redis_memory_used_bytes
  • redis_memory_used_dataset_bytes
  • redis_memory_used_lua_bytes
  • redis_memory_used_overhead_bytes
  • redis_memory_used_scripts_bytes
  • redis_net_input_bytes_total
  • redis_net_output_bytes_total
  • redis_pubsub_channels
  • redis_pubsub_patterns
  • redis_rdb_changes_since_last_save
  • redis_rdb_last_save_timestamp_seconds
  • redis_rejected_connections_total
  • redis_slowlog_length
  • redis_uptime_in_seconds

2.43 - Sysdig Admission Controller

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Sysdig Admission Controller] No K8s Audit Events Received | The Admission Controller is not receiving Kubernetes Audit events | Prometheus
[Sysdig Admission Controller] K8s Audit Events Throttling | Kubernetes Audit events are being throttled | Prometheus
[Sysdig Admission Controller] Scanning Events Throttling | Scanning events are being throttled | Prometheus
[Sysdig Admission Controller] Inline Scanning Throttling | The inline scanning queue has not been empty for a long time | Prometheus
[Sysdig Admission Controller] High Error Rate In Scan Status From Backend | High error rate in scan status from the backend | Prometheus
[Sysdig Admission Controller] High Error Rate In Scan Report From Backend | High error rate in scan report from the backend | Prometheus
[Sysdig Admission Controller] High Error Rate In Image Scan | High error rate in image scan | Prometheus

List of Dashboards:

  • Sysdig Admission Controller

List of Metrics:

  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • k8s_audit_ac_alerts_total
  • k8s_audit_ac_events_processed_total
  • k8s_audit_ac_events_received_total
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds
  • queue_length
  • scan_report_cache_hits
  • scan_report_cache_misses
  • scan_status_cache_hits
  • scan_status_cache_misses
  • scanner_scan_errors
  • scanner_scan_report_error_from_backend_count
  • scanner_scan_report_retrieved_from_backend_count
  • scanner_scan_requests_already_queued
  • scanner_scan_requests_error
  • scanner_scan_requests_queued
  • scanner_scan_status_error_from_backend_count
  • scanner_scan_status_retrieved_from_backend_count
  • scanner_scan_success
  • scanning_ac_admission_responses_total
  • scanning_ac_containers_processed_total
  • scanning_ac_http_scanning_handler_requests_total

3 - Custom Integrations

  • Prometheus Metrics

    Describes how Sysdig agent enables automatically collecting metrics from services that expose native Prometheus metrics as well as from applications with Prometheus exporters, how to set up your environment, and scrape Prometheus metrics seamlessly.

  • Java Management Extension (JMX) Metrics

    Describes how to configure your Java virtual machines so Sysdig Agent can collect JMX metrics using the JMX protocol.

  • StatsD Metrics

    Describes how the Sysdig agent collects custom StatsD metrics with an embedded StatsD server.

  • Node.JS Metrics

    Illustrates how Sysdig is able to monitor node.js applications by linking a library to the node.js codebase.

3.1 - Collect Prometheus Metrics

Sysdig supports collecting, storing, and querying Prometheus native metrics and labels. You can use Sysdig in the same way that you use Prometheus and leverage Prometheus Query Language (PromQL) to create dashboards and alerts. Sysdig is compatible with Prometheus HTTP API to query your monitoring data programmatically using PromQL and extend Sysdig to other platforms like Grafana.

From a metric collection standpoint, a lightweight Prometheus server is directly embedded into the Sysdig agent to facilitate metric collection. It also supports targets, instances, and jobs with filtering and relabeling using Prometheus syntax. You can configure the agent to identify the processes that expose Prometheus metric endpoints on its own host and send them to the Sysdig collector for storing and further processing.

The Prometheus product itself does not necessarily have to be installed for Prometheus metrics collection.

Agent Compatibility

See the Sysdig agent versions and compatibility with Prometheus features:

Sysdig Agent v12.2.0 and Above

The following features are enabled by default:

  • Automatically scrape any Kubernetes pods with the following annotation set: prometheus.io/scrape=true
  • Automatically scrape applications supported by Monitoring Integrations.

For more information, see Set up the Environment.

Sysdig Agent Prior to v12.0.0

Manually enable Prometheus in dragent.yaml file:

  prometheus:
       enabled: true

For more information, see Enable Promscrape V2 on Older Versions of Sysdig Agent.

Learn More

The following topics describe in detail how to set up the environment for service discovery, metrics collection, and further processing.

See the following blog posts for additional context on Prometheus metrics and how such metrics are typically used.

3.1.1 - Set Up the Environment

If you are already leveraging Kubernetes Service Discovery, specifically the approach given in prometheus-kubernetes.yml, you might already have annotations attached to the pods that mark them as eligible for scraping. Such environments can quickly begin scraping the same metrics by using the Sysdig agent in a single step.

If you are not using Kubernetes Service Discovery, follow the instructions given below:

Annotation

Ensure that the Kubernetes pods that contain your Prometheus exporters have been deployed with the following annotations to enable scraping, substituting the listening exporter-TCP-port:

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "exporter-TCP-port"

The configuration above assumes your exporters use the typical endpoint called /metrics. If your exporter uses a different endpoint, specify it by adding the following additional annotation, substituting the exporter-endpoint-name:

prometheus.io/path: "/exporter-endpoint-name"
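
Put together, a pod that exposes metrics on a non-default port and path would carry all three annotations. In this sketch the port and path values are placeholders:

spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9113"
        prometheus.io/path: "/custom-metrics"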

Sample Exporter

Use the Sample Exporter to test your environment. You will quickly see auto-discovered Prometheus metrics being displayed on Sysdig Monitor. You can use this working example as a basis to similarly annotate your own exporters.

3.1.2 - Enable Prometheus Native Service Discovery

Prometheus service discovery is a standard method of finding endpoints to scrape for metrics. You configure prometheus.yaml and custom jobs to prepare for scraping endpoints in the same way you do for native Prometheus.

For metric collection, a lightweight Prometheus server, named promscrape, is directly embedded into the Sysdig agent. Promscrape supports filtering and relabeling targets, instances, and jobs, and identifies them using the custom jobs configured in the prometheus.yaml file. The latest versions of the Sysdig agent (above v12.0.0) by default identify the processes that expose Prometheus metric endpoints on their host and send the metrics to the Sysdig collector for storing and further processing. On older versions of the Sysdig agent, you enable these features by configuring dragent.yaml.

Working with Promscrape

Promscrape is a lightweight Prometheus server that is embedded with the Sysdig agent. Promscrape scrapes metrics from Prometheus endpoints and sends them for storing and processing.

Promscrape has two versions: Promscrape V1 and Promscrape V2.

  • Promscrape V2

    Promscrape itself discovers targets by using the standard Prometheus configuration (native Prometheus service discovery), allowing the use of relabel_configs to find or modify targets. An instance of promscrape runs on every node that is running a Sysdig agent and is intended to collect metrics from local as well as remote targets specified in the prometheus.yaml file. The prometheus.yaml file you create is shared across all such nodes.

    Promscrape V2 is enabled by default on Sysdig agent v12.5.0 and above. On older versions of Sysdig agent, you need to manually enable Promscrape V2, which allows for native Prometheus service discovery, by setting the prom_service_discovery parameter to true in dragent.yaml.

  • Promscrape V1

    Sysdig agent discovers scrape targets through the Sysdig process_filter rules. For more information, see Process Filter.

About Promscrape V2

Supported Features

Promscrape V2 supports the following native Prometheus capabilities:

  • Relabeling: Promscrape V2 supports Prometheus native relabel_config and metric_relabel_configs. Relabel configuration enables the following:

    • Drop unnecessary metrics or unwanted labels from metrics

    • Edit the label format of the target before scraping the labels

  • Sample format: In addition to the regular sample format (metrics name, labels, and metrics reading), Promscrape V2 includes metrics type (counter, gauge, histogram, summary) to every sample sent to the agent.

  • Scraping configuration: Promscrape V2 supports all types of scraping configuration, such as federation, blackbox-exporter, and so on.

  • Label mapping: The metrics can be mapped to their source (pod, process) by using the source labels which in turn map certain Prometheus label names to the known agent tags.

Unsupported Features

  • Promscrape V2 does not support calculated metrics.

  • Promscrape V2 does not support cluster-wide features such as recording rules and alert management.

  • Service discovery configurations in Promscrape V1 (process_filter) and Promscrape V2 (prometheus.yaml) are incompatible and non-translatable.

  • Because an instance of Promscrape V2 runs on every node and all instances share the same prometheus.yaml file, configuring promscrape to scrape remote targets is not recommended: every instance would scrape the same remote targets and you would see metric duplication.

  • Promscrape V2 does not have a cluster view, so it ignores the configuration of recording rules and alerts, which are used in cluster-wide metrics collection. Such Prometheus configurations are therefore not supported.

  • Sysdig uses __HOSTNAME__, which is not a standard Prometheus keyword.

Enable Promscrape V2 on Older Versions of Sysdig Agent

To enable Prometheus native service discovery on agent versions prior to 11.2:

  1. Open dragent.yaml file.

  2. Set the following Prometheus Service Discovery parameter to true:

    prometheus:
      prom_service_discovery: true
    

    If true, promscrape.v2 is used. Otherwise, promscrape.v1 is used to scrape the targets.

  3. Restart the agent.

Create Custom Jobs

Prerequisites

Ensure the following features are enabled:

  • Monitoring Integration
  • Promscrape V2

If you are using Sysdig agent v12.0.0 or above, these features are enabled by default.

Prepare Custom Job

You set up custom jobs in the Prometheus configuration file to identify endpoints that expose Prometheus metrics. Sysdig agent uses these custom jobs to scrape endpoints by using promscrape, the lightweight Prometheus server embedded in it.

Guidelines

  • Ensure that targets are scraped only by the agent running on the same node as the target. You do this by adding the host selection relabeling rules.

  • Use the Sysdig-specific relabeling rules to automatically get the right workload labels applied.

Example Prometheus Configuration File

The prometheus.yaml file comes with a default configuration for scraping the pods running on the local node. This configuration also includes the rules to preserve pod UID and container name labels for further correlation with Kubernetes State Metrics or Sysdig native metrics.

Here is an example prometheus.yaml file that you can use to set up custom jobs.

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'my_pod_job'
  sample_limit: 40000
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Look for pod name starting with "my_pod_prefix" in namespace "my_namespace"
  - action: keep
    source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_name]
    separator: /
    regex: my_namespace/my_pod_prefix.+

    # In those pods try to scrape from port 9876
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d)?
    replacement: $1:9876

    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__

    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

Default Scrape Job

If Monitoring Integration is not enabled for you and you still want to automatically collect metrics from pods with the Prometheus annotations set (prometheus.io/scrape=true), add the following default scrape job to your prometheus.yaml file:

- job_name: 'k8s-pods'
  sample_limit: 40000
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    target_label: __metrics_path__
    regex: (.+)
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__

    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

Default Prometheus Configuration File

Here is the default prometheus.yaml file.

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'k8s-pods'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    target_label: __metrics_path__
    regex: (.+)
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

Understand the Prometheus Settings

Scrape Interval

The default scrape interval is 10 seconds. However, the value can be overridden per scraping job. The scrape interval configured in the prometheus.yaml is independent of the agent configuration.

Promscrape V2 reads prometheus.yaml and initiates scraping jobs.

The metrics from targets are collected per scrape interval for each target and immediately forwarded to the agent. The agent sends the metrics every 10 seconds to the Sysdig collector. Only those metrics that have been received since the last transmission are sent to the collector. If a scraping job has a scrape interval longer than 10 seconds, the agent transmissions might not include all the metrics from that job.
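
For example, a per-job override can be set directly in the scrape configuration; this is a sketch, and the job name and target are hypothetical:

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'slow-exporter'
  # Overrides the global 10s interval for this job only
  scrape_interval: 30s
  static_configs:
  - targets: ['localhost:9200']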

Hostname Selection

__HOSTIPS__ is replaced by the host IP addresses. Selection by the host IP address is preferred because of its reliability.

__HOSTNAME__ is replaced with the actual hostname before promscrape starts scraping the targets. This allows promscrape to ignore targets running on other hosts.
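
As a sketch, a node-name based keep rule could look like the following. This assumes the Kubernetes pod node-name metadata label is the right source for your environment, with __HOSTNAME__ substituted by promscrape before scraping starts:

- action: keep
  source_labels: [__meta_kubernetes_pod_node_name]
  regex: __HOSTNAME__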

Relabeling Configuration

The default Prometheus configuration file contains the following two relabeling configurations:

- action: replace
  source_labels: [__meta_kubernetes_pod_uid]
  target_label: sysdig_k8s_pod_uid
- action: replace
  source_labels: [__meta_kubernetes_pod_container_name]
  target_label: sysdig_k8s_pod_container_name

These rules add two labels, sysdig_k8s_pod_uid and sysdig_k8s_pod_container_name to every metric gathered from the local targets, containing pod ID and container name respectively. These labels will be dropped from the metrics before sending them to the Sysdig collector for further processing.

Configure Prometheus Configuration File Using the Agent Configmap

Here is an example for setting up the prometheus.yaml file using the agent configmap:

apiVersion: v1
data:
  dragent.yaml: |
    new_k8s: true
    k8s_cluster_name: your-cluster-name
    metrics_excess_log: true
    10s_flush_enable: true
    app_checks_enabled: false
    use_promscrape: true
    promscrape_fastproto: true
    prometheus:
      enabled: true
      prom_service_discovery: true
      log_errors: true
      max_metrics: 200000
      max_metrics_per_process: 200000
      max_tags_per_metric: 100
      ingest_raw: true
      ingest_calculated: false
    snaplen: 512
    tags: role:cluster    
  prometheus.yaml: |
    global:
      scrape_interval: 10s
    scrape_configs:
    - job_name: 'haproxy-router'
      basic_auth:
        username: USER
        password: PASSWORD
      tls_config:
        insecure_skip_verify: true
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
        # Trying to ensure we only scrape local targets
        # We need the wildcard at the end because in AWS the node name is the FQDN,
        # whereas in Azure the node name is the base host name
      - action: keep
        source_labels: [__meta_kubernetes_pod_host_ip]
        regex: __HOSTIPS__
      - action: keep
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_pod_name
        separator: '/'
        regex: 'default/router-1-.+'
        # Holding on to pod-id and container name so we can associate the metrics
        # with the container (and cluster hierarchy)
      - action: replace
        source_labels: [__meta_kubernetes_pod_uid]
        target_label: sysdig_k8s_pod_uid
      - action: replace
        source_labels: [__meta_kubernetes_pod_container_name]
        target_label: sysdig_k8s_pod_container_name    

kind: ConfigMap
metadata:
    labels:
      app: sysdig-agent
    name: sysdig-agent
    namespace: sysdig-agent

3.1.3 - Migrating from Promscrape V1 to V2

Promscrape is the lightweight Prometheus server in the Sysdig agent. An updated version of promscrape, named Promscrape V2, is available. This configuration is controlled by the prom_service_discovery parameter in the dragent.yaml file. To use the latest features, such as Service Discovery and Monitoring Integrations, you need to have this option enabled in your environment.

Compare Promscrape V1 and V2

The main difference between V1 and V2 is how scrape targets are determined.

In V1, targets are found through process-filtering rules configured in dragent.yaml, or in dragent.default.yaml if no rules are given in dragent.yaml. The process-filtering rules are applied to all the running processes on the host. Matches are made based on process attributes, such as the process name or the TCP ports being listened on, as well as associated context from Docker or Kubernetes, such as container labels or Kubernetes annotations.
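
For reference, a V1 process-filter rule in dragent.yaml looks roughly like this. This is a sketch that reuses the rule syntax from the conversion table later in this topic; the exact dragent.yaml layout may differ between agent versions, and the annotation key is only an example:

prometheus:
  process_filter:
    - include:
        kubernetes.pod.annotation.prometheus.io/scrape: true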

With Promscrape V2, scrape targets are determined by scrape_configs fields in a prometheus.yaml file (or the prometheus-v2.default.yaml file if no prometheus.yaml exists). Because promscrape is adapted from the open-source Prometheus server, the scrape_config settings are compatible with the normal Prometheus configuration. Here is an example:

global:
  scrape_interval: 10s
scrape_configs:
- job_name: 'my_pod_job'
  sample_limit: 40000
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
    # Look for pod name starting with "my_pod_prefix" in namespace "my_namespace"
  - action: keep
    source_labels: [__meta_kubernetes_namespace,__meta_kubernetes_pod_name,__meta_kubernetes_pod_label]
    separator: /
    regex: my_namespace/my_pod_prefix.+
  - action: keep
    source_labels: [__meta_kubernetes_pod_label_app]
    regex: my_app_metrics

    # In those pods try to scrape from port 9876
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d)?
    replacement: $1:9876

    # Trying to ensure we only scrape local targets
    # __HOSTIPS__ is replaced by promscrape with a regex list of the IP addresses
    # of all the active network interfaces on the host
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__

    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

Migrate Using Default Configuration

The default configuration for Promscrape v1 triggers the scraping based on standard Kubernetes pod annotations and container labels. The default configuration for v2 currently triggers scraping only based on the standard Kubernetes pod annotations leveraging the Prometheus native service discovery.

Example Pod Annotations

Annotation | Value | Description
prometheus.io/scrape (set under spec.template.metadata.annotations) | "true" | Required field.
prometheus.io/port | The port number to scrape | Optional. It will scrape all pod-registered ports if omitted.
prometheus.io/scheme | http or https | The default is http.
prometheus.io/path | The URL path | The default is /metrics.

Example Static Job

- job_name: 'static10'
  static_configs:
    - targets: ['localhost:5010']

Guidelines

  • Users running Kubernetes with Promscrape v1 default rules and triggering scraping based on pod annotations need not take any action to migrate to v2. The migration happens automatically.

  • Users operating non-Kubernetes environments might need to continue using V1 for now, depending on how scraping is triggered. As of today, promscrape.v2 does not support leveraging container or Docker labels to discover Prometheus metrics endpoints. If your environment is one of these, define static jobs with the IP:port to be scraped.

Migrate Using Custom Rules

If you are relying on custom process_filter rules to collect metrics, use standard Prometheus configuration syntax to scrape the endpoints. We recommend one of the following:

  • Adopt the standard approach of adding the standard Prometheus annotations to their pods. For more information, see Migrate Using Default Configuration.
  • Write a Prometheus scrape_config using Kubernetes pod service discovery and use the appropriate pod metadata to trigger the scrapes.

See the examples below for converting your process_filter rules to Prometheus terminology.

process_filter:

- include:
    kubernetes.pod.annotation.sysdig.com/test: true

Prometheus:

- action: keep
  source_labels: [__meta_kubernetes_pod_annotation_sysdig_com_test]
  regex: true

process_filter:

- include:
    kubernetes.pod.label.app: sysdig

Prometheus:

- action: keep
  source_labels: [__meta_kubernetes_pod_label_app]
  regex: 'sysdig'

process_filter:

- include:
    container.label.com.sysdig.test: true

Prometheus: Not supported.

process_filter:

- include:
    process.name: test

Prometheus: Not supported.

process_filter:

- include:
    process.cmdline: sysdig-agent

Prometheus: Not supported.

process_filter:

- include:
    port: 8080

Prometheus:

- action: keep
  source_labels: [__meta_kubernetes_pod_container_port_number]
  regex: '8080'

process_filter:

- include:
    container.image: sysdig-agent

Prometheus: Not supported.

process_filter:

- include:
    container.name: sysdig-agent

Prometheus:

- action: keep
  source_labels: [__meta_kubernetes_pod_container_name]
  regex: 'sysdig-agent'

process_filter:

- include:
    appcheck.match: sysdig

Prometheus: App checks are not compatible with Promscrape V2. See Configure Monitoring Integrations for supported integrations.

Contact Support

If you have any queries related to promscrape migration, contact Sysdig Support.

3.2 - Integrate JMX Metrics from Java Virtual Machines

The Sysdig agent retrieves data from your Java virtual machines using the JMX protocol. The agent is configured to automatically discover active Java virtual machines and poll them for basic JVM metrics, like heap memory and garbage collection, as well as application-specific metrics. Currently, the following applications are supported by default:

  • ActiveMQ
  • Cassandra
  • Elasticsearch
  • HBase
  • Kafka
  • Tomcat
  • Zookeeper

The agent can also be easily configured to extract custom JMX metrics coming from your own Java processes. Metrics extracted are shown in the pre-defined Application views or under the Metrics > JVM and JMX menus.

The module java.management must be loaded for the Sysdig agent to collect both JVM and JMX metrics.

The default JMX metrics configuration is found in the /opt/draios/etc/dragent.default.yaml file. When customizing existing entries, copy the complete application’s bean listing from that defaults yaml file into the user settings file /opt/draios/etc/dragent.yaml. The Sysdig agent will merge configurations of both files.

Java versions 7 - 10 are currently supported by the Sysdig agents.


For Java 11-14 you must be running minimum agent version 10.1.0 and must run the app with the JMX Remote option.

Here is what your dragent.yaml file might look like for a customized entry for the Spark application:

customerid: 07c948-your-key-here-006f3b
tags: local:nyc,service:db3
jmx:
  per_process_beans:
    spark:
      pattern: "spark"
      beans:
        - query: "metrics:name=Spark shell.BlockManager.disk.diskSpaceUsed_MB"
          attributes:
            - name: VALUE
              alias: spark.metric

Include the jmx: and per_process_beans: section headers at the beginning of your application/bean list. For more information on adding parameters to a container agent’s configuration file, see Understanding the Agent Config Files.

Bean Configuration

Basic JVM metrics are pre-defined inside the default_beans: section. This section is defined in the agent’s default settings file and contains beans and attributes that are going to be polled for every Java process, like memory and garbage collector usage:

jmx:
  default_beans:
    - query: "java.lang:type=Memory"
      attributes:
        - HeapMemoryUsage
        - NonHeapMemoryUsage
    - query: "java.lang:type=GarbageCollector,*"
      attributes:
        - name: "CollectionCount"
          type: "counter"
        - name: "CollectionTime"
          type: "counter"

Metrics specific for each application are specified in sections named after the applications. For example, this is the Tomcat section:

per_process_beans:
    tomcat:
      pattern: "catalina"
      beans:
        - query: "Catalina:type=Cache,*"
          attributes:
            - accessCount
            - cacheSize
            - hitsCount
            - . . .

The key name, tomcat in this case, will be displayed as a process name in the Sysdig Monitor user interface instead of just java. The pattern: parameter specifies a string that is used to match a java process name and arguments with this set of JMX metrics. If the process main class full name contains the given text, the process is tagged and the metrics specified in the section will be fetched.

The class names are matched against the process argument list. If you implement JMX metrics in a custom manner that does not expose the class names on the command line, you will need to find a pattern which conveniently matches your java invocation command line.

The beans: section contains the list of beans to be queried, based on JMX patterns. JMX patterns are explained in detail in the Oracle documentation, but in practice the format of the query line is pretty simple: you can specify the full name of the bean, like java.lang:type=Memory, or you can fetch multiple beans in a single line using the wildcard * as in java.lang:type=GarbageCollector,* (note that this is just a wildcard, not a regex).

To get the list of all the beans and attributes that your application exports, you can use JVisualVM, Jmxterm, JConsole or other similar tools. Here is a screenshot from JConsole showing where to find the namespace, bean and attribute (metric) information (JConsole is available when you install the Java Development Kit):

For each query, you have to specify the attributes that you want to retrieve, and a new metric will be created for each of them. For the supported JMX attribute types, all the subattributes will be retrieved.

Attributes may be absolute values or rates. For absolute values, a per-second rate needs to be calculated before sending them; in this case, specify type: counter. The default is rate, which can be omitted, so usually you can simply write the attribute name.

Limits

The total number of JMX metrics polled per host is limited to 500. The maximum number of beans queried per process is limited to 300. If more metrics are needed please contact your sales representative with your use case.

In agents 0.46 and earlier, the limit was 100 beans for each process.

Aliases

JMX beans and attributes can have very long names. To avoid cluttering the interface, aliasing is supported: you can specify an alias in the attribute configuration. For example:

  cassandra:
    pattern: "cassandra"
    beans:
      - query: "org.apache.cassandra.db:type=StorageProxy
        attributes:
          - name: RecentWriteLatencyMicros
            alias: cassandra.write.latency
          - name: RecentReadLatencyMicros
            alias: cassandra.read.latency

This way, the alias is used in Sysdig Monitor instead of the raw bean name. Aliases can also be dynamic, taking data from the bean name, which is useful when you use pattern bean queries. For example, here the bean's name property is used to create different metrics:

      - query: "java.lang:type=GarbageCollector,*"
        attributes:
          - name: CollectionCount
            type: counter
            alias: jvm.gc.{name}.count
          - name: CollectionTime
            type: counter
            alias: jvm.gc.{name}.time

This query will match multiple beans (all garbage collectors), and the metric name will reflect the name of the garbage collector, for example jvm.gc.ConcurrentMarkSweep.count. The general syntax is {<bean_property_key>}; to list all bean properties, you can use a JMX explorer like JVisualVM or Jmxterm.

To use these metrics in PromQL queries, add the prefix jmx_ and replace the dots (.) in the metric name with underscores (_). For example, the metric name jvm.gc.ConcurrentMarkSweep.count becomes jmx_jvm_gc_ConcurrentMarkSweep_count in PromQL.
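
For example, the converted metric can be charted per container. This is a sketch; the container_name grouping label is an assumption about the labels available in your environment:

sum by (container_name) (jmx_jvm_gc_ConcurrentMarkSweep_count)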

Troubleshooting: Why Can’t I See Java (JMX) Metrics?

The Sysdig agent normally auto-discovers Java processes running on your host and enables the JMX extensions for polling them.

JMX Remote

If your Java application is not discovered automatically by the agent, try adding the following parameter on your application’s command line:

 -Dcom.sun.management.jmxremote

For more information, see Oracle’s web page on monitoring using JMX technology.

Java Versions

Java versions 7 - 10 are currently supported by the Sysdig agents.

For Java 11-14 you must be running minimum agent version 10.1.0 and must run the app with the JMX Remote option.

Java-Based Applications and JMX Authentication

For Java-based applications (Cassandra, Elasticsearch, Kafka, Tomcat, Zookeeper, and so on), the Sysdig agent requires the Java Runtime Environment (JRE) to be installed in order to poll for metrics (beans).

The Sysdig agent does not support JMX authentication.

If the Docker-container-based Sysdig agent is installed, the JRE is installed alongside the agent binaries and no further dependencies exist. However, if you are installing the service-based agent (non-container) and you do not see the JVM/JMX metrics reporting, your host may not have the JRE installed, or it may not be installed in the expected location: /usr/bin/java

To confirm if the Sysdig agent is able to find the JRE, restart the agent with service dragent restart and check the agent’s /opt/draios/logs/draios.log file for the two Java detection and location log entries recorded during agent startup.

Example if Java is missing or not found:

2017-09-08 23:19:27.944, Information, java detected: false
2017-09-08 23:19:27.944, Information, java_binary:

Example if Java is found:

2017-09-08 23:19:27.944, Information, java detected: true
2017-09-08 23:19:27.944, Information, java_binary: /usr/bin/java

If Java is not installed, the resolution is to install the Java Runtime Environment. If your host has Java installed but not in the expected location (/usr/bin/java), you can create a symlink from /usr/bin/java to the actual binary, or set the java_home: variable in the Sysdig agent's configuration file /opt/draios/etc/dragent.yaml:

java_home: /usr/my_java_location/

Disabling JMX Polling

If you do not need it or otherwise want to disable JMX metrics reporting, you can add the following two lines to the agent’s user settings configuration file /opt/draios/etc/dragent.yaml:

jmx:
  enabled: false

After editing the file, restart the native Linux agent via service dragent restart or restart the container agent to make the change take effect.

If using our containerized agent, instead of editing the dragent.yaml file, you can add this extra parameter in the docker run command when starting the agent:

-e ADDITIONAL_CONF="jmx:\n  enabled: false\n"

3.3 - Integrate StatsD Metrics

StatsD is an open-source project built by Etsy. Using a StatsD library specific to your application's language, you can easily generate and transmit custom application metrics to a collection server.

The Sysdig agent contains an embedded StatsD server, so your custom metrics can be sent to our collector and relayed to the Sysdig Monitor backend for aggregation. Your application metrics and the rich set of metrics the agent already collects can all be visualized in the same simple and intuitive graphical interface. Configuring alert notifications is also exactly the same.

Installation and Configuration

The StatsD server, embedded in the Sysdig agent beginning with version 0.1.136, is pre-configured and starts by default, so no additional user configuration is necessary. Install the agent directly on a supported distribution, or install the Docker containerized version on your container server, and you're done.

Sending StatsD Metrics

Active Collection

By default, the Sysdig agent’s embedded StatsD collector listens on the standard StatsD port, 8125, on both TCP and UDP. StatsD is a text-based protocol where samples are separated by a newline (\n).

Sending metrics from your application to the collector is as simple as:

echo "hello_statsd:1|c" > /dev/udp/127.0.0.1/8125

The example transmits the counter metric "hello_statsd" with a value of ‘1’ to the Statsd collector listening on UDP port 8125. Here is a second example sending the output of a more complex shell command giving the number of established network connections:

echo "EstablishedConnections:`netstat -a | grep ESTAB | wc -l`|c" > /dev/udp/127.0.0.1/8125

The protocol format is as follows:

METRIC_NAME:METRIC_VALUE|TYPE[|@SAMPLING_RATIO]

Metric names can be any string that does not contain the reserved characters |, #, :, or @. The value is a number whose meaning depends on the metric type. The type can be any of c, ms, g, or s. The sampling ratio is a value between 0 (exclusive) and 1 and is used to handle subsampling. Once sent, metrics are available in the same display menus and subviews as the built-in metrics.

Passive Collection

In infrastructures already containing a third party StatsD collection server, StatsD metrics can be collected “out of band”. A passive collection technique is automatically performed by our agent by intercepting system calls - as is done for all the Sysdig Monitor metrics normally collected. This method does not require changing your current StatsD configuration and is an excellent way to ’test drive’ the Sysdig Monitor application without having to perform any modifications other than agent installation.

The passive mode of collection is especially suitable for containerized environments where simplicity and efficiency are essential. With the containerized version of the Sysdig Monitor agent running on the host, all other container applications can continue to transmit to any currently implemented collector. In the case where no collector exists, container applications can simply be configured to send StatsD metrics to the localhost interface (127.0.0.1) as demonstrated above - no actual StatsD server needs to be listening at that address.

Effectively, each network transmission made from inside the application container, including statsd messages sent to a non existent destination, generates a system call. The Sysdig agent captures these system calls from its own container, where the statsd collector is listening. In practice, the Sysdig agent acts as a transparent proxy between the application and the StatsD collector, even if they are in different containers. The agent correlates which container a system call is coming from, and uses that information to transparently label the StatsD messages.

The above graphic demonstrates the components of the Sysdig agent and where metrics are actively or passively collected. Regardless of the method of collection, the number of StatsD metrics the agent can transmit is limited by your payment plan.

Note 1: When using the passive technique, ICMP port unreachable events may be generated on the host network.

Note 2: Some clients may use IPv6 addressing (::1) for the “localhost” address string. Metrics collection over IPv6 is not supported at this time. If your StatsD metrics are not visible in the Sysdig Monitor interface, please use “127.0.0.1” instead of “localhost” string to force IPv4. Another solution that may be required is adding the JVM option: java.net.preferIPv4Stack=true.

Note 3: When StatsD metrics are not continuously transmitted by your application (once per second as in the case of all agent created metrics), the charts will render a ‘zero’ or null value. Any alert conditions will only look at those Statsd values actually transmitted and ignore the nulls.

Supported Metric Types

Counter

A counter metric is updated with the value sent by the application, sent to the Sysdig Monitor backend, and then reset to zero. You can use it to count, for example, how many calls have been made to an API:

api.login:1|c

You can specify negative values to decrement a counter.
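
For example, this sends a decrement of one to a hypothetical counter (the metric name is only illustrative):

echo "active_jobs:-1|c" > /dev/udp/127.0.0.1/8125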

Gauge

A gauge is a single value that will be sent as is:

table_size:10000|g

These are plotted as received; that is, they are point-in-time metrics. You can achieve relative increments or decrements on a gauge by prepending the value with a + or a -, respectively. As an example, these three samples will cause table_size to be 950:

table_size:1000|g
table_size:-100|g
table_size:+50|g

In Sysdig Monitor, the gauge value is only rendered on the various charts when it is actually transmitted by your application. When not transmitted, a null is plotted on the charts which is not used in any calculations or alerts.

Set

A set is like a counter, but it counts unique elements. For example:

active_users:user1|s
active_users:user2|s
active_users:user1|s

Will cause the value of active_users to be 2.

Timer

Timer StatsD metric types are not supported by default, but you can push them into Sysdig by setting up prometheus/statsd_exporter as described in the statsd exporter documentation.

Metric Labels

Labels are an extension of the StatsD specification offered by Sysdig Monitor to provide more flexibility in the way metrics are grouped, filtered, and visualized. For example:

enqueued_messages#az=eu-west-3,country=italy:10|c

In general, this is the syntax you can use for labeling:

METRIC_NAME#LABEL_NAME=LABEL_VALUE,LABEL_NAME ...

Labels can be simple strings or key/value pairs, separated by an = sign. Simple labels can be used for filtering in the Sysdig Monitor web interface. Key/value labels can be used for both filtering and segmentation.

Label names prefixed with ‘agent.label’ are reserved for Sysdig agent use only and any custom labels starting with that prefix will be ignored.
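
Continuing the shell sketch from the passive collection section (again assuming bash and the default port 8125), the labeled example above can be emitted like this:

# Emit a labeled counter over UDP to the localhost interface
echo -n "enqueued_messages#az=eu-west-3,country=italy:10|c" > /dev/udp/127.0.0.1/8125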

Limits

The number of StatsD metrics the agent can transmit is limited to 1000 for the host and 1000 for all running containers combined. If more metrics are needed please contact your sales representative with your use case.

Collect StatsD Metrics Under Load

The Sysdig agent can reliably receive StatsD metrics from containers, even while the agent is under load. This setting is controlled by the use_forwarder configuration parameter.

The Sysdig agent automatically parses and records StatsD metrics. Historically, the agent parsed the system call stream from the kernel in order to read and record StatsD metrics from containers. For performance reasons, the agent may not be able to collect all StatsD metrics using this mechanism if the load is high. For example, if the StatsD client writes more than 2kB worth of StatsD metrics in a single system call, the agent will truncate the StatsD message, resulting in loss of StatsD metrics.

With the introduction of the togglable use_forwarder option, the agent can collect StatsD metrics even under high load.

This feature is introduced in Sysdig agent v0.90.1. As of agent v10.4.0, the configuration is enabled by default.

statsd:
  use_forwarder: true

To disable, set it to false:

statsd:
  use_forwarder: false

When enabled, rather than use the system call stream for container StatsD messages, the agent listens for UDP datagrams on the configured StatsD port on the localhost within the container’s network namespace. This enables the agent to reliably receive StatsD metrics from containers, even while the agent is under load.

This option introduces a behavior change in the agent, both in the destination address and in port settings.

  • When the option is disabled, the agent reads StatsD metrics that are destined to any remote address.

    When the option is enabled, the agent receives only those metrics that are addressed to the localhost.

  • When the option is disabled, the agent reads the container StatsD messages destined to only port 8125.

    When the option is enabled, the agent uses the configured StatsD port.

StatsD Server Running in a Monitored Container

Using the forwarder is not a valid use case when a StatsD server is running in the container that you are monitoring.

A StatsD server running in a container will already have a process bound to port 8125 or a configured StatsD port, so you can’t use that port to collect the metrics with the forwarder. A 10-second startup delay exists in the detection logic to allow any custom StatsD process to bind to that particular port before the forwarder. This ensures that the forwarder does not interrupt the operation.

Therefore, for this particular use case, you will need to use the traditional method. Disable the forwarder and capture the metrics via the system call stream.

Compatible Clients

Every StatsD-compliant client works with our implementation. Note that we do not support the clients themselves; we support only compliance with the protocol specification.

A full list can be found at the StatsD GitHub page.

Turning Off StatsD Reporting

To disable Sysdig agent’s embedded StatsD server, append the following lines to the /opt/draios/etc/dragent.yaml configuration file in each installed host:

statsd:
  enabled: false

Note that if Sysdig Secure is used, a compliance check is enabled by default and it sends metrics via StatsD. When disabling StatsD, you need to disable the compliance check as well.

security:
  default_compliance_schedule: ""

After modifying the configuration file, you will need to restart the agent with:

service dragent restart

Changing the StatsD Listener Port and Transport Protocol

To modify the port that the agent’s embedded StatsD server listens on, append the following lines to the /opt/draios/etc/dragent.yaml configuration file in each installed host (replace #### with your port):

statsd:
  tcp_port: ####
  udp_port: ####

Characters Allowed For StatsD Metric Names

Use standard ASCII characters. We also suggest using dot-separated (.) namespaces, as we do for all our metrics.

Allowed characters: a-z A-Z 0-9 _ .
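
For example, a counter named with dot-separated namespaces (the name itself is only illustrative) looks like this:

webapp.login.success:1|c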

For more information on adding parameters to a container agent’s configuration file, see /en/docs/installation/sysdig-agent/agent-configuration/understand-the-agent-configuration/.

3.4 - Integrate Node.js Application Metrics

Sysdig is able to monitor node.js applications by linking a library to the node.js code, which then creates a server in the code to export the StatsD metrics.

The example below shows the package.json of a node.js application that exports metrics using the Prometheus protocol:

{
  "name": "node-example",
  "version": "1.0.0",
  "description": "Node example exporting metrics via Prometheus",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "license": "BSD-2-Clause",
  "dependencies": {
    "express": "^4.14.0",
    "gc-stats": "^1.0.0",
    "prom-client": "^6.3.0",
    "prometheus-gc-stats": "^0.3.1"
  }
}

The corresponding index.js is shown below:

// Use express as HTTP middleware
// Feel free to use your own
var express = require('express')
var app = express()

// Initialize Prometheus exporter
const prom = require('prom-client')
const prom_gc = require('prometheus-gc-stats')
prom_gc()

// Sample HTTP route
app.get('/', function (req, res) {
  res.send('Hello World!')
})

// Export Prometheus metrics from /metrics endpoint
app.get('/metrics', function (req, res) {
  res.end(prom.register.metrics());
});

app.listen(3000, function () {
  console.log('Example app listening on port 3000!')
})

To integrate an application:

  1. Add an appcheck in the dockerfile:

    FROM node:latest
    WORKDIR /app
    ADD package.json ./
    RUN npm install
    ENV SYSDIG_AGENT_CONF 'app_checks: [{name: node, check_module: prometheus, pattern: {comm: node}, conf: { url: "http://localhost:{port}/metrics" }}]'
    ADD index.js ./
    ENTRYPOINT [ "node", "index.js" ]
    
  2. Run the application:

    user@host:~$ docker build -t node-example .
    user@host:~$ docker run -d node-example
    

Once the Sysdig agent is deployed, node.js metrics will be automatically retrieved. The image below shows an example of key node.js metrics visible on the Sysdig Monitor UI:

For code and configuration examples, refer to the Github repository.

4 - Cloud Integrations

Cloud integrations for Sysdig Monitor extend its monitoring capabilities to AWS CloudWatch Metric Streams. This capability is in addition to the existing support to collect AWS CloudWatch metrics by using CloudWatch APIs.

Amazon CloudWatch is a monitoring service that collects monitoring and operational data in the form of logs, metrics, and events. AWS Metric Streams allows AWS users to export metrics from key AWS services faster by eliminating the need for custom integrations for each AWS resource. AWS users can send these streams of metric data to different endpoints, such as Sysdig, through an Amazon Kinesis Data Firehose HTTPS endpoint.

Sysdig Monitor collects CloudWatch metrics from various AWS services and custom namespaces to provide comprehensive visibility of your AWS resources, applications, and services running on AWS. Sysdig provides a CloudFormation template to ease configuring an AWS account for metric streaming to Sysdig.

Comparing AWS Metric Streams and CloudWatch APIs

| Metric Streams | CloudWatch APIs |
| --- | --- |
| Monitors all the AWS services and custom namespaces, and collects all the CloudWatch metrics. The only exception is metrics that are made available to CloudWatch with more than a two-hour delay, which cannot be sent through CloudWatch Metric Streams. | Monitors only a limited number of AWS services (ELB, ALB, RedshiftCluster, EBS, DynamoDB, EC2, ElastiCache, EMR, RDS, SQS) and collects only a limited number of metrics. |
| Metrics are streamed to Sysdig with 2-3 minutes of latency. The low-latency metrics are streamed automatically when new AWS resources are created. | CloudWatch APIs provide updates of the CloudWatch metrics every five minutes, which introduces significant latency for alerting and dashboard creation. |
| No limits to the monitored AWS services and number of metrics collected. | The number of monitored AWS services is limited based on the license. |

4.1 - Understand Cloud Metrics UI

The Cloud Accounts UI provides an at-a-glance summary of the AWS accounts connected to your Sysdig Monitor environment. It lists the account types and the status of metrics ingestion, shows namespaces, and lets you launch dashboards corresponding to AWS services to evaluate their health and performance. You can also add or remove AWS accounts from the Cloud Accounts page.

Access Cloud Metrics UI

  1. Log in to Sysdig Monitor as an administrator.

  2. In the management section of the left-hand sidebar, select Integration > Cloud Metrics.


    The Cloud Metrics page is displayed.

View Account Details

You can view the following details:

  • Platform: The cloud platform of the account. Currently, only AWS is supported.
  • Account ID: Your AWS account ID.
  • Type: The method you used to configure the AWS account connection. The supported types are Role Delegation and Access Key.
  • CloudWatch Integration Type: The integration method you have chosen to connect to the AWS account. The supported integrations are CloudWatch API and CloudWatch Metric Streams.
  • Status: The statuses associated with metric stream creation as well as with accounts. For more information, see Connection Status.

For a given account, you can view enabled services and namespaces, as well as a list of dashboards that you can launch to view health and operational details.

  1. On the Cloud Metrics page, click the desired AWS account. A slider appears on the screen, listing the namespaces and associated dashboards.

  2. Click a desired dashboard to open the dashboard page to view the performance and health of the service.

Disable CloudWatch Metric Streams

To stop ingesting AWS CloudWatch Metric Streams into Sysdig, you must stop the stream in the AWS Console. If you do not stop the metric streams from pushing metrics, Sysdig will continue to ingest and store them.

CloudFormation

If metric streaming was set up using Sysdig’s CloudFormation template or your own, delete the stack that you created during setup.

AWS Console

Delete the following:

  1. CloudWatch Metric Streams connected to Sysdig.
  2. The Kinesis Data Firehose delivery stream that forwards metrics to Sysdig.
  3. The backup S3 bucket linked to the Firehose.
  4. The IAM roles associated with the stream and all the resources that were created while setting up the stream.

For information on disabling AWS CloudWatch Streams, see Using Metric Streams.

Connection Status

The statuses associated to metric stream creation in a region are:

  • Configuring: The CloudFormation stack is being created at the moment.
  • Configured: The account credentials are correct but no data has been loaded. This status is applicable only to the API integration type.
  • Reporting metrics: Everything appears to be working as expected. For the stream integration type, the stack is created, the metric stream is running, and no failed-data files are found in the S3 backup storage. For the API integration type, at least one resource has been loaded.
  • Needs attention: Something went wrong. Either the metric stream is stopped, it cannot send data to the endpoint, or the metric stream was deleted from the stack.
  • Error: An error occurred while checking the metric stream status.

The statuses for account linkage are given below. They relate to the background jobs that connect to AWS and either collect metrics (API integrations) or check the stream status (stream integrations).

  • Configured: The initial status, set right after the cloud integration is created and background jobs are executed.
  • Loading: A background refresh job has been scheduled.
  • Done: The background loading job finished successfully for the cloud integration.
  • Error: An error occurred during background refresh job execution.

4.2 - Connect an AWS Account

Sysdig Platform can collect both general metadata and various types of CloudWatch metrics from your AWS environment.

Use one of the following methods to connect an AWS account to Sysdig:

  • By using CloudWatch Metric Streams. You can do this in the following ways:

    • By using the CloudFormation template that Sysdig provides. Sysdig recommends using CloudFormation because it automatically creates all the resources required and it allows for setting up metric streams in multiple AWS regions simultaneously. See Using the CloudFormation Template.
    • By using your own CloudFormation template. See Case 2: Using the AWS Console.
    • By manually entering an AWS access key and secret key, and manually managing/rotating them as needed. See Connecting Manually.
    • By configuring AWS Role Delegation. Role delegation is an alternative to the existing integration methods that use access keys. This method is considered more secure, because Amazon does not recommend sharing developer access keys with third parties. See Connecting Manually.
  • By using AWS CloudWatch APIs. You can do this in the following ways:

    • Using an AWS access key and secret key, and manually managing/rotating them as needed.
    • Using AWS Role Delegation.

After connecting an AWS account, data will become visible in the Sysdig Monitor UI after a 10-15 minute delay.

Access Cloud Accounts

  1. Log in to Sysdig Monitor as an administrator.

  2. In the management section of the left-hand sidebar, select Integration > Cloud Metrics.

    The Cloud Metrics page is displayed. Continue with connecting an AWS account.

Connect an AWS Account

  1. On the Cloud Metrics screen, click Add Account.
  2. Click Start Installation. The New AWS Account screen is displayed.
  3. Select one of the following integration methods.
    • CloudWatch Metric Streams
      • Use CloudFormation Template: Sysdig provides a CloudFormation template you can easily fill in. Select this option to open an AWS console for creating a stack. You will be given a pre-populated template to help you set up a stack to forward the data to Sysdig.
      • Configure Manually: Set up metric streams using the AWS console.
    • CloudWatch API
      • Role delegation
      • Access Key
  4. Complete the installation and click Confirm.

Connect with CloudWatch Metric Streams

You can connect an AWS account using CloudWatch Metric Streams either by using the CloudFormation template provided by Sysdig, or manually setting up all the metric streams resources by yourself.

Using the CloudFormation Template

On Sysdig Monitor UI
  1. On the Cloud Metrics page, click Add Account.
  2. Click Start Installation. The New AWS Account screen is displayed.
  3. Select CloudWatch Metric Streams.
  4. Click Use CloudFormation Template. You are redirected to log in to your AWS account. Continue with On AWS Console.
On AWS Console

Sysdig provides a CloudFormation template to create a stack corresponding to CloudWatch Metric Streams. The metric stream you create feeds data to Sysdig in each region specified in the template, and the role you specify is used to run and monitor the metric stream.

  1. Log in to your AWS account.

  2. Specify the following in the CloudFormation QuickCreate page.

    • Stack Name: The default name is Sysdig-CloudwatchMetricStream. This is the unique name to identify the stack you create for the CloudWatch Metric Streams.
    • API Key: Your Sysdig API key.
    • SysdigSite: The Sysdig Monitor URL associated with your region is auto-populated. Edit if you want to change the Sysdig URL.
    • Regions: The regions where you want to enable metric streaming. Enter them in a comma-separated list.
    • MonitoringRoleName: The default role name is SysdigCloudwatchIntegrationMonitoringRole. You can specify a different role if you wish to.
  3. Click Acknowledge that AWS CloudFormation might create IAM resources with custom names.

  4. Click Create Stack. Expect a 10-15 minute delay for the stack creation to complete.

Connecting Manually

Case 1: Using the Sysdig CloudFormation Template

If you’ve already deployed the Sysdig CloudWatch Streams CloudFormation template and are receiving metric streams, you can manually associate your AWS account to verify the status of your CloudWatch Streams and namespace sources.

On Sysdig Monitor UI
  1. On the New AWS Account screen, select one of the methods:

    • Role Delegation: Specify the following:

      • Account ID: Your AWS account ID.
      • Role: The name you entered for MonitoringRoleName. This role will be used by Sysdig to monitor status of the stream. The Parameter tab on the Stack details page gives you the MonitoringRoleName. For more information on Role Delegation, see Role Delegation.
    • Access Key: Specify the following:

      • Access Key ID: Your AWS access key ID.
      • Secret Access Key: The secret access key associated with your account.

  2. Click Confirm.

Case 2: Using the AWS Console

You can choose to set up CloudWatch Metric Streams manually instead of using the Sysdig CloudFormation template. To do so, perform the following steps for each AWS region.

On AWS Console
  1. Log in to your AWS account.

  2. Create Kinesis Data Firehose Delivery Stream:

    1. Specify the following:

      • Source: Select Direct PUT or Other Sources.
      • Destination: Select HTTP endpoint.
    2. Specify the destination settings:

      • HTTP endpoint URL: Enter https://<your-sysdig-URL>/api/awsmetrics/v1/input.

      Based on your Sysdig URL associated with your region, replace <your-sysdig-URL> with one of the following:

      • app.sysdigcloud.com
      • us2.app.sysdig.com
      • eu1.app.sysdig.com

      For more information on regions, see SaaS Regions and IP Ranges.

      • Access key: Enter your Sysdig Monitor API Token. For more information, see Retrieve the Sysdig API Token.

      • Content encoding: Select Disabled.

      • Retry duration: Enter 60 seconds.

      • HTTP Buffer hints: Specify the following:

        • Buffer size: Enter 5MB.
        • Buffer interval: Enter 60 seconds.
    3. Specify the backup settings:

      • Source record backup in Amazon S3: Select Failed data only and choose an appropriate S3 bucket for backup.
      • HTTP Buffer hints: Specify the following:
        • Buffer size: Enter 5MB.
        • Buffer interval: Enter 60 seconds.
      • S3 compression: Select GZIP.
    4. For advanced settings, select Enable error logging.

    5. Click Create delivery stream.

  3. Create a new CloudWatch Metric Stream:

    1. Specify the following:
      • Metrics to be streamed: Either select all CloudWatch metrics, or choose specific namespaces with Include or Exclude lists.
    2. Specify the Configuration:
      • Choose Select an existing Firehose owned by your account and specify the Kinesis Data Firehose delivery stream created earlier for sending the metrics to Sysdig.
      • Service access to write to Kinesis Data Firehose: Select Create and use a new service role.
      • Change the output format: Select OpenTelemetry 0.7.
    3. Specify a meaningful name for the new metric stream.
    4. Click Create metric stream.
On Sysdig Monitor UI

Log in to Sysdig Monitor, and add a new account by using Role Delegation or Access Key. Ensure that you configure the IAM Policy while setting up CloudWatch Metric Streams.

Using CloudWatch API

  1. On the Cloud Metrics page, click Add Account.

  2. Click Start Installation. The New AWS Account screen is displayed.

  3. On the New AWS Account screen, select one of the methods:

    • Role Delegation: Specify the following:

      • Account ID: Your AWS account ID.
      • Role: The name you entered for MonitoringRoleName. This role will be used by Sysdig to monitor status of the stream. The Parameter tab on the Stack details page gives you the MonitoringRoleName. For more information, see Role Delegation.
    • Access Key: Specify the following:

      • Access Key ID: Your AWS access key ID.
      • Secret Access Key: The secret access key associated with your account.

    For more information, see Integrate AWS Account and CloudWatch Metrics.

  4. Click Confirm.

4.3 - Troubleshoot AWS Account Connections

The potential causes for failing to connect to an AWS account are:

  • AWS credentials are invalid or the role was configured incorrectly.
  • Creating or updating the stack failed.
  • The metric stream is stopped.
  • Sysdig could not process the requests due to an invalid token.

Sysdig displays an error with potential causes when you hover on the status.

To troubleshoot:

  • Ensure that your access key and secret access key are correct. The edit functionality is unavailable, so you will have to delete the entry and create a new one.
  • Log in to the AWS console.
    • Check the status of the stack and troubleshoot it on the AWS console.
    • Find the metric stream created as part of the stack and start it.
  • Ensure that Sysdig services are up and running by visiting Sysdig Infrastructure Status.
  • Contact Sysdig Support to check the logs on Sysdig side.
  • See AWS Documentation.

4.4.1 - AWS MetricsStream ALB

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS ALB] High 4XX Errors From Load Balancer | High 4XX Errors From Load Balancer. | Prometheus |
| [AWS ALB] High 5XX Errors From Load Balancer | High 5XX Errors From Load Balancer. | Prometheus |
| [AWS ALB] High 4XX Errors From Target Group | High 4XX Errors From Target Group. | Prometheus |
| [AWS ALB] High 5XX Errors From Target Group | High 5XX Errors From Target Group. | Prometheus |
| [AWS ALB] Unhealthy Host In TargetGroup | Unhealthy Host In TargetGroup. | Prometheus |
| [AWS ALB] TLS Negotiation Errors | TLS Negotiation Errors. | Prometheus |
| [AWS ALB] Rejected Connections In Load Balancer | Rejected Connections In Load Balancer. | Prometheus |
| [AWS ALB] High Response Time In Target Group | High Response Time In Target Group. | Prometheus |

List of Dashboards:

  • AWS ALB

List of Metrics:

  • aws_application_elb_active_connection_count
  • aws_application_elb_client_tls_negotiation_error_count
  • aws_application_elb_consumed_lc_us
  • aws_application_elb_healthy_host_count
  • aws_application_elb_http_code_elb_3xx_count
  • aws_application_elb_http_code_elb_4xx_count
  • aws_application_elb_http_code_elb_5xx_count
  • aws_application_elb_http_code_target_2xx_count
  • aws_application_elb_http_code_target_3xx_count
  • aws_application_elb_http_code_target_4xx_count
  • aws_application_elb_http_code_target_5xx_count
  • aws_application_elb_processed_bytes
  • aws_application_elb_rejected_connection_count
  • aws_application_elb_request_count
  • aws_application_elb_rule_evaluations
  • aws_application_elb_target_response_time
  • aws_application_elb_un_healthy_host_count

4.4.2 - AWS MetricsStream EBS

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS EBS] Low Volume Performance | Low Volume Performance. Used with Provisioned IOPS SSD volumes only. This alert is not supported with Multi-Attach enabled volumes. | Prometheus |
| [AWS EBS] Zero Volume Burst Balance | Zero Volume Burst Balance. Used with General Purpose SSD (gp2), Throughput Optimized HDD (st1), and Cold HDD (sc1) attached volumes only. | Prometheus |

List of Dashboards:

  • AWS EBS

List of Metrics:

  • aws_ebs_burst_balance
  • aws_ebs_volume_idle_time
  • aws_ebs_volume_queue_length
  • aws_ebs_volume_read_bytes
  • aws_ebs_volume_read_ops
  • aws_ebs_volume_throughput_percentage
  • aws_ebs_volume_total_read_time
  • aws_ebs_volume_total_write_time
  • aws_ebs_volume_write_bytes
  • aws_ebs_volume_write_ops

4.4.3 - AWS MetricsStream ELB

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS ELB] High 4XX Errors From Load Balancer | High 4XX Errors From Load Balancer. | Prometheus |
| [AWS ELB] High 5XX Errors From Load Balancer | High 5XX Errors From Load Balancer. | Prometheus |
| [AWS ELB] High 4XX Errors From Backend | High 4XX Errors From Backend. | Prometheus |
| [AWS ELB] High 5XX Errors From Backend | High 5XX Errors From Backend. | Prometheus |
| [AWS ELB] Unhealthy Host In Load Balancer | Unhealthy Host In Load Balancer. | Prometheus |
| [AWS ELB] Queue Spillover Rejections | Queue Spillover Rejections. | Prometheus |
| [AWS ELB] High Latency In Load Balancer | High Latency In Load Balancer. | Prometheus |

List of Dashboards:

  • AWS ELB

List of Metrics:

  • aws_elb_healthy_host_count
  • aws_elb_http_code_backend_2xx_count
  • aws_elb_http_code_backend_3xx_count
  • aws_elb_http_code_backend_4xx_count
  • aws_elb_http_code_backend_5xx_count
  • aws_elb_http_code_elb_4xx_count
  • aws_elb_http_code_elb_5xx_count
  • aws_elb_latency
  • aws_elb_request_count
  • aws_elb_spillover_count
  • aws_elb_surge_queue_length
  • aws_elb_un_healthy_host_count

4.4.4 - AWS MetricsStream Fargate

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS Fargate] High Cpu Utilization Rate | High Cpu Utilization Rate. | Prometheus |
| [AWS Fargate] High Memory Utilization Rate | High Memory Utilization Rate. | Prometheus |
| [AWS Fargate] Recurring Pending Tasks | Recurring Pending Tasks. | Prometheus |

List of Dashboards:

  • AWS ECS/Fargate

List of Metrics:

  • aws_ecs_container_insights_cpu_reserved
  • aws_ecs_container_insights_cpu_utilized
  • aws_ecs_container_insights_desired_task_count
  • aws_ecs_container_insights_memory_reserved
  • aws_ecs_container_insights_memory_utilized
  • aws_ecs_container_insights_pending_task_count
  • aws_ecs_container_insights_running_task_count
  • aws_ecs_container_insights_storage_read_bytes
  • aws_ecs_container_insights_storage_write_bytes

4.4.5 - AWS MetricsStream Lambda

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS Lambda] High Function Error Rate | High Function Error Rate. | Prometheus |
| [AWS Lambda] Throttling Function | Throttling Function. | Prometheus |
| [AWS Lambda] Destination Delivery Failures | Destination Delivery Failures. | Prometheus |
| [AWS Lambda] Dead-Letter Errors | Dead-Letter Errors. | Prometheus |
| [AWS Lambda] High Iterator Age | High Iterator Age. Only for ’event source mappings’ that read from streams. | Prometheus |

List of Dashboards:

  • AWS Lambda

List of Metrics:

  • aws_lambda_concurrent_executions
  • aws_lambda_dead_letter_errors
  • aws_lambda_destination_delivery_failures
  • aws_lambda_duration
  • aws_lambda_errors
  • aws_lambda_invocations
  • aws_lambda_iterator_age
  • aws_lambda_provisioned_concurrency_executions
  • aws_lambda_provisioned_concurrency_invocations
  • aws_lambda_provisioned_concurrency_spillover_invocations
  • aws_lambda_provisioned_concurrency_utilization_average
  • aws_lambda_throttles
  • aws_lambda_unreserved_concurrent_executions

4.4.6 - AWS MetricsStream RDS

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS RDS] Long CPU Throttling | Long CPU Throttling. | Prometheus |
| [AWS RDS] Low Free Memory | Low Free Memory. | Prometheus |
| [AWS RDS] Low Free Disk | Low Free Disk. | Prometheus |
| [AWS RDS] Disk Full In 48H | Disk Full In 48H. | Prometheus |
| [AWS RDS] Disk Full In 12H | Disk Full In 12H. | Prometheus |
| [AWS RDS] High Read Latency | High Read Latency. | Prometheus |
| [AWS RDS] High Write Latency | High Write Latency. | Prometheus |
| [AWS RDS] High Disk Queue | High Disk Queue. Alert only available for PostgreSQL instances. | Prometheus |

List of Dashboards:

  • AWS RDS

List of Metrics:

  • aws_rds_cpu_utilization
  • aws_rds_database_connections
  • aws_rds_disk_queue_depth
  • aws_rds_free_local_storage
  • aws_rds_freeable_memory
  • aws_rds_network_receive_throughput
  • aws_rds_network_transmit_throughput
  • aws_rds_read_iops
  • aws_rds_read_latency
  • aws_rds_read_throughput
  • aws_rds_swap_usage
  • aws_rds_write_iops
  • aws_rds_write_latency
  • aws_rds_write_throughput

4.4.7 - AWS MetricsStream S3

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS S3] High 4XX Error Rate | High 4XX Error Rate. | Prometheus |
| [AWS S3] High 5XX Error Rate | High 5XX Error Rate. | Prometheus |
| [AWS S3] High First Byte Latency | High First Byte Latency. | Prometheus |

List of Dashboards:

  • AWS S3

List of Metrics:

  • aws_s3_4xx_errors
  • aws_s3_5xx_errors
  • aws_s3_bytes_downloaded
  • aws_s3_bytes_uploaded
  • aws_s3_first_byte_latency
  • aws_s3_total_request_latency

4.4.8 - AWS MetricsStream SQS

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
| --- | --- | --- |
| [AWS SQS] High Number Of Messages In Queue | High Number Of Messages In Queue. | Prometheus |
| [AWS SQS] High Latency In Queue | High Latency In Queue. | Prometheus |
| [AWS SQS] Recurring Empty Receives | Recurring Empty Receives. | Prometheus |
| [AWS SQS] Message Received In Queue | Message Received In Queue. Useful for example in ‘Dead-Letter’ queues. | Prometheus |

List of Dashboards:

  • AWS SQS

List of Metrics:

  • aws_sqs_approximate_age_of_oldest_message
  • aws_sqs_approximate_number_of_messages_delayed
  • aws_sqs_approximate_number_of_messages_not_visible
  • aws_sqs_approximate_number_of_messages_visible
  • aws_sqs_number_of_empty_receives
  • aws_sqs_number_of_messages_deleted
  • aws_sqs_number_of_messages_received
  • aws_sqs_number_of_messages_sent
  • aws_sqs_sent_message_size

5 - Advanced Configuration

5.1 - Configure PVC Metrics

You can use dashboards and alerts for PersistentVolumeClaim (PVC) metrics in the regions where PVC metrics are supported.

To see data on PVC dashboards and alerts, ensure that the prerequisites are met.

Prerequisites

Enable KSM Metrics

For the Sysdig agent configuration requirements, see Enable Kube State Metrics.

Apply Rules

If you are upgrading the Sysdig agent, either download sysdig-agent-clusterrole.yaml or apply the following rule to the ClusterRole associated with your Sysdig agent.

rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/proxy
  # get access is required to scrape the kubelet endpoints
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get

The rules are required to scrape the kubelet containers. With this rule enabled, you will also have the kubelet metrics and can access kubelet templates for both dashboards and alerts.

This configuration change is only required for agent upgrades because the sysdig-agent-clusterrole.yaml associated with fresh installations will already have this configuration. See Steps for Kubernetes (Vanilla) for information on Sysdig agent installation.

Sysdig Agent v12.5.0 and Above

  • Upgrade Sysdig agent to v12.2.0 or above

  • If you are an existing Sysdig user, include the following configuration in the dragent.yaml file:

      k8s_extra_resources:
        include:
          - services
          - resourcequotas
          - persistentvolumes
          - persistentvolumeclaims
          - horizontalpodautoscalers
    

Sysdig Agent v12.3.x and v12.4.x

PVC metrics are enabled by default for Sysdig agent v12.3.0 and v12.4.0. To disable collecting PVC metrics, add the following to the dragent.yaml file:

k8s_extra_resources:
  include:
    - services
    - resourcequotas

Sysdig Agent Prior to v12.3.0

Contact your Sysdig representative or Sysdig Support for technical assistance with enabling PVC metrics in your environment.
  • Upgrade Sysdig agent to v12.2.0 or above

  • If you are an existing Sysdig user, include the following configuration in the dragent.yaml file:

    k8s_extra_resources:
      include:
        - persistentvolumes
        - persistentvolumeclaims
        - storageclasses
    

Access PVC Dashboard from the Library

  1. Log in to Sysdig Monitor and click Dashboards.

  2. On the Dashboards slider, scroll down to locate Dashboard Library (formerly: Dashboard Templates).

  3. Click Kubernetes to expand the Kubernetes section.

  4. Select the PVC and Storage dashboard.

Access PVC Alert Template

  1. Log in to Sysdig Monitor and click Alerts.

  2. On the Alerts page, click Library.

  3. On the Library page, click All Templates.

  4. Select the Kubernetes PVC alert templates.

PVC Metrics

| Metric | Metric Type | Labels | Metric Source |
| --- | --- | --- | --- |
| kube_persistentvolume_status_phase | Gauge | persistentvolume, phase | Kubernetes API |
| kube_persistentvolume_claim_ref | Gauge | persistentvolume, name | Kubernetes API |
| kube_storageclass_created | Gauge | storageclass | Kubernetes API |
| kube_storageclass_info | Gauge | storageclass, provisioner, reclaim_policy, volume_binding_mode | Kubernetes API |
| kube_storageclass_labels | Gauge | storageclass | Kubernetes API |
| kube_pod_spec_volumes_persistentvolumeclaims_info | Gauge | namespace, pod, uid, volume, persistentvolumeclaim | Kubernetes API |
| kube_pod_spec_volumes_persistentvolumeclaims_readonly | Gauge | namespace, pod, uid, volume, persistentvolumeclaim | Kubernetes API |
| kube_persistentvolumeclaim_status_condition | Gauge | namespace, persistentvolumeclaim, type, status | Kubernetes API |
| kube_persistentvolumeclaim_status_phase | Gauge | namespace, persistentvolumeclaim, phase | Kubernetes API |
| kube_persistentvolumeclaim_access_mode | Gauge | namespace, persistentvolumeclaim, access_mode | Kubernetes API |
| kubelet_volume_stats_inodes | Gauge | namespace, persistentvolumeclaim | Kubelet |
| kubelet_volume_stats_inodes_free | Gauge | namespace, persistentvolumeclaim | Kubelet |
| kubelet_volume_stats_inodes_used | Gauge | namespace, persistentvolumeclaim | Kubelet |
| kubelet_volume_stats_used_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
| kubelet_volume_stats_available_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
| kubelet_volume_stats_capacity_bytes | Gauge | namespace, persistentvolumeclaim | Kubelet |
| storage_operation_duration_seconds_bucket | Gauge | operation_name, volume_plugin, le | Kubelet |
| storage_operation_duration_seconds_sum | Gauge | operation_name, volume_plugin | Kubelet |
| storage_operation_duration_seconds_count | Gauge | operation_name, volume_plugin | Kubelet |
| storage_operation_errors_total | Gauge | operation_name, volume_plugin | Kubelet |
| storage_operation_status_count | Gauge | operation_name, status, volume_plugin | Kubelet |
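
As an illustration (this query is not part of the bundled dashboards), the kubelet volume metrics above can be combined in PromQL to chart the percentage of space used in each PVC:

100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes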

5.2 - Integrate Keda for HPA

Sysdig supports Keda to deploy Kubernetes Horizontal Pod Autoscaler (HPA) using custom metrics exposed by Sysdig Monitor. You can do this by configuring Prometheus queries and endpoints in Keda. Keda uses that information to query the Prometheus endpoint and create the HPA. The HPA will take care of scaling pods based on your usage of resources, such as CPU and memory.

This option replaces Sysdig’s existing custom metric server for HPA.

Install Keda

Requirements:

  • Helm
  • Keda v2.3 or above (Endpoint authentication)

Install Keda with helm by running the following command:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace \
  --set image.metricsApiServer.tag=2.4.0 --set image.keda.tag=2.4.0 \
  --set prometheus.metricServer.enabled=true

Create Authentication for Sysdig Prometheus Endpoint

Do the following in each namespace where you want to use Keda. This example uses the namespace, keda.

  1. Create the secret with the API key as the bearer token:

    kubectl create secret generic keda-prom-secret --from-literal=bearerToken=<API_KEY> -n keda
    
  2. Create the triggerAuthentication.yaml file:

    apiVersion: keda.sh/v1alpha1
    kind: TriggerAuthentication
    metadata:
      name: keda-prom-creds
    spec:
      secretTargetRef:
      - parameter: bearerToken
        name: keda-prom-secret
        key: bearerToken
    
  3. Apply the configurations in the triggerAuthentication.yaml file :

    kubectl apply -n keda -f triggerAuthentication.yaml
    

Configure HPA

You can configure HPA for a Deployment, StatefulSet, or CRD. Keda uses a CRD to configure the HPA. You create a ScaledObject and it automatically sets up the metrics server and the HPA object under the hood.

  1. To create a ScaledObject, specify the following:

    • spec.scaleTargetRef.name: The unique name of the Deployment.
    • spec.scaleTargetRef.kind: The kind of object to be scaled: Deployment, StatefulSet, or CustomResource.
    • spec.minReplicaCount: The minimum number of replicas that the Deployment should have.
    • spec.maxReplicaCount: The maximum number of replicas that the Deployment should have.
  2. In the ScaledObject, use a trigger of type prometheus to get the metrics from your Sysdig Monitor account. To do so, specify the following:

    • triggers.metadata.serverAddress: The address of the Prometheus endpoint. It is the Sysdig Monitor URL with prefix /prometheus. For example: https://app.sysdigcloud.com/prometheus.
    • triggers.metadata.query: The PromQL query that will return a value. Ensure that the query returns a vector/scalar single element response.
    • triggers.metadata.metricName: The name of the metric that will be created in the kubernetes API endpoint, /apis/external.metrics.k8s.io/v1beta1.
    • triggers.metadata.threshold: The threshold that will be used to scale the Deployment.
  3. Ensure that you add the authModes and authenticationRef to the trigger.

  4. Check the ScaledObject. Here is an example of a ScaledObject:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: keda-web
    spec:
      scaleTargetRef:
        kind: Deployment
        name: web
      minReplicaCount: 1
      maxReplicaCount: 4
      triggers:
      - type: prometheus
        metadata:
          serverAddress: https://app.sysdigcloud.com/prometheus
          metricName: sysdig_container_cpu_cores_used
          query: sum(sysdig_container_cpu_cores_used{kube_cluster_name="my-cluster-name", kube_namespace_name="keda", kube_workload_name = "web"}) * 10
          threshold: "5"
          authModes: "bearer"
        authenticationRef:
          name: keda-prom-creds
    

The HPA will divide the value of the metric by the number of current replicas, therefore, try to avoid using the AVERAGE aggregation. Use SUM instead to aggregate the metrics by workload. For example, if the sum of all the values of all the pods is 100 and there are 5 replicas, the HPA will calculate that the value of the metric is 20.
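
After applying the ScaledObject, you can confirm that Keda created the underlying HPA object and inspect its current targets (a quick check using the keda namespace from the examples above):

kubectl get scaledobject,hpa -n keda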

Advanced Configurations

The ScaledObject permits additional options:

spec.pollingInterval:

Specify the interval to check each trigger on. By default KEDA will check each trigger source on every ScaledObject every 30 seconds.

Warning: setting this to a low value will cause Keda to make frequent API calls to the Prometheus endpoint. The minimum value for pollingInterval is 10 seconds. The scraping frequency of the Sysdig Agent is 10 seconds.

spec.cooldownPeriod:

The wait period between the last active trigger reported and scaling the resource back to 0. By default the value is 5 minutes (300 seconds).

spec.idleReplicaCount:

Enabling this property allows KEDA to scale the resource down to the specified number of replicas. If some activity exists on the target triggers, KEDA will scale the target resource immediately to the value of minReplicaCount and scaling is handed over to HPA. When there is no activity, the target resource is again scaled down to the value specified by idleReplicaCount. This setting must be less than minReplicaCount.

spec.fallback:

This property allows you to define a number of replicas to fall back to if consecutive connection errors happen with the Prometheus endpoint of your Sysdig account.

  • spec.fallback.failureThreshold: The number of consecutive errors to apply the fallback.
  • spec.fallback.replicas: The number of replicas to apply in case of connection error.

spec.advanced.horizontalPodAutoscalerConfig.behavior:

This property allows you to define the behavior of the Kubernetes HPA Object. See the Kubernetes documentation for more information.

5.3 - Configure Recording Rules

Sysdig now supports Prometheus recording rules for metric aggregation and querying.

You can configure recording rules by using the Sysdig API. Ensure that you define them in a Prometheus-compatible way; a minimal example follows the parameter list below. The mandatory parameters are:

  • record: The unique name of the time series. It must be a valid metric name.

  • expr: The PromQL expression to evaluate. In each evaluation cycle, the given expression is evaluated and the result is recorded as a new set of time series with the metric name specified in record.

  • labels: The unique identifiers to add or overwrite before storing the result.
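
For illustration, here is a minimal sketch of a recording rule showing the three mandatory fields, written in standard Prometheus rule syntax (the metric name, expression, and label are hypothetical, and the exact payload accepted by the Sysdig API may differ):

groups:
  - name: example-recording-rules
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
        labels:
          team: payments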

To enable this feature in your environment, contact Sysdig Support.

5.4 - Configure Sysdig with Grafana

Sysdig enables Grafana users to query metrics from Sysdig and visualize them in Grafana dashboards. In order to integrate Sysdig with Grafana, you configure a data source. There are two types of data sources supported:

  • Prometheus

    Prometheus data source comes with Grafana and is natively compatible with PromQL. Sysdig provides a Prometheus-compatible API to achieve API-only integration with Grafana.

  • Sysdig

    Sysdig data source requires additional settings and is more compatible with the simple “form-based” data configuration. Use the Sysdig native API instead of the Prometheus API. See Sysdig Grafana datasource for more information.

Using the Prometheus API on Grafana v6.7 and Above

You use the Sysdig Prometheus API to set up the data source to use with Grafana. Before Grafana can consume Sysdig metrics, Grafana must authenticate itself to Sysdig. Because no UI support for this is currently available in Grafana, you set up HTTP authentication by using the Sysdig API token. An optional verification command is shown after the steps below.

  1. If you are not already running Grafana, spin up a Grafana container as follows:

    $ docker run --rm -p 3000:3000 --name grafana grafana/grafana
    
  2. Log in to Grafana as administrator and create a new data source by using the following information:

    • URL: https://<Monitor URL for Your Region>/prometheus

      See SaaS Regions and IP Ranges and identify the correct URLs associated with your Sysdig application and region.

    • Authentication: Do not select any authentication mechanisms.

    • Access: Server (default)

    • Custom HTTP Headers:

      • Header: Enter the word, Authorization

      • Value:  Enter the word, Bearer , followed by a space and <Your Sysdig API Token>

        API Token is available through Settings > User Profile > Sysdig Monitor API.
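
To verify that the URL and token work before configuring Grafana, you can query the Prometheus-compatible API directly (a sketch that assumes the standard Prometheus HTTP API path; replace the URL and token with your own values):

curl -H "Authorization: Bearer <Your Sysdig API Token>" \
  "https://<Monitor URL for Your Region>/prometheus/api/v1/query?query=up"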

Using the Grafana API on Grafana v6.6 and Below

The feature requires Grafana v5.3.0 or above.

You use the Grafana API to set up the Sysdig datasource.

  1. Download and run Grafana in a container.

    docker run --rm -p 3000:3000 --name grafana grafana/grafana
    
  2. Create a JSON file.

    cat grafana-stg-ds.json
    {
        "name": "Sysdig staging PromQL",
        "orgId": 1,
        "type": "prometheus",
        "access": "proxy",
        "url": "https://app-staging.sysdigcloud.com/prometheus",
        "basicAuth": false,
        "withCredentials": false,
        "isDefault": false,
        "editable": true,
        "jsonData": {
            "httpHeaderName1": "Authorization",
            "tlsSkipVerify": true
        },
        "secureJsonData": {
            "httpHeaderValue1": "Bearer your-Sysdig-API-token"
        }
    }
    
  3. Get your Sysdig API Token and plug it in the JSON file above.

    "httpHeaderValue1": "Bearer your_Sysdig_API_Token"
    
  4. Add the datasource to Grafana.

    curl -u admin:admin -H "Content-Type: application/json" http://localhost:3000/api/datasources -XPOST -d @grafana-stg-ds.json
    
  5. Run Grafana.

    http://localhost:3000
    
  6. Use the default credentials, admin: admin, to sign in to Grafana.

  7. Open the Data Source tab under Configuration on Grafana and confirm that the one you have added is listed on the page.

6 - Troubleshoot Monitoring Integrations

Review the common troubleshooting scenarios you might encounter while getting a Monitor integration working, and see what you can do if an integration does not report metrics after installation.

Check Prerequisites

Some integrations require secrets and other resources to be available in the correct namespace in order to work. Integrations such as database exporters might require you to create a user and grant it special permissions in the database so that the exporter can connect to the endpoint and generate metrics.

Ensure that the prerequisites of the integration are met before proceeding with installation.

Verify Exporter Is Running

If the integration is an exporter, ensure that the pods corresponding to the exporter are running correctly. You can check this after installing the integration. If the exporter is installed as a sidecar of the application (such as Nginx), verify that the exporter container is added to the pod.

You can check the status of the pods with the Kubernetes dashboard Pods Status and Performance or with the following command:

kubectl get pods --namespace=<namespace>

Additionally, if the container has problems and cannot start, check the description of the pod for error messages:

kubectl describe pod <pod-name> --namespace=<namespace>

Verify Metrics Are Generated

Check whether a running exporter is generating metrics by accessing the metrics endpoint:

kubectl port-forward <pod-name> <local-port>:<pod-port> --namespace=<namespace>
curl http://localhost:<local-port>/metrics

This is also valid for applications that don’t need an exporter to generate their own metrics.

If the exporter is not generating metrics, there could be problems accessing or authenticating with the application. Check the logs associated with the pods:

kubectl logs <pod-name> --namespace=<namespace>

If the application is instrumented and is not generating metrics, check if the Prometheus metrics option or the module is activated.

Verify Sysdig Agent Is Scraping Metrics

If an application doesn’t need an exporter to generate metrics, check if it has the default Prometheus annotations.
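
For reference, the default annotation the agent looks for is prometheus.io/scrape=true; prometheus.io/port and prometheus.io/path are commonly set alongside it (the values below are illustrative):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"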

Additionally, you can check if the Sysdig agent can access the metrics endpoint. To do so, use the following command:

kubectl exec <sysdig-agent-pod-name> --namespace=sysdig-agent -- /bin/sh -c "curl http://<exporter-pod-ip>:<pod-port>/metrics"

Select a Sysdig agent pod on the same node as the pod being scraped.

6.1 - Monitor Log Files

You can search for particular strings within a given log file, and create a metric that is displayed in Sysdig Monitor’s Explore page. The metrics appear under the StatsD section:

Sysdig provides this functionality via a “chisel” script called “logwatcher”, written in Lua. You call the script by adding a logwatcher parameter in the chisels section of the agent configuration file (dragent.yaml). You define the log file name and the precise string to be searched. The results are displayed as metrics in the Monitor UI.

Caveats

The logwatcher chisel adds to Sysdig’s monitoring capability but is not a fully featured log monitor. Note the following limitations:

  • No regex support: Sysdig does not offer regex support; you must define the precise log file and string to be searched.

    (If you were to supply a string with spaces, forward-slashes, or back-slashes in it, the metric generated would also have these characters and so could not be used to create an alert.)

  • Limit of 12 string searches/host: Logwatcher is implemented as a LUA script and, due to resources consumed by this chisel, it is not recommended to have more than a dozen string searches configured per agent/host.

Implementation

Edit the agent configuration file to enable the logwatcher chisel. See Understanding the Agent Config Files for editing options.

Preparation

Determine the log file name(s) and string(s) you want to monitor.

To monitor the output of docker logs <container-name>, find the container’s docker log file with:

docker inspect <container-name> | grep LogPath

Edit dragent.yaml

  1. Access dragent.yaml directly at /opt/draios/etc/dragent.yaml.

  2. Add a chisels entry:

    Format:

    chisels:
      - name: logwatcher
        args:
          filespattern: YOURFILENAME.log
          term: YOURSTRING
    

    Sample Entry:

    customerid: 831f2-your-key-here-d69401
    tags: tagname.tagvalue
    chisels:
      - name: logwatcher
        args:
          filespattern: draios.log
          term: Sent
    

    In this example, Sysdig’s own draios.log is searched for the Sent string.

    The output, in the Sysdig Monitor UI, would show the StatsD metric logwatcher.draios_log.Sent and the number of ‘Sent’ items detected.

  3. Optional: Add multiple - name: sections in the config file to search for additional logs/strings.

    Note the recommended 12-string/agent limit.

  4. Restart the agent for changes to take effect.

    For container agent:

    docker restart sysdig-agent
    

    For non-containerized (service) agent:

    service dragent restart
    

Parameters

| Name | Value | Description |
| --- | --- | --- |
| name | logwatcher | The chisel used in the enterprise Sysdig platform to search log files. (Other chisels are available in Sysdig’s open-source product.) |
| filespattern | YOURFILENAME.log | The log file to be searched. Do not specify a path with the file name. |
| term | YOURSTRING | The string to be searched. |

View Log File Metrics in the Monitor UI

To view logwatcher results:

  1. Log in to Sysdig Monitor and select Explore.

  2. Select Entire Infrastructure > Overview by Host.

  3. In the resulting drop-down, either scroll to Metrics > StatsD > logwatcher or enter “logwatcher” in the search field.

    Each string you configured in the agent config file will be listed in the format logwatcher.YOURFILENAME_log.STRING.

  4. The relevant metrics are displayed.

You can also Add an Alert on logwatcher metrics, to be notified when an important log entry appears.

7 - (Legacy) Integrations for Sysdig Monitor

Integrate metrics with Sysdig Monitor from a number of platforms, orchestrators, and a wide range of applications. Sysdig collects metrics from Prometheus, JMX, StatsD, Kubernetes, and many application stacks to provide a 360-degree view of your infrastructure. Many metrics are collected by default out of the box; you can also extend the integration or create custom metrics.

Key Benefits

  • Collects the richest data set for cloud-native visibility and security

  • Polls data, auto-discover context in order to provide operational and security insights

  • Extends the power of Prometheus metrics with additional insights from other metrics types and infrastructure stack

  • Integrate Prometheus alert and events for Kubernetes monitoring needs

  • Expose application metrics using Java JMX and MBeans monitoring

Key Integrations

Inbound

  • Prometheus Metrics

    Describes how Sysdig Agent enables automatically collecting metrics from Prometheus exporters, how to set up your environment, and scrape Prometheus metrics from local as well as remote hosts.

  • Java Management Extension (JMX) Metrics

    Describes how to configure your Java virtual machines so Sysdig Agent can collect JMX metrics using the JMX protocol.

  • StatsD Metrics

    Describes how the Sysdig agent collects custom StatsD metrics with an embedded StatsD server.

  • Node.JS Metrics

    Illustrates how Sysdig is able to monitor node.js applications by linking a library to the node.js codebase.

  • Integrate Applications

    Describes the monitoring capabilities of Sysdig agent with application check scripts or ‘app checks’.

  • Monitor Log Files

    Learn how to search a string by using the chisel script called logwatcher.

  • AWS CloudWatch

    Illustrates how to configure Sysdig to collect various types of CloudWatch metrics.

  • Agent Installation

    Learn how to install Sysdig agents on supported platforms.

Outbound

  • Notification Channels

    Learn how to add, edit, or delete a variety of notification channel types, and how to disable or delete notifications when they are not needed, for example, during scheduled downtime.

  • S3 Capture Storage

    Learn how to configure Sysdig to use an AWS S3 bucket or custom S3 storage for storing Capture files.

Platform Metrics (IBM)

For Sysdig instances deployed on IBM Cloud Monitoring with Sysdig, an additional form of metrics collection is offered: Platform metrics. Rather than being collected by the Sysdig agent, when enabled, Platform metrics are reported to Sysdig directly by the IBM Cloud infrastructure.

Platform metrics are metrics that are exposed by enabled services across the IBM Cloud platform. These services have made metrics and pre-defined dashboards for their services available by publishing metrics associated with the customer’s space or account. Customers can view these platform metrics alongside the metrics from their applications and other services within IBM Cloud monitoring.

Enable this feature by logging into the IBM Cloud console and selecting “Enable” for IBM Platform metrics under the Configure your resource section when creating a new IBM Cloud Monitoring with a Sysdig instance, as described here.

7.1 - (Legacy)Collect Prometheus Metrics

Sysdig supports collecting, storing, and querying Prometheus native metrics and labels. You can use Sysdig in the same way that you use Prometheus and leverage Prometheus Query Language (PromQL) to create dashboards and alerts. Sysdig is compatible with Prometheus HTTP API to query your monitoring data programmatically using PromQL and extend Sysdig to other platforms like Grafana.

From a metric collection standpoint, a lightweight Prometheus server is directly embedded into the Sysdig agent to facilitate metric collection. It also supports targets, instances, and jobs with filtering and relabeling using Prometheus syntax. You can configure the agent to identify the processes that expose Prometheus metric endpoints on its own host and send the collected metrics to the Sysdig collector for storage and further processing.

This document uses metric and time series interchangeably. The description of configuration parameters refers to “metric”, but in strict Prometheus terms, those imply time series. That is, applying a limit of 100 metrics implies applying a limit on time series, where all the time series data might not have the same metric name.

The Prometheus product itself does not necessarily have to be installed for Prometheus metrics collection.

See the Sysdig agent versions and compatibility with Prometheus features:

  • Latest versions of agent (v12.0.0 and above): The following features are enabled by default:

    • Automatically scraping any Kubernetes pods with the following annotation set: prometheus.io/scrape=true
    • Automatically scrape applications supported by Monitoring Integrations.
  • Sysdig agent prior to v12.0.0: Manually enable Prometheus in dragent.yaml file:

      prometheus:
        enabled: true
    

Learn More

The following topics describe in detail how to configure the Sysdig agent for service discovery, metrics collection, and further processing.

See the following blog posts for additional context on Prometheus metrics and how they are typically used.

7.1.1 - (Legacy) Working with Prometheus Metrics

The Sysdig agent uses its visibility into all running processes (at both the host and container levels) to find eligible targets for scraping Prometheus metrics. By default, no scraping is attempted. Once the feature is enabled, the agent assembles a list of eligible targets, applies filtering rules, and sends the resulting metrics back to the Sysdig collector.

Latest Prometheus Features

Sysdig agent v12.0 or above is required for the following capabilities:

Sysdig agent v10.0 or above is required for the following capabilities:

  • New capabilities of using Prometheus data:

    • Ability to visualize data using PromQL queries. See Using PromQL.

    • Create alerts from PromQL-based Dashboards. See Create Panel Alerts.

    • Backward compatibility for dashboards v2 and alerts.

      The new PromQL data cannot be visualized by using the Dashboard v2 Histogram. Use time-series based visualization for the histogram metrics.

  • New metrics limit per agent

  • 10-second data granularity

  • Higher retention rate on the new metric store.

Prerequisites and Guidelines

  • Sysdig agent v10.0.0 or above is required for the latest Prometheus features.

  • The Prometheus feature must be enabled in the dragent.yaml file:

    prometheus:
      enabled: true
    

    See Setting up the Environment for more information.

  • The endpoints of the target should be reachable over a TCP connection from the agent. The agent scrapes a target, remote or local, specified by the IP:port or the URL in dragent.yaml.

Service Discovery

To use native Prometheus service discovery, enable Promscrape V2 as described in Enable Prometheus Native Service Discovery. This section covers the Sysdig way of service discovery that involves configuring process filters in the Sysdig agent.

The way service discovery works in the Sysdig agent differs from that of the Prometheus server. While the Prometheus server has built-in integration with several service discovery mechanisms and reads its configuration settings from the prometheus.yml file, the Sysdig agent auto-discovers any process (exporter or instrumented) that matches the specifications in the dragent.yaml file and instructs the embedded lightweight Prometheus server to retrieve the metrics from it.

The lightweight Prometheus server in the agent is named promscrape and is controlled by the flag of the same name in the dragent.yaml file. See Configuring Sysdig Agent for more information.

Unlike the Prometheus server that can scrape processes running on all the machines in a cluster, the agent can scrape only those processes that are running on the host that it is installed on.

Within the set of eligible processes/ports/endpoints, the agent scrapes only the ports that are exporting Prometheus metrics and will stop attempting to scrape or retry on ports based on how they respond to attempts to connect and scrape them. It is therefore strongly recommended that you create a configuration that restricts the process and ports for attempted scraping to the minimum expected range for your exporters. This minimizes the potential for unintended side-effects in both the Agent and your applications due to repeated failed connection attempts.
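For example, the following dragent.yaml sketch (the container image name and port range are placeholders, not values from this document) limits scraping attempts to a single exporter image and a narrow set of ports:

prometheus:
  enabled: true
  process_filter:
    - include:
        container.image: my-registry/my-exporter   # placeholder: your exporter image
        conf:
          path: "/metrics"
          port_filter:
            - include: 9090-9100   # only attempt the ports your exporters actually use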

The end to end metric collection can be summarized as follows:

  1. A process is determined to be eligible for possible scraping if it positively matches against a series of Process Filter include/exclude rules. See Process Filter for more information.

  2. The Agent will then attempt to scrape an eligible process at a /metrics endpoint on all of its listening TCP ports unless the additional configuration is present to restrict scraping to a subset of ports and/or another endpoint name.

  3. Upon receiving the metrics, the agent applies the configured filtering rules before sending them to the Sysdig collector.

The metrics ultimately appear in the Sysdig Monitor Explore interface in the Prometheus section.

7.1.2 - (Legacy) Set up the Environment

Quick Start For Kubernetes Environments

Prometheus users who are already leveraging Kubernetes Service Discovery (specifically the approach in this sample prometheus-kubernetes.yml) may already have Annotations attached to the Pods that mark them as eligible for scraping. Such environments can quickly begin scraping the same metrics using the Sysdig Agent in a couple of easy steps.

  1. Enable the Prometheus metrics feature in the Sysdig Agent. Assuming you are deploying using DaemonSets, the needed config can be added to the Agent’s dragent.yaml by including the following in your DaemonSet YAML (placing it in the env section for the sysdig-agent container):

    - name: ADDITIONAL_CONF
      value: "prometheus:\n  enabled: true"
    
  2. Ensure the Kubernetes Pods that contain your Prometheus exporters have been deployed with the following Annotations to enable scraping (substituting the listening exporter-TCP-port) :

    spec:
      template:
        metadata:
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "exporter-TCP-port"
    

    The configuration above assumes your exporters use the typical endpoint called /metrics. If an exporter is using a different endpoint, this can also be specified by adding the following additional optional Annotation, substituting the exporter-endpoint-name:

    prometheus.io/path: "/exporter-endpoint-name"
    

If you try this Kubernetes Deployment of a simple exporter, you will quickly see auto-discovered Prometheus metrics being displayed in Sysdig Monitor. You can use this working example as a basis to similarly Annotate your own exporters.

If you have Prometheus exporters not deployed in annotated Kubernetes Pods that you would like to scrape, the following sections describe the full set of options to configure the Agent to find and scrape your metrics.

Quick Start for Container Environments

In order for Prometheus scraping to work in a Docker-based container environment, set the following labels to the application containers, substituting <exporter-port> and <exporter-path> with the correct port and path where metrics are exported by your application:

  • io.prometheus.scrape=true

  • io.prometheus.port=<exporter-port>

  • io.prometheus.path=<exporter-path>

For example, if mysqld-exporter is to be scraped, spin up the container as follows:

docker run -d -l io.prometheus.scrape=true -l io.prometheus.port=9104 -l io.prometheus.path=/metrics mysqld-exporter

7.1.3 - (Legacy) Configuring Sysdig Agent

This feature is not supported with Promscrape V2. For information on different versions of Promscrape and migrating to the latest version, see Migrating from Promscrape V1 to V2.

As is typical for the agent, the default configuration for the feature is specified in dragent.default.yaml, and you can override the defaults by setting parameters in dragent.yaml. For any parameter you do not set in dragent.yaml, the default in dragent.default.yaml remains in effect.

Main Configuration Parameters

  • prometheus (default: see below): Turns Prometheus scraping on and off.

  • process_filter (default: see below): Specifies which processes may be eligible for scraping. See Process Filter, below.

  • use_promscrape (default: see below): Determines whether to use promscrape for scraping Prometheus metrics.

promscrape

Promscrape is a lightweight Prometheus server that is embedded with the Sysdig agent. The use_promscrape parameter controls whether to use it to scrape Prometheus endpoints.

Promscrape has two versions: Promscrape V1 and Promscrape V2. With V1, Sysdig agent discovers scrape targets through the process_filter rules. With V2, promscrape itself discovers targets by using the standard Prometheus configuration, allowing the use of relabel_configs to find or modify targets.

  • use_promscrape (default: true): Determines whether the embedded promscrape server is used to scrape Prometheus endpoints.
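For instance, if you need to fall back to the agent's older, non-promscrape scraping path (on agent versions where that is still supported), a minimal dragent.yaml sketch would be:

use_promscrape: false   # disable the embedded promscrape server; use the legacy scraping path
prometheus:
  enabled: true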

prometheus

The prometheus section defines the behavior related to Prometheus metrics collection and analysis. It lets you turn the feature on, set an agent-side limit on the number of metrics to be scraped, and determine whether to report histogram metrics and log failed scrape attempts.

  • enabled (default: false): Turns Prometheus scraping on and off.

  • interval (default: 10): How often, in seconds, the agent scrapes a port for Prometheus metrics.

  • prom_service_discovery (default: true): Enables native Prometheus service discovery. If disabled, promscrape.v1 is used to scrape the targets. See Enable Prometheus Native Service Discovery. On agent versions prior to 11.2, the default is false.

  • max_metrics (default: 1000): The maximum number of total Prometheus metrics scraped across all targets. This is a per-agent limit and is separate from the limits on other custom metrics such as StatsD, JMX, and app checks.

  • timeout (default: 1): The amount of time, in seconds, the agent waits while scraping a Prometheus endpoint before timing out. As of agent v10.0, this parameter is used only when promscrape is disabled. Because promscrape is now the default, timeout can be considered deprecated; however, it is still honored when you explicitly disable promscrape.
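Putting these parameters together, a minimal dragent.yaml override might look like the following sketch (the values shown are illustrative, not recommendations):

prometheus:
  enabled: true
  interval: 10        # scrape every 10 seconds
  max_metrics: 1000   # per-agent limit on scraped Prometheus time series
  prom_service_discovery: true   # use native Prometheus service discovery (Promscrape V2)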

Process Filter

The process_filter section specifies which of the processes known by an agent may be eligible for scraping.

Note that once you specify a process_filter in your dragent.yaml, this replaces the entire Prometheus process_filter section (i.e. all the rules) shown in the dragent.default.yaml.

The Process Filter is specified in a series of include and exclude rules that are evaluated top-to-bottom for each process known by an Agent. If a process matches an include rule, scraping will be attempted via a /metrics endpoint on each listening TCP port for the process, unless a conf section also appears within the rule to further restrict how the process will be scraped. See conf for more information.

Multiple patterns can be specified in a single rule, in which case all patterns must match for the rule to be a match (AND logic).

Within a pattern value, simple “glob” wildcarding may be used, where * matches any number of characters (including none) and ? matches any single character. Note that due to YAML syntax, when using wildcards, be sure to enclose the value in quotes ("*").
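For instance, the following sketch combines two patterns in one include rule, so a process is eligible only if its name is java and its command line contains app.jar (both values are illustrative):

process_filter:
  - include:
      process.name: java
      process.cmdline: "*app.jar*"   # quoted because of the wildcards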

The table below describes the supported patterns in Process Filter rules. To provide realistic examples, we’ll use a simple sample Prometheus exporter (source code here) which can be deployed as a container using the Docker command line below. To help illustrate some of the configuration options, this sample exporter presents Prometheus metrics on /prometheus instead of the more common /metrics endpoint, which will be shown in the example configurations further below.

# docker run -d -p 8080:8080 \
    --label class="exporter" \
    --name my-java-app \
    luca3m/prometheus-java-app

# ps auxww | grep app.jar
root     11502 95.9  9.2 3745724 753632 ?      Ssl  15:52   1:42 java -jar /app.jar --management.security.enabled=false

# curl http://localhost:8080/prometheus
...
random_bucket{le="0.005",} 6.0
random_bucket{le="0.01",} 17.0
random_bucket{le="0.025",} 51.0
...

Pattern name

Description

Example

container.image

Matches if the process is running inside a container running the specified image

- include:

container.image: luca3m/prometheus-java-app

container.name

Matches if the process is running inside a container with the specified name

- include:

container.name: my-java-app

container.label.*

Matches if the process is running in a container that has a Label matching the given value

- include:

container.label.class: exporter

kubernetes.<object>.annotation.* / kubernetes.<object>.label.*

Matches if the process is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with the Annotation/Label matching the given value.

Note: This pattern does not apply to the Docker-only command-line shown above, but would instead apply if the exporter were installed as a Kubernetes Deployment using this example YAML.

Note: See Kubernetes Objects, below, for information on the full set of supported Annotations and Labels.

- include:

kubernetes.pod.annotation.prometheus.io/scrape: true

process.name

Matches the name of the running process

- include:

process.name: java

process.cmdline

Matches a command line argument

- include:

process.cmdline: "*app.jar*"

port

Matches if the process is listening on one or more TCP ports.

The pattern for a single rule can specify a single port as shown in this example, or a single range (e.g. 8079-8081), but does not support comma-separated lists of ports/ranges.

Note: This parameter is only used to confirm if a process is eligible for scraping based on the ports on which it is listening. For example, if a process is listening on one port for application traffic and has a second port open for exporting Prometheus metrics, it would be possible to specify the application port here (but not the exporting port), and the exporting port in the conf section (but not the application port), and the process would be matched as eligible and the exporting port would be scraped.

- include:

port: 8080

appcheck.match

Matches if an Application Check with the specific name or pattern is scheduled to run for the process.

- exclude:

appcheck.match: "*"

Instead of the include examples shown above, each of which would have matched our process, the previously described ability to combine multiple patterns in a single rule means the following very strict configuration would also have matched:
- include:
    container.image: luca3m/prometheus-java-app
    container.name: my-java-app
    container.label.class: exporter
    process.name: java
    process.cmdline: "*app.jar*"
    port: 8080

conf

Each include rule in the process_filter may include a conf portion that further describes how scraping will be attempted on the eligible process. If a conf portion is not included, scraping will be attempted at a /metrics endpoint on all listening ports of the matching process. The possible settings:

Parameter name

Description

Example

port

Either a static number for a single TCP port to be scraped, or a container/Kubernetes Label name or Kubernetes Annotation specified in curly braces. If the process is running in a container that is marked with this Label or is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with this Annotation/Label, scraping will be attempted only on the port specified as the value of the Label/Annotation.

Note: The Label/Annotation name to match against does not include the surrounding syntax shown in the example (the container.label. or kubernetes.<object>.annotation. prefix and the curly braces).

Note: See Kubernetes Objects for information on the full set of supported Annotations and Labels.

Note: If running the exporter inside a container, this should specify the port number that the exporter process in the container is listening on, not the port that the container exposes to the host.

port: 8080

- or -

port: "{container.label.io.prometheus.port}"

- or -

port: "{kubernetes.pod.annotation.prometheus.io/port}"

port_filter

A set of include and exclude rules that define the ultimate set of listening TCP ports for an eligible process on which scraping may be attempted. Note that the syntax is different from the port pattern option within the higher-level include rule in the process_filter. Here a given rule can include single ports, comma-separated lists of ports (enclosed in square brackets), or contiguous port ranges (without brackets).

port_filter:
  - include: 8080
  - exclude: [9092,9200,9300]
  - include: 9090-9100

path

Either the static specification of an endpoint to be scraped, or a container/Kubernetes Label name or Kubernetes Annotation specified in curly braces. If the process is running in a container that is marked with this Label or is attached to a Kubernetes object (Pod, Namespace, etc.) that is marked with this Annotation/Label, scraping will be attempted via the endpoint specified as the value of the Label/Annotation.

If path is not specified, or specified but the Agent does not find the Label/Annotation attached to the process, the common Prometheus exporter default of /metrics will be used.

Note: The Label/Annotation name to match against does not include the surrounding syntax shown in the example (the prefix and the curly braces).

Note: See Kubernetes Objects for information on the full set of supported Annotations and Labels.

path: "/prometheus"

- or -

path: "{container.label.io.prometheus.path}"

- or -

path: "{kubernetes.pod.annotation.prometheus.io/path}"

host

A hostname or IP address. The default is localhost.

host: 192.168.1.101
- or -
host: subdomain.example.com
- or -
host: localhost

use_https

When set to true, connectivity to the exporter will only be attempted through HTTPS instead of HTTP. It is false by default.

(Available in Agent version 0.79.0 and newer)

use_https: true

ssl_verify

When set to true, the server certificate is verified for HTTPS connections. It is false by default. (Verification was enabled by default in versions before 0.79.0.)

(Available in Agent version 0.79.0 and newer)

ssl_verify: true
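As a combined illustration of these conf settings (a sketch only, reusing the sample exporter described above), a rule could match on the container label and then restrict scraping to the exporter's endpoint and port:

prometheus:
  enabled: true
  process_filter:
    - include:
        container.label.class: exporter
        conf:
          path: "/prometheus"   # the sample exporter's non-default endpoint
          port_filter:
            - include: 8080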

Authentication Integration

As of agent version 0.89, Sysdig can collect Prometheus metrics from endpoints requiring authentication. Use the parameters below to enable this function.

  • For username/password authentication:

    • username

    • password

  • For authentication using a token:

    • auth_token_path
  • For certificate authentication with a certificate key:

    • auth_cert_path

    • auth_key_path

Token substitution is also supported for all the authorization parameters. For instance, a username can be taken from a Kubernetes annotation by specifying:

username: "{kubernetes.service.annotation.prometheus.openshift.io/username}"

conf Authentication Example

Below is an example dragent.yaml section showing the Prometheus authentication configuration options, covering OpenShift, Kubernetes, and etcd.

In this example:

  • The username/password are taken from a default annotation used by OpenShift.

  • The auth token path is commonly available in Kubernetes deployments.

  • The certificate and key used here for etcd may normally not be as easily accessible to the agent. In this case they were extracted from the host namespace, constructed into Kubernetes secrets, and then mounted into the agent container.

prometheus:
  enabled: true
  process_filter:
    - include:
        port: 1936
        conf:
            username: "{kubernetes.service.annotation.prometheus.openshift.io/username}"
            password: "{kubernetes.service.annotation.prometheus.openshift.io/password}"
    - include:
        process.name: kubelet
        conf:
            port: 10250
            use_https: true
            auth_token_path: "/run/secrets/kubernetes.io/serviceaccount/token"
    - include:
        process.name: etcd
        conf:
            port: 2379
            use_https: true
            auth_cert_path: "/run/secrets/etcd/client-cert"
            auth_key_path: "/run/secrets/etcd/client-key"

Kubernetes Objects

As described above, there are multiple configuration options that can be set based on auto-discovered values for Kubernetes Labels and/or Annotations. The format in each case begins with "kubernetes.OBJECT.annotation." or "kubernetes.OBJECT.label." where OBJECT can be any of the following supported Kubernetes object types:

  • daemonSet

  • deployment

  • namespace

  • node

  • pod

  • replicaSet

  • replicationController

  • service

  • statefulset

The configuration text you add after the final dot becomes the name of the Kubernetes Label/Annotation that the Agent will look for. If the Label/Annotation is discovered attached to the process, the value of that Label/Annotation will be used for the configuration option.

Note that there are multiple ways for a Kubernetes Label/Annotation to be attached to a particular process. One of the simplest examples of this is the Pod-based approach shown in Quick Start For Kubernetes Environments. However, as an example alternative to marking at the Pod level, you could attach Labels/Annotations at the Namespace level, in which case auto-discovered configuration options would apply to all processes running in that Namespace regardless of whether they’re in a Deployment, DaemonSet, ReplicaSet, etc.
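For example, a sketch of a rule keyed off a Namespace-level Annotation (assuming you mark the Namespace with the usual prometheus.io annotations; adjust the names to your own conventions) could look like:

process_filter:
  - include:
      kubernetes.namespace.annotation.prometheus.io/scrape: true
      conf:
        path: "{kubernetes.namespace.annotation.prometheus.io/path}"
        port: "{kubernetes.namespace.annotation.prometheus.io/port}"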

7.1.4 - (Legacy) Filtering Prometheus Metrics

As of Sysdig agent 9.8.0, a lightweight Prometheus server named promscrape is embedded in the agent, and a prometheus.yaml file is included as part of the configuration files. Using open-source Prometheus capabilities, Sysdig allows you to filter Prometheus metrics at the source before ingestion. To do so, you will:

  • Ensure that Prometheus scraping is enabled in the dragent.yaml file.

    prometheus:
      enabled: true
    
  • On agent v9.8.0 and above, enable the feature by setting the use_promscrape parameter to true in the dragent.yaml file. See Enable Filtering at Ingestion.

  • Edit the configuration in the prometheus.yaml file. See Edit Prometheus Configuration File.

    Sysdig-specific configuration is found in the prometheus.yaml file.

Enable Filtering at Ingestion

On agent v9.8.0, in order for target filtering to work, the use_promscrape parameter in the dragent.yaml must be set to true. For more information on configuration, see Configuring Sysdig Agent.

use_promscrape: true

On agent v10.0 and above, use_promscrape is enabled by default, which means promscrape is used for scraping Prometheus metrics.

Filtering configuration is optional. The absence of prometheus.yaml  will not change the existing behavior of the agent.

Edit Prometheus Configuration File

About the Prometheus Configuration File

The prometheus.yaml file contains mostly the filtering/relabeling configuration in a list of key-value pairs, representing target process attributes.

You replace keys and values with the desired tags corresponding to your environment.

In this file, you will configure the following:

  • Default scrape interval (optional).

    For example:

    scrape_interval: 10s

  • Of the labeling parameters that Prometheus offers, Sysdig supports only metric_relabel_configs. The relabel_config parameter is not supported.

  • Zero or more process-specific filtering configurations (optional).

    See Kubernetes Environments and Docker Environments

    The filtering configuration includes:

    • Filtering rules

      For example:

      - source_labels: [container_label_io_kubernetes_pod_name]

    • Limit on number of scraped samples (optional)

      For example:

      sample_limit: 2000

  • Default filtering configuration (optional). The filtering configuration includes:

    • Filtering rules

      For example:

      - source_labels: [car]

    • Limit on number of scraped samples (optional)

      For example:

      sample_limit: 2000

The prometheus.yaml file is installed alongside dragent.yaml. For the most part, the syntax of prometheus.yaml complies with the standard Prometheus configuration.

Default Configuration

A configuration with empty key-value pairs is considered a default configuration. The default configuration will be applied to all the processes to be scraped that don’t have a matching filtering configuration. In Sample Prometheus Configuration File, the job_name: 'default' section represents the default configuration.
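For reference, the default job from the sample file further below reduces to the following sketch, where sysdig_sd_configs carries no tags and therefore acts as the catch-all configuration:

scrape_configs:
- job_name: 'default'
  sysdig_sd_configs:   # empty (no tags): applies to all scraped processes without a more specific match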

Kubernetes Environments

If the agent runs in Kubernetes environments (Open Source/OpenShift/GKE), include the following Kubernetes objects as key-value pairs. See Agent Install: Kubernetes for details on agent installation.

For example:

sysdig_sd_configs:
- tags:
    namespace: backend
    deployment: my-api

In addition to the aforementioned tags, any of these object types can be matched against:

daemonset: my_daemon
deployment: my_deployment
hpa: my_hpa
namespace: my_namespace
node: my_node
pod: my_pod
replicaset: my_replica
replicationcontroller: my_controller
resourcequota: my_quota
service: my_service
statefulset: my_statefulset

For Kubernetes/OpenShift/GKE deployments, prometheus.yaml shares the same ConfigMap with dragent.yaml.
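In practice this means both files appear as keys of the same ConfigMap. A trimmed sketch is shown below; the ConfigMap name and namespace follow the common sysdig-agent deployment and may differ in your cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: sysdig-agent
  namespace: sysdig-agent
data:
  dragent.yaml: |
    prometheus:
      enabled: true
  prometheus.yaml: |
    global:
      scrape_interval: 20s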

Docker Environments

In Docker environments, include attributes such as container, host, port, and more. For example:

sysdig_sd_configs:
- tags:
    host: my-host
    port: 8080

For Docker-based deployments, prometheus.yaml can be mounted from the host.

Sample Prometheus Configuration File

global:
  scrape_interval: 20s
scrape_configs:
- job_name: 'default'
  sysdig_sd_configs: # default config
  relabel_configs:
- job_name: 'my-app-job'
  sample_limit: 2000
  sysdig_sd_configs:  # apply this filtering config only to my-app
  - tags:
      namespace: backend
      deployment: my-app
  metric_relabel_configs:
  # Drop all metrics starting with http_
  - source_labels: [__name__]
    regex: "http_(.+)"
    action: drop
  # Drop all metrics for which the city label equals atlantis
  - source_labels: [city]
    regex: "atlantis"
    action: drop

7.1.5 - (Legacy) Example Configuration

This topic introduces you to default and specific Prometheus configurations.

Default Configuration

As an example that pulls together many of the configuration elements shown above, consider the default Agent configuration that’s inherited from the dragent.default.yaml.

prometheus:
  enabled: true
  interval: 10
  log_errors: true
  max_metrics: 1000
  max_metrics_per_process: 100
  max_tags_per_metric: 20

  # Filtering processes to scan. Processes not matching a rule will not
  # be scanned
  # If an include rule doesn't contain a port or port_filter in the conf
  # section, we will scan all the ports that a matching process is listening to.
  process_filter:
    - exclude:
        process.name: docker-proxy
    - exclude:
        container.image: sysdig/agent
    # special rule to exclude processes matching configured prometheus appcheck
    - exclude:
        appcheck.match: prometheus
    - include:
        container.label.io.prometheus.scrape: "true"
        conf:
            # Custom path definition
            # If the Label doesn't exist we'll still use "/metrics"
            path: "{container.label.io.prometheus.path}"

            # Port definition
            # - If the Label exists, only scan the given port.
            # - If it doesn't, use port_filter instead.
            # - If there is no port_filter defined, skip this process
            port: "{container.label.io.prometheus.port}"
            port_filter:
                - exclude: [9092,9200,9300]
                - include: 9090-9500
                - include: [9913,9984,24231,42004]
    - exclude:
        container.label.io.prometheus.scrape: "false"
    - include:
        kubernetes.pod.annotation.prometheus.io/scrape: true
        conf:
            path: "{kubernetes.pod.annotation.prometheus.io/path}"
            port: "{kubernetes.pod.annotation.prometheus.io/port}"
    - exclude:
        kubernetes.pod.annotation.prometheus.io/scrape: false

Consider the following about this default configuration:

  • All Prometheus scraping is disabled by default. To enable the entire configuration shown here, you would only need to add the following to your dragent.yaml:

    prometheus:
      enabled: true
    

    Once this option is enabled, any pods (in Kubernetes) that have the right annotations set, or containers (in non-Kubernetes environments) that have the right labels set, will automatically be scraped.

  • Once enabled, this default configuration is ideal for the use case described in the Quick Start For Kubernetes Environments.

  • A Process Filter rule excludes processes that are likely to exist in most environments but are known to never export Prometheus metrics, such as the Docker Proxy and the Agent itself.

  • Another Process Filter rule ensures that any processes configured to be scraped by the legacy Prometheus application check will not be scraped.

  • Another Process Filter rule is tailored to use container Labels. Processes marked with the container Label io.prometheus.scrape will become eligible for scraping, and if further marked with container Labels io.prometheus.port and/or io.prometheus.path, scraping will be attempted only on this port and/or endpoint. If the container is not marked with the specified path Label, scraping the /metrics endpoint will be attempted. If the container is not marked with the specified port Label, any listening ports in the port_filter will be attempted for scraping (this port_filter in the default is set for the range of ports for common Prometheus exporters, with exclusions for ports in the range that are known to be used by other applications that are not exporters).

  • The final Process Filter Include rule is tailored to the use case described in the Quick Start For Kubernetes Environments.

Scrape a Single Custom Process

If you need to scrape a single custom process, for instance, a java process listening on port 9000 with path /prometheus, add the following to the dragent.yaml:

prometheus:
  enabled: true
  process_filter:
    - include:
        process.name: java
        port: 9000
        conf:
          # ensure we only scrape port 9000 as opposed to all ports this process may be listening to
          port: 9000
          path: "/prometheus"

This configuration overrides the default process_filter section shown in Default Configuration. You can add relevant rules from the default configuration to this to further filter down the metrics.

port has different purposes depending on where it’s placed in the configuration. When placed under the include section, it is a condition for matching the include rule.

Placing a port under conf indicates that only that particular port is scraped when the rule is matched as opposed to all the ports that the process could be listening on.

In this example, the first rule matches the Java process listening on port 9000, and only port 9000 of that process will be scraped.

Scrape a Single Custom Process Based on Container Labels

If you still want to scrape based on container labels, you could just append the relevant rules from the defaults to the process_filter. For example:

prometheus:
  enabled: true
  process_filter:
    - include:
        process.name: java
        port: 9000
        conf:
          # ensure we only scrape port 9000 as opposed to all ports this process may be listening to
          port: 9000
          path: "/prometheus"
    - exclude:
        process.name: docker-proxy
    - include:
        container.label.io.prometheus.scrape: "true"
        conf:
            path: "{container.label.io.prometheus.path}"
            port: "{container.label.io.prometheus.port}"

port has a different meaning depending on where it’s placed in the configuration. When placed under the include section, it’s a condition for matching the include rule.

Placing port under conf indicates that only that port is scraped when the rule is matched as opposed to all the ports that the process could be listening on.

In this example, the first rule matches the process listening on port 9000, and only port 9000 of the java process will be scraped.

Container Environment

With this default configuration enabled, a containerized install of our example exporter shown below would be automatically scraped via the Agent.

# docker run -d -p 8080:8080 \
    --label io.prometheus.scrape="true" \
    --label io.prometheus.port="8080" \
    --label io.prometheus.path="/prometheus" \
    luca3m/prometheus-java-app

Kubernetes Environment

In a Kubernetes-based environment, a Deployment with the Annotations as shown in this example YAML would be scraped by enabling the default configuration.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-java-app
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-java-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/prometheus"
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: prometheus-java-app
          image: luca3m/prometheus-java-app
          imagePullPolicy: Always

Non-Containerized Environment

This is an example of a non-containerized environment or a containerized environment that doesn’t use Labels or Annotations. The following dragent.yaml would override the default and do per-second scrapes of our sample exporter and also a second exporter on port 5005, each at their respective non-standard endpoints. This can be thought of as a conservative “whitelist” type of configuration since it restricts scraping to only exporters that are known to exist in the environment and the ports on which they’re known to export Prometheus metrics.

prometheus:
  enabled: true
  interval: 1
  process_filter:
    - include:
        process.cmdline: "*app.jar*"
        conf:
          port: 8080
          path: "/prometheus"
    - include:
        port: 5005
        conf:
          port: 5005
          path: "/wacko"

port has a different meaning depending on where it’s placed in the configuration. When placed under the include section, it’s a condition for matching the include rule. Placing port under conf indicates that only that port is scraped when the rule is matched as opposed to all the ports that the process could be listening on.

In this example, the first rule matches the process whose command line matches *app.jar*; only port 8080 of that java process will be scraped, as opposed to all the ports it could be listening on. The second rule matches port 5005, and only port 5005 of the matching process will be scraped.

7.1.6 - (Legacy) Logging and Troubleshooting

Logging

After the Agent begins scraping Prometheus metrics, there may be a delay of up to a few minutes before the metrics become visible in Sysdig Monitor. To help quickly confirm that your configuration is correct, starting with Agent version 0.80.0 the following log line appears in the Agent log the first time after startup that the Agent has found and is successfully scraping at least one Prometheus exporter:

2018-05-04 21:42:10.048, 8820, Information, 05-04 21:42:10.048324 Starting export of Prometheus metrics

As this is an INFO level log message, it will appear in Agents using the default logging settings. To reveal even more detail, increase the Agent log level to DEBUG, which produces a message like the following that reveals the name of a specific metric first detected. You can then look for this metric to become visible in Sysdig Monitor shortly after.

2018-05-04 21:50:46.068, 11212, Debug, 05-04 21:50:46.068141 First prometheus metrics since agent start: pid 9583: 5 metrics including: randomSummary.95percentile

Troubleshooting

See the previous section for information on expected log messages during successful scraping. If you have enabled Prometheus and are not seeing the Starting export message shown there, revisit your configuration.

It is also suggested to leave the configuration option at its default setting of log_errors: true, which will reveal any issues scraping eligible processes in the Agent log.
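A sketch of the relevant dragent.yaml settings (log_errors already defaults to true, so this is only needed if you previously changed it):

prometheus:
  enabled: true
  log_errors: true   # log scrape failures for eligible processes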

For example, here is an error message for a failed scrape of a TCP port that was listening but not accepting HTTP requests:

2017-10-13 22:00:12.076, 4984, Error, sdchecks[4987] Exception on running check prometheus.5000: Exception('Timeout when hitting http://localhost:5000/metrics',)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, Traceback (most recent call last):
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/sdchecks.py", line 246, in run
2017-10-13 22:00:12.076, 4984, Error, sdchecks, self.check_instance.check(self.instance_conf)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 44, in check
2017-10-13 22:00:12.076, 4984, Error, sdchecks, metrics = self.get_prometheus_metrics(query_url, timeout, "prometheus")
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 105, in get_prometheus_metrics
2017-10-13 22:00:12.077, 4984, Error, sdchecks, raise Exception("Timeout when hitting %s" % url)
2017-10-13 22:00:12.077, 4984, Error, sdchecks, Exception: Timeout when hitting http://localhost:5000/metrics

Here is an example error message for a failed scrape of a port that was responding to HTTP requests on the /metrics endpoint but not responding with valid Prometheus-format data. The invalid endpoint is responding as follows:

# curl http://localhost:5002/metrics
This ain't no Prometheus metrics!

And the corresponding error message in the Agent log, indicating no further scraping will be attempted after the initial failure:

2017-10-13 22:03:05.081, 5216, Information, sdchecks[5219] Skip retries for Prometheus error: could not convert string to float: ain't
2017-10-13 22:03:05.082, 5216, Error, sdchecks[5219] Exception on running check prometheus.5002: could not convert string to float: ain't

7.1.7 - (Legacy) Collecting Prometheus Metrics from Remote Hosts

This feature is not supported with Promscrape V2. For information on different versions of Promscrape and migrating to the latest version, see Migrating from Promscrape V1 to V2.


Sysdig Monitor can collect Prometheus metrics from remote endpoints with minimum configuration. Remote endpoints (remote hosts) refer to hosts where Sysdig Agent cannot be deployed. For example, a Kubernetes master node on managed Kubernetes services such as GKE and EKS where user workload cannot be deployed, which in turn implies no Agents involved. Enabling remote scraping on such hosts is as simple as identifying an Agent to perform the scraping and declaring the endpoint configurations with a remote services section in the Agent configuration file.

The collected Prometheus metrics are reported under and associated with the Agent that performed the scraping as opposed to associating them with a process.

Preparing the Configuration File

Multiple Agents can share the same configuration. Therefore, determine which of those Agents scrape the remote endpoints in the dragent.yaml file. This is applicable to both Kubernetes and container environments.

  • Create a separate configuration section for remote services in the Agent configuration file under the prometheus configuration.

  • Include a configuration section for each remote endpoint, and add either a URL or host/port (and an optional path) parameter to each section to identify the endpoint to scrape. The optional path identifies the resource at the endpoint. An empty path parameter defaults to the "/metrics" endpoint for scraping.

  • Optionally, add custom tags for each endpoint configuration for remote services. In the absence of tags, metric reporting might not work as expected when multiple endpoints are involved. Agents cannot distinguish similar metrics scraped from multiple endpoints unless those metrics are uniquely identified by tags.

To help you get started, an example configuration for Kubernetes is given below:

prometheus:
  remote_services:
        - prom_1:
            kubernetes.node.annotation.sysdig.com/region: europe
            kubernetes.node.annotation.sysdig.com/scraper: true
            conf:
                url: "https://xx.xxx.xxx.xy:5005/metrics"
                tags:
                    host: xx.xxx.xxx.xy
                    service: prom_1
                    scraping_node: "{kubernetes.node.name}"
        - prom_2:
            kubernetes.node.annotation.sysdig.com/region: india
            kubernetes.node.annotation.sysdig.com/scraper: true
            conf:
                host: xx.xxx.xxx.yx
                port: 5005
                use_https: true
                tags:
                    host: xx.xxx.xxx.yx
                    service: prom_2
                    scraping_node: "{kubernetes.node.name}"
        - prom_3:
            kubernetes.pod.annotation.sysdig.com/prom_3_scraper: true
            conf:
                url: "{kubernetes.pod.annotation.sysdig.com/prom_3_url}"
                tags:
                    service: prom_3
                    scraping_node: "{kubernetes.node.name}"
        - haproxy:
            kubernetes.node.annotation.yourhost.com/haproxy_scraper: true
            conf:
                host: "mymasternode"
                port: 1936
                path: "/metrics"
                username: "{kubernetes.node.annotation.yourhost.com/haproxy_username}"
                password: "{kubernetes.node.annotation.yourhost.com/haproxy_password}"
                tags:
                    service: router

In the above example, scraping is triggered by node and pod annotations. You can add annotations to nodes and pods by using the kubectl annotate command as follows:

kubectl annotate node mynode --overwrite sysdig.com/region=india sysdig.com/scraper=true yourhost.com/haproxy_scraper=true yourhost.com/haproxy_username=admin yourhost.com/haproxy_password=admin

In this example, you set annotations on a node to trigger scraping of the prom_2 and haproxy services as defined in the above configuration.

Preparing Container Environments

An example configuration for Docker environment is given below:

prometheus:
  remote_services:
        - prom_container:
            container.label.com.sysdig.scrape_xyz: true
            conf:
                url: "https://xyz:5005/metrics"
                tags:
                    host: xyz
                    service: xyz

In order for remote scraping to work in a Docker-based container environment, set the com.sysdig.scrape_xyz=true label on the Agent container. For example:

docker run -d --name sysdig-agent --restart always --privileged --net host --pid host -l com.sysdig.scrape_xyz=true -e ACCESS_KEY=<KEY> -e COLLECTOR=<COLLECTOR> -e SECURE=true -e TAGS=example_tag:example_value -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro --shm-size=512m sysdig/agent

Substitute <KEY>, <COLLECTOR>, TAGS with your account key, collector, and tags respectively.

Syntax of the Rules

The syntax of the rules for remote_services is almost identical to that of the process_filter, with one exception: the remote_services section does not use include/exclude rules. In the process_filter, only the first include or exclude rule that matches a process is applied, whereas in the remote_services section each rule has a corresponding service name and all matching rules are applied.

Rule Conditions

The rule conditions work the same way as those for the process_filter. The only caveat is that the rules will be matched against the Agent process and container because the remote process/context is unknown. Therefore, matches for container labels and annotations work as before but they must be applicable to the Agent container as well. For instance, node annotations will apply because the Agent container runs on a node.

For annotations, multiple patterns can be specified in a single rule, in which case all patterns must match for the rule to be a match (AND operator). In the following example, the endpoint will not be considered unless both the annotations match:

kubernetes.node.annotation.sysdig.com/region_scraper: europe
kubernetes.node.annotation.sysdig.com/scraper: true

That is, Kubernetes nodes belonging to only the Europe region are considered for scraping.

Authenticating Sysdig Agent

The Sysdig Agent requires the necessary permissions on the remote host to scrape for metrics. The authentication methods for local scraping also work for authenticating against remote hosts, but the authorization parameters work only in the agent context.

  • Authentication based on a certificate-key pair requires the pair to be constructed into a Kubernetes secret and mounted into the agent container.

  • In token-based authentication, make sure the agent token has access rights on the remote endpoint to do the scraping.

  • Use annotations to retrieve the username/password instead of passing them in plaintext. Any annotation enclosed in curly braces will be replaced by the value of that annotation; if the annotation doesn't exist, the value will be an empty string. Token substitution is supported for all the authorization parameters. Because authorization works only in the Agent context, credentials cannot be automatically retrieved from the target pod; therefore, use an annotation on the Agent pod to pass them. To do so, set the password in an annotation on the selected Kubernetes object.

In the following example, an HAProxy account is authenticated with the password supplied in the yourhost.com/haproxy_password annotation on the agent node.

- haproxy:
            kubernetes.node.annotation.yourhost.com/haproxy_scraper: true
            conf:
                host: "mymasternode"
                port: 1936
                path: "/metrics"
                username: "{kubernetes.node.annotation.yourhost.com/haproxy_username}"
                password: "{kubernetes.node.annotation.yourhost.com/haproxy_password}"
                tags:
                    service: router

7.2 - (Legacy) Integrate Applications (Default App Checks)

We are sunsetting application checks in favor of Monitoring Integrations.

The Sysdig agent supports additional application monitoring capabilities with application check scripts or ‘app checks’. These are a set of plugins that poll for custom metrics from the specific applications which export them via status or management pages: e.g. NGINX, Redis, MongoDB, Memcached and more.

Many app checks are enabled by default in the agent and when a supported application is found, the correct app check script will be called and metrics polled automatically.

However, if default connection parameters are changed in your application, you will need to modify the app check connection parameters in the Sysdig Agent configuration file (dragent.yaml) to match your application.

In some cases, you may also need to enable the metrics reporting functionality in the application before the agent can poll them.

This page details how to make configuration changes in the agent’s configuration file, and provides an application integration example. Click the Supported Applications links for application-specific details.

Python Version for App Checks:

As of agent version 9.9.0, the default version of Python used for app checks is Python 3.

Python 2 can still be used by setting the following option in your dragent.yaml:

python_binary: <path to python 2.7 binary>

For containerized agents, this path will be: /usr/bin/python2.7
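For example, the corresponding dragent.yaml entry for a containerized agent would be the following sketch, using the path mentioned above:

python_binary: /usr/bin/python2.7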

Edit dragent.yaml to Integrate or Modify Application Checks

Out of the box, the Sysdig agent will gather and report on a wide variety of pre-defined metrics. It can also accommodate any number of custom parameters for additional metrics collection.

The agent relies on a pair of configuration files to define metrics collection parameters:

dragent.default.yaml

The core configuration file. You can look at it to understand more about the default configurations provided.

Location: "/opt/draios/etc/dragent.default.yaml."

CAUTION. This file should never be edited.

dragent.yaml

The configuration file where parameters can be added, either directly in YAML as name/value pairs, or using environment variables such as ADDITIONAL_CONF. Location: "/opt/draios/etc/dragent.yaml"

The “dragent.yaml” file can be accessed and edited in several ways, depending on how the agent was installed.

Review Understanding the Agent Config Files for details.

The examples in this section presume you are entering YAML code directly into dragent.yaml, under the app_checks section.

Find the default settings

To find the default app-checks for already supported applications, check the dragent.default.yaml file.

(Location: /opt/draios/etc/dragent.default.yaml.)

Sample format

app_checks:
  - name: APP_NAME
    check_module: APP_CHECK_SCRIPT
    pattern:
      comm: PROCESS_NAME
    conf:
      host: IP_ADDR
      port: PORT

  • app_checks: The main section of dragent.default.yaml that contains a list of pre-configured checks.

  • name: Every check should have a unique name, which will be displayed on Sysdig Monitor as the process name of the integrated application (e.g. MongoDB).

  • check_module: The name of the Python plugin that polls the data from the designated application (e.g. elastic). All the app check scripts can be found inside the /opt/draios/lib/python/checks.d directory.

  • pattern: This section is used by the Sysdig agent to match a process with a check. Four kinds of keys can be specified, along with any arguments to help distinguish them:

    • comm: Matches the process name as seen in /proc/PID/status.

    • port: Matches based on the port used (e.g. MySQL identified by port: 3306).

    • arg: Matches any process arguments.

    • exe: Matches the process exe as seen in the /proc/PID/exe link.

  • conf: This section is specific to each plugin. You can specify any key/values that the plugin supports:

    • host: Application-specific; a URL or IP address.

    • port: {...} tokens can be used as values, which will be substituted with values from process info.

Change the default settings

To override the defaults:

  1. Copy relevant code blocks from dragent.default.yaml into dragent.yaml . (Or copy the code from the appropriate app check integration page in this documentation section.)

    Any entries copied into dragent.yaml file will override similar entries in dragent.default.yaml.

    Never modify dragent.default.yaml, as it will be overwritten whenever the agent is updated.

  2. Modify the parameters as needed.

    Be sure to use proper YAML. Pay attention to consistent spacing for indents (as shown) and list all check entries under an app_checks: section title.

  3. Save the changes and restart the agent.

    Use service dragent restart (for a native Linux installation) or docker restart sysdig-agent (for a containerized agent).

Metrics for the relevant application should appear in the Sysdig Monitor interface under the appropriate name.

Example 1: Change Name and Add Password

Here is a sample app-check entry for Redis. The app_checks section was copied from the dragent.default.yaml file and modified for a specific instance.

customerid: 831f3-Your-Access-Key-9401
tags: local:sf,acct:dev,svc:db
app_checks:
  - name: redis-6380
    check_module: redisdb
    pattern:
      comm: redis-server
    conf:
      host: 127.0.0.1
      port: PORT
      password: PASSWORD

Edits made:

  • The name to be displayed in the interface

  • A required password.

As the token PORT is used, it will be translated to the actual port where Redis is listening.

Example 2: Increase Polling Interval

The default interval for an application check to be run by the agent is set to every second. You can increase the interval per application check by adding the interval: parameter (under the - name entry) with the number of seconds to wait between runs of the script.

interval: must be put into each app check entry that should run less often; there is no global setting.

Example: Run the NTP check once per minute:

app_checks:
  - name: ntp
    interval: 60
    pattern:
      comm: systemd
    conf:
      host: us.pool.ntp.org

Disabling

Disable a Single Application Check

Sometimes the default configuration shipped with the Sysdig agent does not work for you or you may not be interested in checks for a single application. To turn a single check off, add an entry like this to disable it:

app_checks:
 - name: nginx
   enabled: false

This entry overrides the default configuration of the nginx check, disabling it.

If you are using the ADDITIONAL_CONF parameter to modify your container agent’s configuration, you would add an entry like this to your Docker run command (or Kubernetes manifest):

-e ADDITIONAL_CONF="app_checks:\n  - name: nginx\n    enabled: false\n"

Disable ALL Application Checks

If you do not need it or otherwise want to disable the application check functionality, you can add the following entry to the agent’s user settings configuration file /opt/draios/etc/dragent.yaml:

app_checks_enabled: false

Restart the agent as shown immediately above for either the native Linux agent installation or the container agent installation.

Optional: Configure a Custom App-Check

Sysdig allows custom application check-script configurations to be created for each individual container in the infrastructure, via the environment variable SYSDIG_AGENT_CONF. This avoids the need for multiple edits and entries to achieve the container-specific customization, by enabling application teams to configure their own checks.

The SYSDIG_AGENT_CONF variable stores a YAML-formatted configuration for the app check, and is used to match app-check configurations. It can be stored directly within the Docker file.

The syntax is the same as dragent.yaml syntax.

The example below defines a per container app-check for Redis in the Dockerfile, using the SYSDIG_AGENT_CONF environment variable:

FROM redis
# This config file adds a password for accessing redis instance
ADD redis.conf /

ENV SYSDIG_AGENT_CONF { "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }
ENTRYPOINT ["redis-server"]
CMD [ "/redis.conf" ]

The example below shows how parameters can be added to a container started with docker run, by either using the -e/--env flag, or injecting the parameters using an orchestration system (for example, Kubernetes):

PER_CONTAINER_CONF='{ "app_checks": [{ "name": "redis", "check_module": "redisdb", "pattern": {"comm": "redis-server"}, "conf": { "host": "127.0.0.1", "port": "6379", "password": "protected"} }] }'

docker run --name redis -v /tmp/redis.conf:/etc/redis.conf -e SYSDIG_AGENT_CONF="${PER_CONTAINER_CONF}" -d redis /etc/redis.conf

Metrics Limit

Metric limits are defined by your payment plan. If more metrics are needed please contact your sales representative with your use case.

Note that a metric with the same name but a different tag is counted as a unique metric by the agent. For example, a metric 'user.clicks' with the tag 'country=us' and another 'user.clicks' with the tag 'country=it' are considered two metrics, which both count towards the limit.

Supported Applications

Below is the supported list of applications the agent will automatically poll.

Some app-check scripts will need to be configured since no defaults exist, while some applications may need to be configured to output their metrics. Click a highlighted link to see application-specific notes.

  • Active MQ
  • Apache
  • Apache CouchDB
  • Apache HBase
  • Apache Kafka
  • Apache Zookeeper
  • Consul
  • CEPH
  • Couchbase
  • Elasticsearch
  • etcd
  • fluentd
  • Gearman
  • Go
  • Gunicorn
  • HAProxy
  • HDFS
  • HTTP
  • Jenkins
  • JVM
  • Lighttpd
  • Memcached
  • Mesos/Marathon
  • MongoDB
  • MySQL
  • NGINX and NGINX Plus
  • NTP
  • PGBouncer
  • PHP-FPM
  • Postfix
  • PostgreSQL
  • Prometheus
  • RabbitMQ
  • RedisDB
  • Supervisord
  • SNMP
  • TCP


7.2.1 - Apache

The Apache web server is open-source software for creating, deploying, and managing web servers. If Apache is installed in your environment, the Sysdig agent will connect using the mod_status module on Apache. You may need to edit the default entries in the agent configuration file to connect. See the Default Configuration, below.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Apache Setup

Install mod_status on your Apache servers and enable ExtendedStatus.

The following configuration is required. If it is already present, uncomment the lines; otherwise, add the configuration.

LoadModule status_module modules/mod_status.so
...

<Location /server-status>
    SetHandler server-status
    Order Deny,Allow
    Deny from all
    Allow from localhost
</Location>
...

ExtendedStatus On

Sysdig Agent Configuration

Review how to edit dragent.yaml to Integrate or Modify Application Checks.

Apache has a common default for exposing metrics. The process command name can be either apache2 or httpd. By default, the Sysdig agent looks for the process apache2. If the process is named differently in your environment (e.g. httpd), edit the configuration file to match the process name, as shown in the example below.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Apache and collect all metrics.

app_checks:
  - name: apache
    check_module: apache
    pattern:
      comm: apache2
    conf:
      apache_status_url: "http://localhost:{port}/server-status?auto"
    log_errors: false

Example

If it is necessary to edit dragent.yaml to change the process name, use the following example and update the comm with the value httpd.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

app_checks:
  - name: apache
    check_module: apache
    pattern:
      comm: httpd
    conf:
      apache_status_url: "http://localhost/server-status?auto"
    log_errors: false

Metrics Available

The Apache metrics are listed in the metrics dictionary here: Apache Metrics.

UI Examples

7.2.2 - Apache Kafka

Apache Kafka is a distributed streaming platform. Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, extremely fast, and runs in production in thousands of companies. If Kafka is installed in your environment, the Sysdig agent will automatically connect. See the Default Configuration, below.

The Sysdig agent automatically collects metrics from Kafka via JMX polling. You need to provide consumer names and topics in the agent config file to collect consumer-based Kafka metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Kafka Setup

Kafka will automatically expose all metrics. You do not need to add anything to the Kafka instance.

Zstandard, one of the compression algorithms available in the Kafka integration, is only included in Kafka versions 2.1.0 or newer. See also the Apache documentation.

Sysdig Agent Configuration

Review how to edit dragent.yaml to Integrate or Modify Application Checks.

Metrics from Kafka via JMX polling are already configured in the agent’s default-settings configuration file. Metrics for consumers, however, need to use app checks to poll the Kafka and Zookeeper APIs. You need to provide consumer names and topics in the dragent.yaml file.
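One way to discover which consumer group and topic names to list in dragent.yaml is the consumer-groups tool shipped with Kafka. The script name and location vary by distribution; kafka-consumer-groups.sh and the sample group name below are illustrative.

# List all consumer groups known to the broker.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Show the topics and partitions consumed by a specific group.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group sample-consumer-1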

Default Configuration

Since consumer names and topics are environment-specific, a default configuration is not present in dragent.default.yaml.

Refer to the following examples for adding Kafka checks to dragent.yaml.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1: Basic Configuration

A basic example with sample consumer and topic names:

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_connect_str: "127.0.0.1:9092" # kafka address, usually localhost as we run the check on the same instance
      zk_connect_str: "localhost:2181" # zookeeper address, may be different than localhost
      zk_prefix: /
      consumer_groups:
        sample-consumer-1: # sample consumer name
          sample-topic-1: [0, ] # sample topic name and partitions
        sample-consumer-2: # sample consumer name
          sample-topic-2: [0, 1, 2, 3] # sample topic name and partitions

Example 2: Store Consumer Group Info (Kafka 9+)

From Kafka 9 onwards, you can store consumer group config info inside Kafka itself for better performance.

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_connect_str: "localhost:9092"
      zk_connect_str: "localhost:2181"
      zk_prefix: /
      kafka_consumer_offsets: true
      consumer_groups:
        sample-consumer-1: # sample consumer name
          sample-topic-1: [0, ] # sample topic name and partitions

If the kafka_consumer_offsets entry is set to true, the app check looks for consumer offsets in Kafka. The app check will also look in Kafka if zk_connect_str is not set.

Example 3: Aggregate Partitions at the Topic Level

To enable aggregation of partitions at the topic level, use kafka_consumer_topics with aggregate_partitions: true.

In this case, the app check aggregates the lag and offset values across partitions, reducing the number of metrics collected.

Set aggregate_partitions: false to disable this aggregation. In this case, the app check reports lag and offset values for each partition.

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_connect_str: "localhost:9092"
      zk_connect_str: "localhost:2181"
      zk_prefix: /
      kafka_consumer_offsets: true
      kafka_consumer_topics:
        aggregate_partitions: true
      consumer_groups:
        sample-consumer-1: # sample consumer name
          sample-topic-1: [0, ] # sample topic name and partitions
        sample-consumer-2: # sample consumer name
          sample-topic-2: [0, 1, 2, 3] # sample topic name and partitions

Example 4: Custom Tags

Optional tags can be applied to every emitted metric, service check, and/or event.

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_connect_str: "localhost:9092"
      zk_connect_str: "localhost:2181"
      zk_prefix: /
      consumer_groups:
        sample-consumer-1: # sample consumer name
          sample-topic-1: [0, ] # sample topic name and partitions
    tags:  ["key_first_tag:value_1", "key_second_tag:value_2", "key_third_tag:value_3"]

Example 5: SSL and Authentication

If SSL and authentication are enabled on Kafka, use the following configuration.

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_consumer_offsets: true
      kafka_connect_str: "127.0.0.1:9093"
      zk_connect_str: "localhost:2181"
      zk_prefix: /
      consumer_groups:
        test-group:
          test: [0, ]
          test-4: [0, 1, 2, 3]
      security_protocol: SASL_SSL
      sasl_mechanism: PLAIN
      sasl_plain_username: <USERNAME>
      sasl_plain_password: <PASSWORD>
      ssl_check_hostname: true
      ssl_cafile:  <SSL_CA_FILE_PATH>
      #ssl_context: <SSL_CONTEXT>
      #ssl_certfile: <CERT_FILE_PATH>
      #ssl_keyfile: <KEY_FILE_PATH>
      #ssl_password: <PASSWORD>
      #ssl_crlfile: <SSL_FILE_PATH>

Configuration Keywords and Descriptions

  • security_protocol (str): Protocol used to communicate with brokers. Default: PLAINTEXT

  • sasl_mechanism (str): SASL mechanism to use when security_protocol is SASL_PLAINTEXT or SASL_SSL. Currently only PLAIN is supported.

  • sasl_plain_username (str): Username for SASL PLAIN authentication.

  • sasl_plain_password (str): Password for SASL PLAIN authentication.

  • ssl_context (ssl.SSLContext): Pre-configured SSLContext for wrapping socket connections. If provided, all other ssl_* configurations are ignored. Default: none

  • ssl_check_hostname (bool): Whether the SSL handshake should verify that the certificate matches the broker's hostname. Default: true

  • ssl_cafile (str): Optional filename of a CA file to use in certificate verification. Default: none

  • ssl_certfile (str): Optional filename of a PEM-format file containing the client certificate, as well as any CA certificates needed to establish the certificate's authenticity. Default: none

  • ssl_keyfile (str): Optional filename containing the client private key. Default: none

  • ssl_password (str): Optional password to be used when loading the certificate chain. Default: none

  • ssl_crlfile (str): Optional filename containing the CRL to check for certificate expiration. By default, no CRL check is done. When a file is provided, only the leaf certificate is checked against this CRL. The CRL can only be checked with 2.7.9+. Default: none

Example 6: Regex for Consumer Groups and Topics

As of Sysdig agent version 0.94, the Kafka app check has added optional regex (regular expression) support for Kafka consumer groups and topics.

Regex Configuration:

  • No new metrics are added with this feature.

  • A new parameter, consumer_groups_regex, specifies regex patterns for consumer groups and topics in Kafka. Consumer offsets stored in Zookeeper are not collected.

  • Regex for topics is optional. When it is not provided, all topics under the consumer group are reported.

  • The Python regex syntax is documented here: https://docs.python.org/3.7/library/re.html#regular-expression-syntax

  • If both consumer_groups and consumer_groups_regex are provided at the same time, matched consumer groups from both parameters are merged.

Sample configuration:

app_checks:
  - name: kafka
    check_module: kafka_consumer
    pattern:
      comm: java
      arg: kafka.Kafka
    conf:
      kafka_connect_str: "localhost:9092"
      zk_connect_str: "localhost:2181"
      zk_prefix: /
      kafka_consumer_offsets: true
      # Regex can be provided in following format
      # consumer_groups_regex:
      #   'REGEX_1_FOR_CONSUMER_GROUPS':
      #      - 'REGEX_1_FOR_TOPIC'
      #      - 'REGEX_2_FOR_TOPIC'
      consumer_groups_regex:
        'consumer*':
          - 'topic'
          - '^topic.*'
          - '.*topic$'
          - '^topic.*'
          - 'topic\d+'
          - '^topic_\w+'

Example

  • topic_\d+ : matches strings containing the keyword topic followed by _ and one or more digits ([0-9]).
    Examples matched: my-topic_1, topic_23, topic_5-dev
    Examples not matched: topic_x, my-topic-1, topic-123

  • topic : matches all strings containing the keyword topic.
    Examples matched: topic_x, x_topic123
    Examples not matched: xyz

  • consumer* : matches all strings containing the keyword consumer.
    Examples matched: consumer-1, sample-consumer, sample-consumer-2
    Examples not matched: xyz

  • ^topic_\w+ : matches strings starting with topic followed by _ and one or more word characters ([a-zA-Z0-9_]).
    Examples matched: topic_12, topic_x, topic_xyz_123
    Examples not matched: topic-12, x_topic, topic__xyz

  • ^topic.* : matches strings starting with topic.
    Examples matched: topic-x, topic123
    Examples not matched: x-topic, x_topic123

  • .*topic$ : matches strings ending with topic.
    Examples matched: x_topic, sampletopic
    Examples not matched: topic-1, x_topic123

Metrics Available

Kafka Consumer Metrics (App Checks)

See Apache Kafka Consumer Metrics.

JMX Metrics

See Apache Kafka JMX Metrics.

Result in the Monitor UI

7.2.3 - Consul

Consul is a distributed service mesh to connect, secure, and configure services across any runtime platform and public or private cloud. If Consul is installed on your environment, the Sysdig agent will automatically connect and collect basic metrics. If the Consul Access Control List (ACL) is configured, you may need to edit the default entries to connect. Also, additional latency metrics can be collected by modifying default entries. See the Default Configuration, below.

It’s easy! Sysdig automatically detects metrics from this app based on standard default configurations.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Consul Configuration

Consul is ready to expose metrics without any special configuration.
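As an optional sanity check, you can confirm from the Consul node that the HTTP API the app check polls answers locally. This assumes the default address http://localhost:8500 used in the configuration below.

# Both endpoints should return JSON if the Consul HTTP API is reachable.
curl http://localhost:8500/v1/status/leader
curl http://localhost:8500/v1/catalog/services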

Sysdig Agent Configuration

Review how to edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Consul and collect basic metrics.

app_checks:
  - name: consul
    pattern:
      comm: consul
    conf:
      url: "http://localhost:8500"
      catalog_checks: yes

With the dragent.default.yaml file, the following metrics are available in the Sysdig Monitor UI:

Metrics name
consul.catalog.nodes_critical
consul.catalog.nodes_passing
consul.catalog.nodes_up
consul.catalog.nodes_warning
consul.catalog.total_nodes
consul.catalog.services_critical
consul.catalog.services_passing
consul.catalog.services_up
consul.catalog.services_warning
consul.peers

Additional metrics and events can be collected by adding configuration to the dragent.yaml file. If ACLs are enabled, the ACL token must be provided. See the following examples.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1: Enable Leader Change Event

When self_leader_check is enabled, the node watches for itself to become the leader and emits an event when that happens. It can be enabled on all nodes.

app_checks:
  - name: consul
    pattern:
      comm: consul
    conf:
      url: "http://localhost:8500"
      catalog_checks: yes
      self_leader_check: yes
    logs_enabled: true

Example 2: Enable Latency Metrics

If the network_latency_checks flag is enabled, then the Consul network coordinates will be retrieved and the latency calculated for each node and between data centers.

app_checks:
  - name: consul
    pattern:
      comm: consul
    conf:
      url: "http://localhost:8500"
      catalog_checks: yes
      network_latency_checks: yes
    logs_enabled: true

With the above changes, you can see the following additional metrics:

Metrics name
consul.net.node.latency.min
consul.net.node.latency.p25
consul.net.node.latency.median
consul.net.node.latency.p75
consul.net.node.latency.p90
consul.net.node.latency.p95
consul.net.node.latency.p99
consul.net.node.latency.max

Example 3: Enable ACL Token

When the ACL system is enabled in Consul, the ACL Agent Token must be added to dragent.yaml in order to collect metrics.

Follow Consul’s official documentation to Configure ACL, Bootstrap ACL and Create Agent Token.

app_checks:
  - name: consul
    pattern:
      comm: consul
    conf:
      url: "http://localhost:8500"
      acl_token: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" #Add agent token
      catalog_checks: yes
      logs_enabled: true

Example 4: Collect Metrics from Non-Leader Node

Required: Agent 9.6.0+

With agent 9.6.0, you can use the configuration option single_node_install (optional; default: false). Set this option to true to perform the app check on non-leader Consul nodes as well.

app_checks:
  - name: consul
    pattern:
      comm: consul
    conf:
      url: "http://localhost:8500"
      catalog_checks: yes
      single_node_install: true

StatsD Metrics

In addition to the metrics from the Sysdig app-check, there are many other metrics that Consul can send using StatsD. Those metrics will be automatically collected by the Sysdig agent’s StatsD integration if Consul is configured to send them.

Add statsd_address under telemetry in the Consul config file. The default config file location is /consul/config/local.json.

{
...
  "telemetry": {
     "statsd_address": "127.0.0.1:8125"
  }
...
}

See Telemetry Metrics for more details.

Metrics Available

See Consul Metrics.

Result in the Monitor UI

7.2.4 - Couchbase

Couchbase Server is a distributed, open-source, NoSQL database engine. The core architecture is designed to simplify building modern applications with a flexible data model, high availability, high scalability, high performance, and advanced security. If Couchbase is installed in your environment, the Sysdig agent will automatically connect. If authentication is configured, you may need to edit the default entries to connect. See the Default Configuration, below.

The Sysdig agent automatically collects all bucket and node metrics. You can also edit the configuration to collect query metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Couchbase Setup

Couchbase will automatically expose all metrics. You do not need to configure anything on the Couchbase instance.
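As an optional sanity check, you can confirm that the Couchbase REST API used by the app check is reachable. This assumes the default admin port 8091; replace the placeholder credentials with a cluster administrator or read-only user.

# Returns cluster and bucket summary information as JSON.
curl -u <username>:<password> http://localhost:8091/pools/default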

Sysdig Agent Configuration

Review how to edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Couchbase and collect all bucket and node metrics.

app_checks:
  - name: couchbase
    pattern:
      comm: beam.smp
      arg: couchbase
      port: 8091
    conf:
      server: http://localhost:8091

If authentication is enabled, you need to edit dragent.yaml file to connect with Couchbase. See Example 1.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1: Authentication

Replace <username> and <password> with appropriate values and update the dragent.yaml file.

app_checks:
  - name: couchbase
    pattern:
      comm: beam.smp
      arg: couchbase
      port: 8091
    conf:
      server: http://localhost:8091
      user: <username>
      password: <password>
      # The following block is optional and required only if the 'path' and
      # 'port' need to be set to non-default values specified here
      cbstats:
        port: 11210
        path: /opt/couchbase/bin/cbstats

Example 2: Query Stats

Additionally, you can configure query_monitoring_url to get query monitoring stats. This is available from Couchbase version 4.5. See Query Monitoring for more detail.

app_checks:
  - name: couchbase
    pattern:
      comm: beam.smp
      arg: couchbase
      port: 8091
    conf:
      server: http://localhost:8091
      query_monitoring_url: http://localhost:8093

Metrics Available

See Couchbase Metrics.

Result in the Monitor UI

7.2.5 - Elasticsearch

Elasticsearch is an open-source, distributed, document storage and search engine that stores and retrieves data structures in near real-time. Elasticsearch represents data in the form of structured JSON documents and makes full-text search accessible via RESTful API and web clients for languages like PHP, Python, and Ruby. It’s also elastic in the sense that it’s easy to scale horizontally: simply add more nodes to distribute the load. If Elasticsearch is installed in your environment, the Sysdig agent will automatically connect in most cases. See the Default Configuration, below.

The Sysdig Agent automatically collects default metrics. You can also edit the configuration to collect Primary Shard stats.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Elasticsearch Setup

Elasticsearch is ready to expose metrics without any special configuration.
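As an optional sanity check, you can confirm that the HTTP endpoint the agent queries is reachable. This assumes the default port 9200; add credentials or certificates if security is enabled, as in Example 1 below.

# Should return the cluster name, status, and node counts as JSON.
curl "http://localhost:9200/_cluster/health?pretty"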

Sysdig Agent Configuration

Review how to edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Elasticsearch and collect basic metrics.

app_checks:
  - name: elasticsearch
    check_module: elastic
    pattern:
      port: 9200
      comm: java
    conf:
      url: http://localhost:9200

For more metrics, you may need to change the elasticsearch default setting in dragent.yaml:

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1: Agent Authentication to an Elasticsearch Cluster with Authentication Enabled

Password Authentication

app_checks:
  - name: elasticsearch
    check_module: elastic
    pattern:
      port: 9200
      comm: java
    conf:
      url: https://sysdigcloud-elasticsearch:9200
      username: readonly
      password: some_password
      ssl_verify: false

Certificate Authentication

app_checks:
   - name: elasticsearch
     check_module: elastic
     pattern:
       port: 9200
       comm: java
     conf:
       url: https://localhost:9200
       ssl_cert: /tmp/certs/ssl.crt
       ssl_key: /tmp/certs/ssl.key
       ssl_verify: true

ssl_cert: Path to the certificate chain used for validating the authenticity of the Elasticsearch server.

ssl_key: Path to the certificate key used for authenticating to the Elasticsearch server.

Example 2: Enable Primary shard Statistics

app_checks:
  - name: elasticsearch
    check_module: elastic
    pattern:
      port: 9200
      comm: java
    conf:
      url: http://localhost:9200
      pshard_stats: true

pshard-specific Metrics

Enable pshard_stats to monitor the following additional metrics:

Metric Name
elasticsearch.primaries.flush.total
elasticsearch.primaries.flush.total.time
elasticsearch.primaries.docs.count
elasticsearch.primaries.docs.deleted
elasticsearch.primaries.get.current
elasticsearch.primaries.get.exists.time
elasticsearch.primaries.get.exists.total
elasticsearch.primaries.get.missing.time
elasticsearch.primaries.get.missing.total
elasticsearch.primaries.get.time
elasticsearch.primaries.get.total
elasticsearch.primaries.indexing.delete.current
elasticsearch.primaries.indexing.delete.time
elasticsearch.primaries.indexing.delete.total
elasticsearch.primaries.indexing.index.current
elasticsearch.primaries.indexing.index.time
elasticsearch.primaries.indexing.index.total
elasticsearch.primaries.merges.current
elasticsearch.primaries.merges.current.docs
elasticsearch.primaries.merges.current.size
elasticsearch.primaries.merges.total
elasticsearch.primaries.merges.total.docs
elasticsearch.primaries.merges.total.size
elasticsearch.primaries.merges.total.time
elasticsearch.primaries.refresh.total
elasticsearch.primaries.refresh.total.time
elasticsearch.primaries.search.fetch.current
elasticsearch.primaries.search.fetch.time
elasticsearch.primaries.search.fetch.total
elasticsearch.primaries.search.query.current
elasticsearch.primaries.search.query.time
elasticsearch.primaries.search.query.total
elasticsearch.primaries.store.size

Example 3: Enable Primary shard Statistics for Master Node only

app_checks:
  - name: elasticsearch
    check_module: elastic
    pattern:
      port: 9200
      comm: java
    conf:
      url: http://localhost:9200
      pshard_stats_master_node_only: true

Note that this option takes precedence over the pshard_stats option (above). This means that if the following configuration were put into place, only the pshard_stats_master_node_only option would be respected:

app_checks:
  - name: elasticsearch
    check_module: elastic
    pattern:
      port: 9200
      comm: java
    conf:
      url: http://localhost:9200
      pshard_stats: true
      pshard_stats_master_node_only: true

All Available Metrics

With the default settings and the pshard setting, the total available metrics are listed here: Elasticsearch Metrics.

Result in the Monitor UI

7.2.6 - etcd

etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. If etcd is installed in your environment, the Sysdig agent will automatically connect. If you are using etcd older than version 2, you may need to edit the default entries to connect. See the Default Configuration section, below.

The Sysdig Agent automatically collects all metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

etcd Versions

etcd v2

The app check functionality described on this page supports etcd metrics from APIs that are specific to v2 of etcd.

These APIs are present in etcd v3 as well, but export metrics only for the v2 datastores. For example, after upgrading from etcd v2 to v3, if the v2 datastores are not migrated to v3, the v2 APIs will continue exporting metrics for these datastores. If the v2 datastores are migrated to v3, the v2 APIs will no longer export metrics for these datastores.

etcd v3

etcd v3 uses a native Prometheus exporter. The exporter only exports metrics for v3 datastores. For example, after upgrading from etcd v2 to v3, if v2 datastores are not migrated to v3, the Prometheus endpoint will not export metrics for these datastores. The Prometheus endpoint will only export metrics for datastores migrated to v3 or datastores created after the upgrade to v3.

If your etcd version is v3 or higher, use the information on this page to enable an integration: Integrate Prometheus Metrics.

etcd Setup

etcd will automatically expose all metrics. You do not need to add anything to the etcd instance.
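As an optional sanity check, you can confirm that the v2 statistics API used by the app check is reachable. This assumes etcd listens on the default client port 2379 and that the v2 API is enabled.

# Returns self statistics (leader info, send/receive counts) as JSON.
curl http://localhost:2379/v2/stats/self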

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

The default agent configuration for etcd will look for the application on localhost, port 2379. No customization is required.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with etcd and collect all metrics.

app_checks:
  - name: etcd
    pattern:
      comm: etcd
    conf:
      url: "http://localhost:2379"

etcd (before version 2) does not listen on localhost, so the Sysdig agent will not connect to it automatically. In that case, you may need to edit the dragent.yaml file with the hostname and port. See Example 1.

Alternatively, you can add the option -bind-addr 0.0.0.0:4001 to the etcd command line to allow the agent to connect.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1

You can use {hostname} and {port} as tokens in the conf: section. This is the recommended setting for Kubernetes customers.

app_checks:
  - name: etcd
    pattern:
      comm: etcd
    conf:
      url: "http://{hostname}:{port}"

Alternatively you can specify the real hostname and port.

app_checks:
  - name: etcd
    pattern:
      comm: etcd
    conf:
      url: "http://my_hostname:4000"  #etcd service listening on port 4000

Example 2: SSL/TLS Certificate

If encryption is used, add the appropriate SSL/TLS entries. Provide the correct paths to the SSL/TLS key and certificates used in the etcd configuration in the ssl_keyfile, ssl_certfile, and ssl_ca_certs fields.

app_checks:
  - name: etcd
    pattern:
      comm: etcd
    conf:
      url: "https://localhost:PORT"
      ssl_keyfile:  /etc/etcd/peer.key  # Path to key file
      ssl_certfile: /etc/etcd/peer.crt  # Path to SSL certificate
      ssl_ca_certs: /etc/etcd/ca.crt    # Path to CA certificate
      ssl_cert_validation: True

Metrics Available

See etcd Metrics.

Result in the Monitor UI

7.2.7 - fluentd

Fluentd is an open source data collector that unifies data collection and consumption for better use and understanding of data. Fluentd structures data as JSON as much as possible, to unify all facets of processing log data: collecting, filtering, buffering, and outputting logs across multiple sources and destinations. If Fluentd is installed in your environment, the Sysdig agent will automatically connect. See the Default Configuration section, below. The Sysdig agent automatically collects default metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Fluentd Setup

Fluentd can be installed as a package (.deb, .rpm, etc) depending on the OS flavor, or it can be deployed in a Docker container. Fluentd installation is documented here. For the examples on this page, a .deb package installation is used.

After installing Fluentd, add the following lines to fluentd.conf:

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
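After restarting Fluentd, you can verify that the monitor_agent endpoint is serving plugin metrics; this is the same URL the agent polls by default.

# Should return a JSON document listing the configured plugins and their counters.
curl http://localhost:24220/api/plugins.json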

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Fluentd and collect default metrics.

(If you use a non-standard port for monitor_agent, you can configure it as usual in the agent config file dragent.yaml.)

  - name: fluentd
    pattern:
      comm: fluentd
    conf:
      monitor_agent_url: http://localhost:24220/api/plugins.json

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example

To generate the metric data, it is necessary to generate some logs through an application. In the following example, HTTP is used. (For more information, see Life of a Fluentd event.)

Execute the following command in the Fluentd environment:

$ curl -i -X POST -d 'json={"action":"login","user":2}' http://localhost:8888/test.cycle

Expected output: (Note: Here the status code is 200 OK, as HTTP traffic is successfully generated; it will vary per application.)

HTTP/1.1 200 OK
Content-type: text/plain
Connection: Keep-Alive
Content-length: 0

Metrics Available

See fluentd Metrics.

Result in the Monitor UI

7.2.8 - Go

Golang expvar is the standard interface designed to instrument and expose custom metrics from a Go program via HTTP. In addition to custom metrics, it also exports some metrics out of the box, such as command line arguments, allocation stats, heap stats, and garbage collection metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Go_expvar Setup

You will need to create a custom entry in the user settings config file for your Go application, due to the difficulty in determining if an application is written in Go by looking at process names or arguments. Be sure your app has expvars enabled, which means importing the expvar module and having an HTTP server started from inside your app, as follows:

import (
    ...
    "net/http"
    "expvar"
    ...
)

// If your application has no http server running for the DefaultServeMux,
// you'll have to have a http server running for expvar to use, for example
// by adding the following to your init function
func init() {
    go http.ListenAndServe(":8080", nil)
}

// You can also expose variables that are specific to your application
// See http://golang.org/pkg/expvar/ for more information

var (
    exp_points_processed = expvar.NewInt("points_processed")
)

func processPoints(p RawPoints) {
    points_processed, err := parsePoints(p)
    exp_points_processed.Add(points_processed)
    ...
}

See also the following blog entry: How to instrument Go code with custom expvar metrics.
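With the HTTP server above running, you can verify that the expvar endpoint is exposed (assuming port 8080, as in the snippet):

# Returns a JSON document with memstats, cmdline, and any custom variables
# such as points_processed.
curl http://localhost:8080/debug/vars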

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

No default configuration for Go is provided in the Sysdig agent dragent.default.yaml file. You must edit the agent config file as described in Example 1.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example

Add the following code sample to dragent.yaml to collect Go metrics.

app_checks:
  - name: go-expvar
    check_module: go_expvar
    pattern:
          comm: go-expvar
    conf:
      expvar_url: "http://localhost:8080/debug/vars" # automatically match url using the listening port
      # Add custom metrics if you want
      metrics:
        - path: system.numberOfSeconds
          type: gauge # gauge or rate
          alias: go_expvar.system.numberOfSeconds
        - path: system.lastLoad
          type: gauge
          alias: go_expvar.system.lastLoad
        - path: system.numberOfLoginsPerUser/.* # You can use / to get inside the map and use .* to match any record inside
          type: gauge
        - path: system.allLoad/.*
          type: gauge

Metrics Available

See Go Metrics.

Result in the Monitor UI

7.2.9 - HAProxy

HAProxy provides a high-availability load balancer and proxy server for TCP- and HTTP-based applications which spreads requests across multiple servers.

The Sysdig agent automatically collects haproxy metrics. You can also edit the agent configuration file to collect additional metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

HAProxy Setup

The stats feature must be enabled on your HAProxy instance. This can be done by adding the following entry to the HAProxy configuration file /etc/haproxy/haproxy.cfg

listen stats
  bind :1936
  mode http
  stats enable
  stats hide-version
  stats realm Haproxy\ Statistics
  stats uri /haproxy_stats
  stats auth stats:stats
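To verify the stats endpoint, you can query it with the credentials from the stats auth line above; appending ;csv returns the raw statistics table.

# Returns one CSV row per frontend, backend, and server.
curl -u stats:stats "http://localhost:1936/haproxy_stats;csv"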

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with HAProxy and collect haproxy metrics:

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/
      collect_aggregates_only: True
    log_errors: false

You can get a few additional status metrics by editing the configuration in dragent.yaml, as in the following examples.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example: Collect Status Metrics Per Service

Enable the collect_status_metrics flag to collect the metrics haproxy.count_per_status and haproxy.backend_hosts.

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/haproxy_stats
      collect_aggregates_only: True
      collect_status_metrics: True
    log_errors: false

Example: Collect Status Metrics Per Host

Enable:

  • collect_status_metrics_by_host: Instructs the check to collect status metrics per host instead of per service. This only applies if collect_status_metrics is true.

  • tag_service_check_by_host: When this flag is set, the hostname is also passed with the service check ‘haproxy.backend_up’.

    By default, only the backend name and service name are associated with it.

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/haproxy_stats
      collect_aggregates_only: True
      collect_status_metrics: True
      collect_status_metrics_by_host: True
      tag_service_check_by_host: True
    log_errors: false

Example: Collect HAProxy Stats by UNIX Socket

If you’ve configured HAProxy to report statistics to a UNIX socket, you can set the url in dragent.yaml to the socket’s path (e.g., unix:///var/run/haproxy.sock).

Set up HAProxy Config File

Edit your HAProxy configuration file ( /etc/haproxy/haproxy.cfg ) to add the following lines to the global section:

global
    [snip]
       stats socket /run/haproxy/admin.sock mode 660 level admin
       stats timeout 30s
    [snip]
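After reloading HAProxy, you can check that the socket accepts commands; this assumes socat is installed on the host.

# Prints the same statistics table that the stats page exposes.
echo "show stat" | socat stdio /run/haproxy/admin.sock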

Edit dragent.yaml url

Add the socket URL from the HAProxy config to the dragent.yaml file:

app_checks:
      - name: haproxy
        pattern:
          comm: haproxy
        conf:
          url: unix:///run/haproxy/admin.sock
        log_errors: True

Metrics Available

See HAProxy Metrics.

Example: Enable Service Check

Required: Agent 9.6.0+

enable_service_check: Enables or disables the service check haproxy.backend.up.

When set to false, all service checks are disabled.

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/haproxy_stats
      collect_aggregates_only: true
      enable_service_check: false

Example: Filter Metrics Per Service

Required: Agent 9.6.0+

services_exclude (Optional): Name or regex of services to be excluded.

services_include (Optional): Name or regex of services to be included.

If a service is excluded with services_exclude, it can still be included explicitly by services_include. The following example excludes all services except service_1 and service_2.

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/haproxy_stats
      collect_aggregates_only: true
      services_exclude:
        - ".*"
      services_include:
        - "service_1"
        - "service_2"

Additional Options: active_tag, headers

Required: Agent 9.6.0+

There are two additional configuration options introduced with agent 9.6.0:

  • active_tag (Optional. Default: false):

    Adds tag active to backend metrics that belong to the active pool of connections.

  • headers (Optional):

    Extra headers such as auth-token can be passed along with requests.

app_checks:
  - name: haproxy
    pattern:
      comm: haproxy
      port: 1936
    conf:
      username: stats
      password: stats
      url: http://localhost:1936/haproxy_stats
      collect_aggregates_only: true
      active_tag: true
      headers:
        <HEADER_NAME>: <HEADER_VALUE>
        <HEADER_NAME>: <HEADER_VALUE>

Result in the Monitor UI

7.2.10 - HTTP

The HTTP check monitors HTTP-based applications for URL availability.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

HTTP Setup

You do not need to configure anything on HTTP-based applications for the Sysdig agent to connect.

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

No default entry is present in the dragent.default.yaml for the HTTP check. You need to add an entry in dragent.yaml as shown in following examples.

Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1

First, identify the process pattern (comm:). It must match an actively running process for the HTTP check to work. Sysdig recommends using the process that serves the URL being checked.

If the URL is remote from the agent, use a process that is always running, such as systemd.

Confirm the “comm” value using the following command:

cat /proc/1/comm

Add the following entry to the dragent.yaml file and modify the 'name:', 'comm:', and 'url:' parameters as needed:

app_checks:
  - name: EXAMPLE_WEBSITE
    check_module: http_check
    pattern:
      comm:  systemd
    conf:
      url: https://www.MYEXAMPLE.com

Example 2

There are multiple configuration options available with the HTTP check. A full list is provided in the table following Example 2. These keys should be listed under the conf: section of the configuration in Example 1.

app_checks:
  - name: EXAMPLE_WEBSITE
    check_module: http_check
    pattern:
      comm:  systemd
    conf:
      url: https://www.MYEXAMPLE.com
      # timeout: 1
      #  method: get
      #  data:
      #    <KEY>: <VALUE>
      #  content_match: '<REGEX>'
      #  reverse_content_match: false
      #  username: <USERNAME>
      #  ntlm_domain: <DOMAIN>
      #  password: <PASSWORD>
      #  client_cert: /opt/client.crt
      #  client_key: /opt/client.key
      #  http_response_status_code: (1|2|3)\d\d
      #  include_content: false
      #  collect_response_time: true
      #  disable_ssl_validation: true
      #  ignore_ssl_warning: false
      #  ca_certs: /etc/ssl/certs/ca-certificates.crt
      #  check_certificate_expiration: true
      #  days_warning: <THRESHOLD_DAYS>
      #  check_hostname: true
      #  ssl_server_name: <HOSTNAME>
      #  headers:
      #    Host: alternative.host.example.com
      #    X-Auth-Token: <AUTH_TOKEN>
      #  skip_proxy: false
      #  allow_redirects: true
      #  include_default_headers: true
      #  tags:
      #    - <KEY_1>:<VALUE_1>
      #    - <KEY_2>:<VALUE_2>

  • url: The URL to test.

  • timeout: The time in seconds to allow for a response.

  • method: The HTTP method. This setting defaults to GET, though many other HTTP methods are supported, including POST and PUT.

  • data: The data option is only available when using the POST method. Data should be included as key-value pairs and will be sent in the body of the request.

  • content_match: A string or Python regular expression. The HTTP check will search for this value in the response and will report as DOWN if the string or expression is not found.

  • reverse_content_match: When true, reverses the behavior of the content_match option, i.e. the HTTP check will report as DOWN if the string or expression in content_match IS found. (Default: false)

  • username & password: If your service uses basic authentication, you can provide the username and password here.

  • http_response_status_code: A string or Python regular expression for an HTTP status code. This check will report DOWN for any status code that does not match. This defaults to 1xx, 2xx and 3xx HTTP status codes. For example: 401 or 4\d\d.

  • include_content: When set to true, the check will include the first 200 characters of the HTTP response body in notifications. The default value is false.

  • collect_response_time: By default, the check will collect the response time (in seconds) as the metric network.http.response_time. To disable, set this value to false.

  • disable_ssl_validation: This setting will skip SSL certificate validation and is enabled by default. If you require SSL certificate validation, set this to false. This option is only used when gathering the response time/aliveness from the specified endpoint. Note this setting doesn't apply to the check_certificate_expiration option.

  • ignore_ssl_warning: When SSL certificate validation is enabled (see setting above), this setting allows you to disable security warnings.

  • ca_certs: This setting allows you to override the default certificate path as specified in init_config.

  • check_certificate_expiration: When check_certificate_expiration is enabled, the service check will check the expiration date of the SSL certificate. Note that this will cause the SSL certificate to be validated, regardless of the value of the disable_ssl_validation setting.

  • days_warning: When check_certificate_expiration is enabled, this setting will raise a warning alert when the SSL certificate is within the specified number of days from expiration.

  • check_hostname: When check_certificate_expiration is enabled, this setting will raise a warning if the hostname on the SSL certificate does not match the host of the given URL.

  • headers: This parameter allows you to send additional headers with the request, e.g. X-Auth-Token: <AUTH_TOKEN>.

  • skip_proxy: If set, the check will bypass proxy settings and attempt to reach the check URL directly. This defaults to false.

  • allow_redirects: This setting allows the service check to follow HTTP redirects and defaults to true.

  • tags: A list of arbitrary tags that will be associated with the check.

Metrics Available

HTTP metrics concern response time and SSL certificate expiry information.

See HTTP Metrics.

Service Checks

http.can_connect:

Returns DOWN when any of the following occur:

  • the request to URL times out

  • the response code is 4xx/5xx, or it doesn’t match the pattern provided in the http_response_status_code

  • the response body does not contain the pattern in content_match

  • reverse_content_match is true and the response body does contain the pattern in content_match

  • URI contains https and disable_ssl_validation is false, and the SSL connection cannot be validated

  • Otherwise, returns UP.

Segmentation of the http.can_connect can be done by URL.

http.ssl_cert:

The check returns:

  • DOWN if the URL’s certificate has already expired

  • WARNING if the URL’s certificate expires in less than days_warning days

  • Otherwise, returns UP.

To disable this check, set check_certificate_expiration to false.

Result in the Monitor UI

7.2.11 - Jenkins

Jenkins is an open-source automation server which helps to automate part of the software development process, permitting continuous integration and facilitating the technical aspects of continuous delivery. It supports version control tools (such as Subversion, Git, Mercurial, etc), can execute Apache Ant, Apache Maven and SBT-based projects, and allows shell scripts and Windows batch commands. If Jenkins is installed on your environment, the Sysdig agent will automatically connect and collect all Jenkins metrics. See the Default Configuration section, below.

This page describes the default configuration settings, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Jenkins Setup

Requires the standard Jenkins server setup with one or more Jenkins Jobs running on it.

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Jenkins and collect basic metrics.

  - name: jenkins
    pattern:
      comm: java
      port: 50000
    conf:
      name: default
      jenkins_home: /var/lib/jenkins #this depends on your environment

Jenkins Folders Plugin

By default, the Sysdig agent does not monitor jobs under job folders created using Folders plugin.

Set jobs_folder_depth to monitor these jobs. Job folders are scanned recursively for jobs until the designated folder depth is reached. The default value is 1.

app_checks:
  - name: jenkins
    pattern:
      comm: java
      port: 50000
    conf:
      name: default
      jenkins_home: /var/lib/jenkins
      jobs_folder_depth: 3

Metrics Available

The following metrics will be available only after running one or more Jenkins jobs. They cover queue size, job duration, and job waiting time.

See Jenkins Metrics.

Result in the Monitor UI

7.2.12 - Lighttpd

Lighttpd is a secure, fast, compliant, and very flexible web server that has been optimized for high-performance environments. It has a very low memory footprint compared to other web servers and is efficient with CPU load. Its advanced feature set (FastCGI, CGI, Auth, Output Compression, URL Rewriting, and many more) makes Lighttpd the perfect web server software for every server that suffers load problems. If Lighttpd is installed in your environment, the Sysdig agent will automatically connect. See the Default Configuration section, below. The Sysdig agent automatically collects the default metrics.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

At this time, the Sysdig app check for Lighttpd supports Lighttpd version 1.x.x only.

Lighttpd Setup

For Lighttpd, the status page must be enabled. Add mod_status in the /etc/lighttpd/lighttpd.conf config file:

server.modules = ( ..., "mod_status", ... )

Then configure an endpoint for it. If (for security purposes) you want to open the status page only to users from the local network, add the following lines to the /etc/lighttpd/lighttpd.conf file:

$HTTP["remoteip"] == "127.0.0.1/8" {
    status.status-url = "/server-status"
  }

If you want an endpoint to be open for remote users based on authentication, then the mod_auth module should be enabled in the /etc/lighttpd/lighttpd.conf config file:

server.modules = ( ..., "mod_auth", ... )

Then you can add the auth.require parameter in the /etc/lighttpd/lighttpd.conf config file:

auth.require = ( "/server-status" => ( "method"  => ... , "realm"   => ... , "require" => ... ) )

For more information on the auth.require parameter, see the Lighttpd documentation.

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Lighttpd and collect basic metrics.

app_checks:
  - name: lighttpd
    pattern:
      comm: lighttpd
    conf:
      lighttpd_status_url: "http://localhost:{port}/server-status?auto"
    log_errors: false

Metrics Available

These metrics are supported for Lighttpd version 1.x.x only. Lighttpd version 2.x.x is being built and is NOT ready for use as of this publication.

See Lighttpd Metrics.

Result in the Monitor UI

7.2.13 - Memcached

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from the results of database calls, API calls, or page rendering. If Memcached is installed on your environment, the Sysdig agent will automatically connect. See the Default Configuration section, below. The Sysdig agent automatically collects basic metrics. You can also edit the configuration to collect additional metrics related to items and slabs.

This page describes the default configuration settings, how to edit the configuration to collect additional information, the metrics available for integration, and a sample result in the Sysdig Monitor UI.

Memcached Setup

Memcached will automatically expose all metrics. You do not need to add anything to the Memcached instance.
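As an optional sanity check, you can confirm that Memcached answers the stats command the check relies on (assuming the default port 11211):

# Prints STAT lines followed by END if the server is reachable.
printf 'stats\r\nquit\r\n' | nc localhost 11211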

Sysdig Agent Configuration

Review how to Edit dragent.yaml to Integrate or Modify Application Checks.

Default Configuration

By default, Sysdig’s dragent.default.yaml uses the following code to connect with Memcached and collect basic metrics:

app_checks:
  - name: memcached
    check_module: mcache
    pattern:
      comm: memcached
    conf:
      url: localhost
      port: "{port}"

Additional metrics can be collected by editing Sysdig’s configuration file dragent.yaml. If SASL is enabled, authentication parameters must be added to dragent.yaml.

Remember! Never edit dragent.default.yaml directly; always edit only dragent.yaml.

Example 1: Additional Metrics

memcache.items.* and memcache.slabs.* metrics can be collected by setting flags in the options section, as follows. Either value can be set to false if you do not want to collect those metrics.

app_checks:
  - name: memcached
    check_module: mcache
    pattern:
      comm: memcached
    conf:
      url: localhost
      port: "{port}"
    options:
      items: true       # Default is false
      slabs: true       # Default is false

Example 2: SASL

SASL authentication can be enabled with Memcached (see instructions here). If it is enabled, credentials must be provided in the username and password fields, as shown below.

app_checks:
  - name: memcached
    check_module: mcache
    pattern:
      comm: memcached
    conf:
      url: localhost
      port: "{port}"
      username: <username>
      # Some Memcached versions support <username>@<hostname>.
      # If Memcached is installed as a container, the Memcached container's hostname is used as the username.
      password: <password>

Metrics Available

See Memcached Metrics.

Result in the Monitor UI

7.2.14 - Mesos/Marathon

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments. The Mesos metrics are divided into master and