
Application Integrations

This section displays the ever-growing application integrations library in Sysdig Monitor.

1 - Apache

Metrics, Dashboards, Alerts and more for Apache Integration in Sysdig Monitor.
Apache

This integration is enabled by default.

Versions supported: 2.4

This integration uses a sidecar exporter that is available in UBI or scratch base image.

This integration has 11 metrics.

Timeseries generated: 100 timeseries

List of Alerts

Alert | Description | Format
[Apache] No Instance Up | No instances up | Prometheus
[Apache] Up Time Less Than One Hour | Instance with UpTime less than one hour | Prometheus
[Apache] Time Since Last OK Request More Than One Hour | Time since last OK request higher than one hour | Prometheus
[Apache] High Error Rate | High error rate | Prometheus
[Apache] High Rate Of Busy Workers In Instance | Low workers in open_slot state | Prometheus

List of Dashboards

Apache App Overview

The dashboard provides information on the status of the Apache resources.

List of Metrics

Metric name
apache_accesses_total
apache_connections
apache_cpuload
apache_duration_ms_total
apache_http_last_request_seconds
apache_http_response_codes_total
apache_scoreboard
apache_sent_kilobytes_total
apache_up
apache_uptime_seconds_total
apache_workers

Preparing the Integration

Create Grok Configuration

You need to add the Grok configuration in order to parse Apache logs and get metrics from them.

Install It Directly In Your Cluster

helm install -n Your-Application-Namespace apache-exporter --repo https://sysdiglabs.github.io/integrations-charts --set configmap=true

Download and Apply

You can also download the file and execute the following command:

kubectl -n Your-Application-Namespace apply -f grok-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grok-config
data:
  config.yml: |
    global:
      config_version: 3
    input:
      type: file
      path: /tmp/logs/accesss.log
      fail_on_missing_logfile: false
      readall: true
    imports:
    - type: grok_patterns
      dir: ./patterns
    metrics:
    - type: counter
      name: apache_http_response_codes_total
      help: HTTP requests to Apache
      match: '%{COMMONAPACHELOG}'
      labels:
        code: '{{.response}}'
        method: '{{.verb}}'
    - type: gauge
      name: apache_http_response_bytes_total
      help: Size of HTTP responses
      match: '%{COMMONAPACHELOG}'
      value: '{{.bytes}}'
      cumulative: true
      labels:
        code: '{{.response}}'
        method: '{{.verb}}'
    - type: gauge
      name: apache_http_last_request_seconds
      help: Timestamp of the last HTTP request
      match: '%{COMMONAPACHELOG}'
      value: '{{timestamp "02/Jan/2006:15:04:05 -0700" .timestamp}}'
      labels:
        code: '{{.response}}'
        method: '{{.verb}}'
    server:
      protocol: http    

Check Apache Configuration

Apache provides metrics in its own format via its ServerStatus module. To enable this module, include (or uncomment) the following lines in your Apache configuration file:

LoadModule status_module modules/mod_status.so
<Location "/server-status">
  SetHandler server-status
</Location>
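
Once the module is enabled, you can check that the status endpoint responds. This is a minimal sketch, assuming you have a shell with network access to the Apache instance (hostname and port are illustrative):

# Machine-readable status output (mod_status "?auto" view), the same data the exporter scrapes
curl -s http://localhost:80/server-status?auto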

To configure the Apache server to produce common logs, include (or uncomment) the following in your Apache configuration file:

<IfModule log_config_module>
       LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
       CustomLog /usr/local/apache2/logs/accesss.log common
</IfModule>

Installing

An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/apache-exporter

Monitoring and Troubleshooting Apache

This document describes important metrics and queries that you can use to monitor and troubleshoot Apache.

Tracking metrics status

You can track the Apache metrics status with the following alerts:

Exporter process is not serving metrics

# [Apache] Exporter Process Down
absent(apache_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

These are the default agent jobs for this integration:

- job_name: apache-exporter-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "apache"
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:9117
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
- job_name: apache-grok-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "apache"
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:9144
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

2 - Calico

Metrics, Dashboards, Alerts and more for Calico Integration in Sysdig Monitor.
Calico

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

Versions supported: 3.23.3

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 22 metrics.

Timeseries generated: 838 timeseries

List of Alerts

Alert | Description | Format
[Calico-Node] Dataplane Updates Are Failing and Retrying | The update actions for dataplane are failing and retrying several times | Prometheus
[Calico-Node] IP Set Command Failures | Encountered a number of ipset command failures | Prometheus
[Calico-Node] IP Tables Restore Failures | Encountered a number of iptables restore failures | Prometheus
[Calico-Node] IP Tables Save Failures | Encountered a number of iptables save failures | Prometheus
[Calico-Node] Errors While Logging | Encountered a number of errors while logging | Prometheus
[Calico-Node] Latency Increase in Datastore OnUpdate Call | The duration of datastore OnUpdate calls is increasing | Prometheus
[Calico-Node] Latency Increase in Dataplane Update | Increased response time for dataplane updates | Prometheus
[Calico-Node] Latency Increase in Acquire Iptables Lock | Increased response time acquiring the iptables lock | Prometheus
[Calico-Node] Latency Increase While Listing All the Interfaces during a Resync | Increased response time for interface listing during a resync | Prometheus
[Calico-Node] Latency Increase in Interface Resync | Increased response time for interface resync | Prometheus
[Calico-Node] Fork/Exec Child Processes Results in High Latency | Increased response time for Fork/Exec child processes | Prometheus

List of Dashboards

Calico

The dashboard provides information on the Calico integration.

List of Metrics

Metric name
felix_calc_graph_update_time_seconds
felix_cluster_num_hosts
felix_cluster_num_policies
felix_cluster_num_profiles
felix_exec_time_micros
felix_int_dataplane_addr_msg_batch_size
felix_int_dataplane_apply_time_seconds
felix_int_dataplane_failures
felix_int_dataplane_iface_msg_batch_size
felix_int_dataplane_msg_batch_size
felix_ipset_calls
felix_ipset_errors
felix_ipset_lines_executed
felix_iptables_lines_executed
felix_iptables_lock_acquire_secs
felix_iptables_restore_calls
felix_iptables_restore_errors
felix_iptables_save_calls
felix_iptables_save_errors
felix_log_errors
felix_route_table_list_seconds
felix_route_table_per_iface_sync_seconds

Preparing the Integration

Enable Calico Prometheus Metrics

Calico can expose Prometheus metrics natively; however, this option is not always enabled.

You can use the following command to turn Prometheus metrics on:

kubectl patch felixconfiguration default --type merge --patch '{"spec":{"prometheusMetricsEnabled": true}}'

You should see an output like the one below:

felixconfiguration.projectcalico.org/default patched
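
To double-check the setting, you can read it back from the FelixConfiguration resource. A minimal sketch, assuming the same kubectl access used for the patch command above:

# Print the current Prometheus metrics setting from the default FelixConfiguration
kubectl get felixconfiguration default -o yaml | grep prometheusMetricsEnabled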

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Calico

Here are some interesting metrics and queries to monitor and troubleshoot Calico.

About the Calico User

Hosts

A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.

  • A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.

  • A host endpoint for one specific real interface is configured by setting interfaceName to the name of the interface, for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.

Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.
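
As an illustration of the concepts above, here is a minimal, hypothetical HostEndpoint manifest (the node name, interface, IP, labels, and profile are placeholders, not values from this integration):

apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-eth0           # hypothetical name
  labels:
    environment: production  # labels Calico can match in policy
spec:
  node: node1                # the host running Calico
  interfaceName: eth0        # one specific real interface
  expectedIPs:
  - 192.168.0.1              # optional when interfaceName is set
  profiles:
  - default-profile          # hypothetical profile to inherit labels from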

Profiles

Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.

Each Calico endpoint or host endpoint can be assigned to zero or more profiles.

Policies

If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.

Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift left DevOps environment.
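
For example, a minimal Kubernetes NetworkPolicy of the kind described above might look like the following sketch (the namespace and labels are hypothetical):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: my-app          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: backend           # pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend      # only frontend pods may connect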

Errors

Dataplane Updates Failures and Retries

The dataplane is the foundation of Calico's work. There are three different dataplane types (Linux eBPF, standard Linux iptables, and Windows HNS). The dataplane is responsible for the most important parts of Calico: base networking, network policy, and IP address management. Being aware of possible dataplane errors is therefore a keystone of Calico monitoring.

rate(felix_int_dataplane_failures[5m])

Ipset Command Failures

IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.

For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.
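
A minimal sketch of that workflow, with a hypothetical set name and an example (documentation) address range:

# Create an IP set, add a range to it, and reference the whole set from a single iptables rule
ipset create blocklist hash:net
ipset add blocklist 203.0.113.0/24
iptables -I INPUT -m set --match-set blocklist src -j DROP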

For that reason, we need to monitor failures for this kind of command in Calico.

rate(felix_ipset_errors[5m])

Iptables Save Failures and Iptables Restore Failures

The actual iptables rules are created and customized on the command line with the command iptables for IPv4 and ip6tables for IPv6.

These can be saved in a file with the command iptables-save for IPv4.

Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables

These files can be loaded again with the command iptables-restore for IPv4.

Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables

This is basically the main purpose of Calico, so monitoring failures of these features is very important.

rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])

Latency

The most useful way to report latency is to alert on quantiles.

Calico metrics do not provide buckets; instead, they summarize that information with specific labels. For latency metrics, Calico provides the quantile labels 0.5, 0.9 and 0.99.

Latency in Datastore OnUpdate Call

# Latency in datastore OnUpdate call
felix_calc_graph_update_time_seconds{quantile="0.99"}

# Latency on dataplane update
felix_int_dataplane_apply_time_seconds{quantile="0.99"}

# Latency to acquire the iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}

# Latency to list all the interfaces during a resync
felix_route_table_list_seconds{quantile="0.99"}

Saturation

The way to monitor saturation in Calico is batch size. Here we can analyze three kinds of batches and also analyze them by quantiles.

# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}

# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}

# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}

Traffic

Traffic is one of the four golden signals we have to monitor. In the case of Calico, this means monitoring the core network requests. Ipset and iptables commands are the lowest-level interactions in Calico; Calico generates this traffic whenever it creates, destroys, or updates network policies.

# Number of ipset commands executed.
rate(felix_ipset_calls[5m])

# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])

# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])

# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])

# Number of iptables-save calls.
rate(felix_iptables_save_calls[$__interval])

Agent Configuration

These are the default agent jobs for this integration:

- job_name: 'calico-node-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-node);(.{0}$)
    replacement: calico
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9091
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (felix_calc_graph_update_time_seconds|felix_cluster_num_hosts|felix_cluster_num_policies|felix_cluster_num_profiles|felix_exec_time_micros|felix_int_dataplane_addr_msg_batch_size|felix_int_dataplane_apply_time_seconds|felix_int_dataplane_failures|felix_int_dataplane_iface_msg_batch_size|felix_int_dataplane_msg_batch_size|felix_ipset_calls|felix_ipset_errors|felix_ipset_lines_executed|felix_iptables_lines_executed|felix_iptables_lock_acquire_secs|felix_iptables_restore_calls|felix_iptables_restore_errors|felix_iptables_save_calls|felix_iptables_save_errors|felix_log_errors|felix_route_table_list_seconds|felix_route_table_per_iface_sync_seconds)
    action: keep
- job_name: 'calico-controller-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    separator: ;
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-kube-controllers);(.{0}$)
    replacement: calico-controller
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico-controller"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9094
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

3 - Cassandra

Metrics, Dashboards, Alerts and more for Cassandra Integration in Sysdig Monitor.
Cassandra

This integration is enabled by default.

Versions supported: > v3.x

This integration uses a sidecar exporter that is available in UBI or scratch base image.

This integration has 30 metrics.

Timeseries generated: The JMX-Exporter generates ~850 timeseries (depending on the number of keyspaces and tables).

List of Alerts

Alert | Description | Format
[Cassandra] Compaction Task Pending | There are many Cassandra compaction tasks pending. | Prometheus
[Cassandra] Commitlog Pending Tasks | There are many Cassandra Commitlog tasks pending. | Prometheus
[Cassandra] Compaction Executor Blocked Tasks | There are many Cassandra compaction executor blocked tasks. | Prometheus
[Cassandra] Flush Writer Blocked Tasks | There are many Cassandra flush writer blocked tasks. | Prometheus
[Cassandra] Storage Exceptions | There are storage exceptions in Cassandra node. | Prometheus
[Cassandra] High Tombstones Scanned | There is a high number of tombstones scanned. | Prometheus
[Cassandra] JVM Heap Memory | High JVM Heap Memory. | Prometheus

List of Dashboards

Cassandra

The dashboard provides information on the status of Cassandra.

List of Metrics

Metric name
cassandra_bufferpool_misses_total
cassandra_bufferpool_size_total
cassandra_client_connected_clients
cassandra_client_request_read_latency
cassandra_client_request_read_timeouts
cassandra_client_request_read_unavailables
cassandra_client_request_write_latency
cassandra_client_request_write_timeouts
cassandra_client_request_write_unavailables
cassandra_commitlog_completed_tasks
cassandra_commitlog_pending_tasks
cassandra_commitlog_total_size
cassandra_compaction_compacted_bytes_total
cassandra_compaction_completed_tasks
cassandra_compaction_pending_tasks
cassandra_cql_prepared_statements_executed_total
cassandra_cql_regular_statements_executed_total
cassandra_dropped_messages_mutation
cassandra_dropped_messages_read
cassandra_jvm_gc_collection_count
cassandra_jvm_gc_duration_seconds
cassandra_jvm_memory_usage_max_bytes
cassandra_jvm_memory_usage_used_bytes
cassandra_storage_internal_exceptions_total
cassandra_storage_load_bytes_total
cassandra_table_read_requests_per_second
cassandra_table_tombstoned_scanned
cassandra_table_total_disk_space_used
cassandra_table_write_requests_per_second
cassandra_threadpool_blocked_tasks_total

Preparing the Integration

Create ConfigMap for the JMX-Exporter

The JMX-Exporter requires a ConfigMap with the Cassandra JMX configurations, which can be easily installed using a simple command. The following example is for a Cassandra cluster that exposes the JMX port 7199 and is deployed in the ‘cassandra’ namespace (modify the JMX port and the namespace as per your needs):

helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts 
helm repo update
helm -n cassandra install cassandra-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=7199 --set integrationType=cassandra --set onlyCreateJMXConfigMap=true

Installing

An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/jmx-exporter

Monitoring and Troubleshooting Cassandra

Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.

General Stats

Node Down

Let’s get the expected number of nodes and the actual number of nodes up and running. If the numbers do not match, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0

Dropped Messages

Dropped Messages Mutation

If there are dropped mutation messages then we probably have write/read failures due to timeouts.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)
Dropped Messages Read
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)

Buffer Pool

Buffer Pool Size

This buffer is allocated as off-heap in addition to the memory allocated for heap. Memory is allocated when needed. Check if miss rate is high.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)
Buffer Pool Misses
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)

CQL Statements

CQL Prepared Statements

Use prepared statements (queries with bound variables) as they are more secure and can be cached.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))
CQL Regular Statements

This value should be as low as possible if you are looking for good performance.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
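
To see how much of your traffic still uses regular (non-prepared) statements, you can combine the two counters. A sketch using a fixed 5m window instead of $__interval:

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[5m]))
/
(sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[5m]))
+ sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[5m])))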

Connected Clients

The number of current client connections in each node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)

Client Request Latency

Write Latency

95th percentile client request write latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})
Read Latency

95th percentile client request read latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})

Unavailable Exceptions

Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.

Read Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))
Write Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))

Client Request Timeouts

Write / read request timeouts in Cassandra nodes. If there are timeouts, check for:

1. Check the ‘read_request_timeout_in_ms’ value in cassandra.yaml in case it is too low.
2. Check for tombstones, which can degrade performance. You can find the tombstones query below.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
Client Request Read Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))
Client Request Write Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))

Threadpool Blocked Tasks

Compaction Blocked Tasks

Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))
Flush Writer Blocked Tasks

The flush writers setting defines the number of parallel flushes to disk. This value should be near 0. Check that your “memtable_flush_writers” value matches your number of cores if you are using SSD disks.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))

Compactions

Pending Compactions

Compactions that are queued. This value should be as low as possible. If it reaches more than 50, you can start having CPU and memory pressure.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)
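
Following the guideline above, an alert-style version of this query could simply add a threshold (50 is the illustrative value mentioned above, not an official default):

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks) > 50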
Total Size Compacted

Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))

Commit Log

Commit Log Pending Tasks

This value should be under 15-20 for performance purposes.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)

Storage

Storage Exceptions

Look carefully at this value as any storage error over 0 is critical for Cassandra.

sum  by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)

JVM and GC

JVM Heap Usage

If you want to tune your Heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})

If you want to know the maximum heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})
JVM NonHeap Usage

Use this query for NonHeap memory.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})
GC Info

If there is memory pressure the max GC duration will start increasing.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)

Keyspaces and Tables

Keyspace Size

This query gives you information about all keyspaces.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_table_total_disk_space_used)
Table Size

This query gives you information about all tables.

Table Highest Increase Size

Very useful to know what tables are growing too fast.

topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))
Tombstones Scanned

Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.

A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.

sum  by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
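
Using the 1000-tombstone guideline above, an alert-style variant of this query could be:

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned) > 1000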

Agent Configuration

This is the default agent job for this integration:

- job_name: 'cassandra-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (cassandra-exporter);(.{0}$)
    replacement: cassandra
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "cassandra"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (cassandra_bufferpool_misses_total|cassandra_bufferpool_size_total|cassandra_client_connected_clients|cassandra_client_request_read_latency|cassandra_client_request_read_timeouts|cassandra_client_request_read_unavailables|cassandra_client_request_write_latency|cassandra_client_request_write_timeouts|cassandra_client_request_write_unavailables|cassandra_commitlog_completed_tasks|cassandra_commitlog_pending_tasks|cassandra_commitlog_total_size|cassandra_compaction_compacted_bytes_total|cassandra_compaction_completed_tasks|cassandra_compaction_pending_tasks|cassandra_cql_prepared_statements_executed_total|cassandra_cql_regular_statements_executed_total|cassandra_dropped_messages_mutation|cassandra_dropped_messages_read|cassandra_jvm_gc_collection_count|cassandra_jvm_gc_duration_seconds|cassandra_jvm_memory_usage_max_bytes|cassandra_jvm_memory_usage_used_bytes|cassandra_storage_internal_exceptions_total|cassandra_storage_load_bytes_total|cassandra_table_read_requests_per_second|cassandra_table_tombstoned_scanned|cassandra_table_total_disk_space_used|cassandra_table_write_requests_per_second|cassandra_threadpool_blocked_tasks_total)
    action: keep 

4 - Ceph

Metrics, Dashboards, Alerts and more for Ceph Integration in Sysdig Monitor.
Ceph

This integration is enabled by default.

Versions supported: > v15.2.12

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 24 metrics.

Timeseries generated: 600 timeseries

List of Alerts

Alert | Description | Format
[Ceph] Ceph Manager is absent | Ceph Manager has disappeared from Prometheus target discovery. | Prometheus
[Ceph] Ceph Manager is missing replicas | Ceph Manager is missing replicas. | Prometheus
[Ceph] Ceph quorum at risk | Storage cluster quorum is low. Contact Support. | Prometheus
[Ceph] High number of leader changes | Ceph Monitor has seen a lot of leader changes per minute recently. | Prometheus

List of Dashboards

Ceph

The dashboard provides information on the status, capacity, latency and throughput of Ceph.

List of Metrics

Metric name
ceph_cluster_total_bytes
ceph_cluster_total_used_bytes
ceph_health_status
ceph_mgr_status
ceph_mon_metadata
ceph_mon_num_elections
ceph_mon_quorum_status
ceph_osd_apply_latency_ms
ceph_osd_commit_latency_ms
ceph_osd_in
ceph_osd_metadata
ceph_osd_numpg
ceph_osd_op_r
ceph_osd_op_r_latency_count
ceph_osd_op_r_latency_sum
ceph_osd_op_r_out_bytes
ceph_osd_op_w
ceph_osd_op_w_in_bytes
ceph_osd_op_w_latency_count
ceph_osd_op_w_latency_sum
ceph_osd_recovery_bytes
ceph_osd_recovery_ops
ceph_osd_up
ceph_pool_max_avail

Preparing the Integration

Enable Prometheus Module

Ceph instruments Prometheus metrics and annotates the manager pod with Prometheus annotations.

Make sure that the Prometheus module is activated in the Ceph cluster by running the following command:

ceph mgr module enable prometheus
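
You can confirm the module is active by listing the manager modules (a minimal check; the exact output format varies by Ceph release):

ceph mgr module ls | grep prometheus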

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Ceph

This document describes important metrics and queries that you can use to monitor and troubleshoot Ceph.

Tracking metrics status

You can track the Ceph metrics status with the following alerts:

Exporter process is not serving metrics

# [Ceph] Exporter Process Down
absent(ceph_health_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

This is the default agent job for this integration:

- job_name: ceph-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_prometheus_io_port
    regex: mgr;9283
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

5 - Consul

Metrics, Dashboards, Alerts and more for Consul Integration in Sysdig Monitor.
Consul

This integration is enabled by default.

Versions supported: > 1.11.1

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 64 metrics.

Timeseries generated: 1800 timeseries

List of Alerts

Alert | Description | Format
[Consul] KV Store update time anomaly | KV Store update time anomaly | Prometheus
[Consul] Transaction time anomaly | Transaction time anomaly | Prometheus
[Consul] Raft transactions count anomaly | Raft transactions count anomaly | Prometheus
[Consul] Raft commit time anomaly | Raft commit time anomaly | Prometheus
[Consul] Leader time to contact followers too high | Leader time to contact followers too high | Prometheus
[Consul] Flapping leadership | Flapping leadership | Prometheus
[Consul] Too many elections | Too many elections | Prometheus
[Consul] Server cluster unhealthy | Server cluster unhealthy | Prometheus
[Consul] Zero failure tolerance | Zero failure tolerance | Prometheus
[Consul] Client RPC requests anomaly | Consul client RPC requests anomaly | Prometheus
[Consul] Client RPC requests rate limit exceeded | Consul client RPC requests rate limit exceeded | Prometheus
[Consul] Client RPC requests failed | Consul client RPC requests failed | Prometheus
[Consul] License Expiry | Consul License Expiry | Prometheus
[Consul] Garbage Collection pause high | Consul Garbage Collection pause high | Prometheus
[Consul] Garbage Collection pause too high | Consul Garbage Collection pause too high | Prometheus
[Consul] Raft restore duration too high | Consul Raft restore duration too high | Prometheus
[Consul] RPC requests error rate is high | Consul RPC requests error rate is high | Prometheus
[Consul] Cache hit rate is low | Consul Cache hit rate is low | Prometheus
[Consul] High 4xx RequestError Rate | High 4xx RequestError Rate | Prometheus
[Consul] High Request Latency | Envoy High Request Latency | Prometheus
[Consul] High Response Latency | Envoy High Response Latency | Prometheus
[Consul] Certificate close to expire | Certificate close to expire | Prometheus

List of Dashboards

Consul

The dashboard provides information on the status and latency of Consul.

Consul Envoy

The dashboard provides information on the Consul Envoy proxies.

List of Metrics

Metric name
consul_autopilot_failure_tolerance
consul_autopilot_healthy
consul_client_rpc
consul_client_rpc_exceeded
consul_client_rpc_failed
consul_consul_cache_bypass
consul_consul_cache_entries_count
consul_consul_cache_evict_expired
consul_consul_cache_fetch_error
consul_consul_cache_fetch_success
consul_kvs_apply_sum
consul_raft_apply
consul_raft_commitTime_sum
consul_raft_fsm_lastRestoreDuration
consul_raft_leader_lastContact
consul_raft_leader_oldestLogAge
consul_raft_rpc_installSnapshot
consul_raft_state_candidate
consul_raft_state_leader
consul_rpc_cross_dc
consul_rpc_queries_blocking
consul_rpc_query
consul_rpc_request
consul_rpc_request_error
consul_runtime_gc_pause_ns
consul_runtime_gc_pause_ns_sum
consul_system_licenseExpiration
consul_txn_apply_sum
envoy_cluster_membership_change
envoy_cluster_membership_healthy
envoy_cluster_membership_total
envoy_cluster_upstream_cx_active
envoy_cluster_upstream_cx_connect_ms_bucket
envoy_cluster_upstream_rq_active
envoy_cluster_upstream_rq_pending_active
envoy_cluster_upstream_rq_time_bucket
envoy_cluster_upstream_rq_xx
envoy_server_days_until_first_cert_expiring
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds

Preparing the Integration

Enable Prometheus Metrics and Disable Hostname in Metrics

As described in the Consul documentation pages Helm Global Metrics and Prometheus Retention Time, to make Consul expose an endpoint for scraping metrics you need to enable a few global.metrics configurations. You also need to set telemetry.disable_hostname in the “extra configurations” of the Consul server and client, so that the metrics don’t contain the instance names.

If you install Consul with Helm, you need to use the following flags:

--set 'global.metrics.enabled=true'
--set 'global.metrics.enableAgentMetrics=true'
--set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"'
--set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
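
Putting the flags together, here is a minimal installation sketch assuming the official HashiCorp Helm repository and chart (the release name and namespace are placeholders):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm install consul hashicorp/consul --namespace consul --create-namespace \
  --set 'global.metrics.enabled=true' \
  --set 'global.metrics.enableAgentMetrics=true' \
  --set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"' \
  --set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'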

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Consul

This document describes important metrics and queries that you can use to monitor and troubleshoot Consul.

Tracking metrics status

You can track the Consul metrics status with the following alerts:

Exporter process is not serving metrics

# [Consul] Exporter Process Down
absent(consul_autopilot_healthy{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Exporter process is not serving metrics

# [Consul] Exporter Process Down
absent(envoy_cluster_upstream_cx_active{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

These are the default agent jobs for this integration:

- job_name: 'consul-server-default'
  metrics_path: '/v1/agent/metrics'
  params:
    format: ['prometheus']
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (consul);(.{0}$)
    replacement: consul
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "consul"
  - action: keep
    source_labels: [__address__]
    regex: (.*:8500)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
- job_name: 'consul-envoy-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (envoy-sidecar);(.{0}$)
    replacement: consul
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "consul"
  - action: replace
    source_labels: [__address__]
    regex: (.+?)(\\:\\d)?
    replacement: $1:20200
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: (envoy_cluster_upstream_cx_active|envoy_cluster_upstream_rq_active|envoy_cluster_upstream_rq_pending_active|envoy_cluster_membership_total|envoy_cluster_membership_healthy|envoy_cluster_membership_change|envoy_cluster_upstream_rq_xx|envoy_cluster_upstream_cx_connect_ms_bucket|envoy_server_days_until_first_cert_expiring|envoy_cluster_upstream_rq_time_bucket)
      action: keep

6 - Elasticsearch

Metrics, Dashboards, Alerts and more for Elasticsearch Integration in Sysdig Monitor.
Elasticsearch

This integration is enabled by default.

Versions supported: > v6.8

This integration uses a standalone exporter that is available in UBI or scratch base image.

This integration has 28 metrics.

Timeseries generated: 400 timeseries

List of Alerts

Alert | Description | Format
[Elasticsearch] Heap Usage Too High | The heap usage is over 90% | Prometheus
[Elasticsearch] Heap Usage Warning | The heap usage is over 80% | Prometheus
[Elasticsearch] Disk Space Low | Disk available less than 20% | Prometheus
[Elasticsearch] Disk Out Of Space | Disk available less than 10% | Prometheus
[Elasticsearch] Cluster Red | Cluster in Red status | Prometheus
[Elasticsearch] Cluster Yellow | Cluster in Yellow status | Prometheus
[Elasticsearch] Relocation Shards | Relocating shards for too long | Prometheus
[Elasticsearch] Initializing Shards | Initializing shards takes too long | Prometheus
[Elasticsearch] Unassigned Shards | Unassigned shards for long time | Prometheus
[Elasticsearch] Pending Tasks | Elasticsearch has a high number of pending tasks | Prometheus
[Elasticsearch] No New Documents | Elasticsearch has no new documents for a period of time | Prometheus

List of Dashboards

ElasticSearch Cluster

The dashboard provides information on the status of the ElasticSearch cluster health and its usage of resources.

ElasticSearch Infra

The dashboard provides information on the usage of CPU, memory, disk and networking of ElasticSearch.

List of Metrics

Metric name
elasticsearch_cluster_health_active_primary_shards
elasticsearch_cluster_health_active_shards
elasticsearch_cluster_health_initializing_shards
elasticsearch_cluster_health_number_of_data_nodes
elasticsearch_cluster_health_number_of_nodes
elasticsearch_cluster_health_number_of_pending_tasks
elasticsearch_cluster_health_relocating_shards
elasticsearch_cluster_health_status
elasticsearch_cluster_health_unassigned_shards
elasticsearch_filesystem_data_available_bytes
elasticsearch_filesystem_data_size_bytes
elasticsearch_indices_docs
elasticsearch_indices_indexing_index_time_seconds_total
elasticsearch_indices_indexing_index_total
elasticsearch_indices_merges_total_time_seconds_total
elasticsearch_indices_search_query_time_seconds
elasticsearch_indices_store_throttle_time_seconds_total
elasticsearch_jvm_gc_collection_seconds_count
elasticsearch_jvm_gc_collection_seconds_sum
elasticsearch_jvm_memory_committed_bytes
elasticsearch_jvm_memory_max_bytes
elasticsearch_jvm_memory_used_bytes
elasticsearch_os_load1
elasticsearch_os_load15
elasticsearch_os_load5
elasticsearch_process_cpu_percent
elasticsearch_transport_rx_size_bytes_total
elasticsearch_transport_tx_size_bytes_total

Preparing the Integration

Create the Secrets

Keep in mind:

  • If your ElasticSearch cluster is using basic authentication, the secret that contains the url must have the user and password.
  • The secrets need to be created in the same namespace where the exporter will be deployed.
  • Use the same user name and password that you used for the API.
  • You can change the name of the secret. If you do this, you will need to select it in the next steps of the integration.

Create the Secret for the URL

Without Authentication
kubectl -n Your-Application-Namespace create secret generic elastic-url-secret \
  --from-literal=url='http://SERVICE:PORT'
With Basic Auth
kubectl -n Your-Application-Namespace create secret generic elastic-url-secret \
  --from-literal=url='https://USERNAME:PASSWORD@SERVICE:PORT'

NOTE: You can use either http or https in the URL.

Create the Secret for the TLS Certs

If you are using HTTPS with custom certificates, follow the instructions given below.

kubectl create -n Your-Application-Namespace secret generic elastic-tls-secret \
  --from-file=root-ca.crt=/path/to/tls/ca-cert \
  --from-file=root-ca.key=/path/to/tls/ca-key \
  --from-file=root-ca.pem=/path/to/tls/ca-pem

Installing

An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/elasticsearch-exporter

Monitoring and Troubleshooting Elasticsearch

This document describes important metrics and queries that you can use to monitor and troubleshoot Elasticsearch.

Tracking metrics status

You can track the Elasticsearch metrics status with the following alerts:

Exporter process is not serving metrics

# [Elasticsearch] Exporter Process Down
absent(elasticsearch_cluster_health_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Exporter process is not serving metrics

# [Elasticsearch] Exporter Process Down
absent(elasticsearch_process_cpu_percent{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

This is the default agent job for this integration:

- job_name: elasticsearch-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "elasticsearch"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
    target_label: kube_workload_type
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
    target_label: kube_workload_name
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (elasticsearch_cluster_health_active_primary_shards|elasticsearch_cluster_health_active_shards|elasticsearch_cluster_health_initializing_shards|elasticsearch_cluster_health_number_of_data_nodes|elasticsearch_cluster_health_number_of_nodes|elasticsearch_cluster_health_number_of_pending_tasks|elasticsearch_cluster_health_relocating_shards|elasticsearch_cluster_health_status|elasticsearch_cluster_health_unassigned_shards|elasticsearch_filesystem_data_available_bytes|elasticsearch_filesystem_data_size_bytes|elasticsearch_indices_docs|elasticsearch_indices_indexing_index_time_seconds_total|elasticsearch_indices_indexing_index_total|elasticsearch_indices_merges_total_time_seconds_total|elasticsearch_indices_search_query_time_seconds|elasticsearch_indices_store_throttle_time_seconds_total|elasticsearch_jvm_gc_collection_seconds_count|elasticsearch_jvm_gc_collection_seconds_sum|elasticsearch_jvm_memory_committed_bytes|elasticsearch_jvm_memory_max_bytes|elasticsearch_jvm_memory_pool_peak_used_bytes|elasticsearch_jvm_memory_used_bytes|elasticsearch_os_load1|elasticsearch_os_load15|elasticsearch_os_load5|elasticsearch_process_cpu_percent|elasticsearch_transport_rx_size_bytes_total|elasticsearch_transport_tx_size_bytes_total)
    action: keep

7 - Fluentd

Metrics, Dashboards, Alerts and more for Fluentd Integration in Sysdig Monitor.
Fluentd

This integration is enabled by default.

Versions supported: > v1.12.4

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 12 metrics.

Timeseries generated: 640 timeseries

List of Alerts

Alert | Description | Format
[Fluentd] No Input From Container | No Input From Container. This alert does not work in OpenShift. | Prometheus
[Fluentd] High Error Ratio | High Error Ratio. | Prometheus
[Fluentd] High Retry Ratio | High Retry Ratio. | Prometheus
[Fluentd] High Retry Wait | High Retry Wait. | Prometheus
[Fluentd] Low Buffer Available Space | Low Buffer Available Space. | Prometheus
[Fluentd] Buffer Queue Length Increasing | Buffer Queue Length Increasing. | Prometheus
[Fluentd] Buffer Total Bytes Increasing | Buffer Total Bytes Increasing. | Prometheus
[Fluentd] High Slow Flush Ratio | High Slow Flush Ratio. | Prometheus
[Fluentd] No Output Records From Plugin | No Output Records From Plugin. | Prometheus

List of Dashboards

Fluentd

The dashboard provides information on the status of Fluentd.

List of Metrics

Metric name
fluentd_input_status_num_records_total
fluentd_output_status_buffer_available_space_ratio
fluentd_output_status_buffer_queue_length
fluentd_output_status_buffer_total_bytes
fluentd_output_status_emit_count
fluentd_output_status_emit_records
fluentd_output_status_flush_time_count
fluentd_output_status_num_errors
fluentd_output_status_retry_count
fluentd_output_status_retry_wait
fluentd_output_status_rollback_count
fluentd_output_status_slow_flush_count

Preparing the Integration

OpenShift

If you have installed Fluentd using the OpenShift Logging Operator, no further action is required to enable monitoring.

Kubernetes

Enable Prometheus Metrics

For Fluentd to expose Prometheus metrics, enable the following plugins:

  • ‘prometheus’ input plugin
  • ‘prometheus_monitor’ input plugin
  • ‘prometheus_output_monitor’ input plugin

As seen in the official plugin documentation, you can enable them with the following configurations:

<source>
    @type prometheus
    @id in_prometheus
    bind "0.0.0.0"
    port 24231
    metrics_path "/metrics"
</source>

<source>
    @type prometheus_monitor
    @id in_prometheus_monitor
</source>

<source>
    @type prometheus_output_monitor
    @id in_prometheus_output_monitor
</source>

If you are deploying Fluentd using the official Helm chart, it already has these plugins enabled by default in its configuration, so no additional actions are needed.
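
For reference, here is a minimal sketch of installing that chart, assuming the fluent community Helm repository (the release name and namespace are placeholders):

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluentd fluent/fluentd --namespace logging --create-namespace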

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Fluentd

This document describes important metrics and queries that you can use to monitor and troubleshoot Fluentd.

Tracking metrics status

You can track the Fluentd metrics status with the following alerts:

Exporter process is not serving metrics

# [Fluentd] Exporter Process Down
absent(fluentd_output_status_buffer_available_space_ratio{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

These are the default agent jobs for this integration:

- job_name: 'fluentd-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (fluentd);(.{0}$)
    replacement: fluentd
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "fluentd"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - action: replace
    source_labels: 
    - __name__
    - tag
    regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
    target_label: input_pod
    replacement: $1
  - action: replace
    source_labels: 
    - __name__
    - tag
    regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
    target_label: input_namespace
    replacement: $2
  - action: replace
    source_labels: 
    - __name__
    - tag
    regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
    target_label: input_container
    replacement: $3

    
- job_name: openshift-fluentd-default
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (collector);(.{0}$)
    replacement: collector
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "collector"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (fluentd_output_status_buffer_available_space_ratio|fluentd_output_status_buffer_queue_length|fluentd_output_status_buffer_total_bytes|fluentd_output_status_emit_count|fluentd_output_status_emit_records|fluentd_output_status_flush_time_count|fluentd_output_status_num_errors|fluentd_output_status_retry_count|fluentd_output_status_retry_wait|fluentd_output_status_rollback_count|fluentd_output_status_slow_flush_count)
    action: keep        

8 - Go

Metrics, Dashboards, Alerts and more for Go Integration in Sysdig Monitor.
Go

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration has 26 metrics.

List of Alerts

Alert | Description | Format
[Go] Slow Garbage Collector | Garbage collector took too long. | Prometheus
[Go] Few Free File Descriptors | Few free file descriptors. | Prometheus

List of Dashboards

Go Internals

The dashboard provides information on the Go integration. Go Internals

List of Metrics

Metric name
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This integration has no default agent job.

9 - HAProxy Ingress

Metrics, Dashboards, Alerts and more for HAProxy Ingress Integration in Sysdig Monitor.
HAProxy Ingress

This integration is enabled by default.

Versions supported: > v0.13

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 31 metrics.

Timeseries generated: 150 × number of ingress pods, plus 50 × number of ingress pods × number of ingress resources. For example, 3 ingress pods serving 10 ingress resources generate roughly 150×3 + 50×3×10 = 1,950 timeseries.

List of Alerts

Alert | Description | Format
[Haproxy-Ingress] Uptime less than 1 hour | This alert detects when all of the instances of the ingress controller have an uptime of less than 1 hour. | Prometheus
[Haproxy-Ingress] Frontend Down | This alert detects when a frontend has all of its instances down for more than 10 minutes. | Prometheus
[Haproxy-Ingress] Backend Down | This alert detects when a backend has all of its instances down for more than 10 minutes. | Prometheus
[Haproxy-Ingress] High Sessions Usage | This alert triggers when the backend sessions exceed 85% of the session capacity for 10 minutes. | Prometheus
[Haproxy-Ingress] High Error Rate | This alert triggers when there is an error rate over 15% for over 10 minutes in a proxy. | Prometheus
[Haproxy-Ingress] High Request Denied Rate | This alert detects when there is a denied rate of requests over 10% for over 10 minutes in a proxy. | Prometheus
[Haproxy-Ingress] High Response Denied Rate | This alert detects when there is a denied rate of responses over 10% for over 10 minutes in a proxy. | Prometheus
[Haproxy-Ingress] High Response Rate | This alert triggers when a proxy has a mean response time higher than 250ms for over 10 minutes. | Prometheus

List of Dashboards

HAProxy Ingress Overview

The dashboard provides information on the HAProxy Ingress Overview. HAProxy Ingress Overview

HAProxy Ingress Service Details

The dashboard provides information on the HAProxy Ingress Service Details. HAProxy Ingress Service Details

List of Metrics

Metric name
haproxy_backend_bytes_in_total
haproxy_backend_bytes_out_total
haproxy_backend_client_aborts_total
haproxy_backend_connect_time_average_seconds
haproxy_backend_current_queue
haproxy_backend_http_requests_total
haproxy_backend_http_responses_total
haproxy_backend_limit_sessions
haproxy_backend_queue_time_average_seconds
haproxy_backend_requests_denied_total
haproxy_backend_response_time_average_seconds
haproxy_backend_responses_denied_total
haproxy_backend_sessions_total
haproxy_backend_status
haproxy_frontend_bytes_in_total
haproxy_frontend_bytes_out_total
haproxy_frontend_connections_total
haproxy_frontend_denied_connections_total
haproxy_frontend_denied_sessions_total
haproxy_frontend_request_errors_total
haproxy_frontend_requests_denied_total
haproxy_frontend_responses_denied_total
haproxy_frontend_status
haproxy_process_active_peers
haproxy_process_current_connection_rate
haproxy_process_current_run_queue
haproxy_process_current_session_rate
haproxy_process_current_tasks
haproxy_process_jobs
haproxy_process_ssl_connections_total
haproxy_process_start_time_seconds

Preparing the Integration

Enable Prometheus Metrics

For HAProxy to expose Prometheus metrics, the following options must be enabled:

  • controller.metrics.enabled = true
  • controller.stats.enabled = true

You can check all the properties in the official web page.

If you are deploying HAProxy Ingress using the official Helm chart, these options can be enabled with the following configuration:

helm install haproxy-ingress haproxy-ingress/haproxy-ingress \
--set-string "controller.stats.enabled=true" \
--set-string "controller.metrics.enabled=true"
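
Equivalently, if you manage the chart with a values file, a minimal values.yaml sketch for these two options would be:

controller:
  stats:
    enabled: true
  metrics:
    enabled: true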

Either way, this configuration creates the following section in the haproxy.cfg file:

frontend prometheus
    mode http
    bind :9101
    http-request use-service prometheus-exporter if { path /metrics }
    http-request use-service lua.send-prometheus-root if { path / }
    http-request use-service lua.send-404
    no log

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting HAProxy Ingress

This document describes important metrics and queries that you can use to monitor and troubleshoot HAProxy Ingress.

Tracking metrics status

You can track the status of HAProxy Ingress metrics with the following alerts:

Exporter process is not serving metrics

# [HAProxy Ingress] Exporter Process Down
absent(haproxy_frontend_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Exporter process is not serving metrics

# [HAProxy Ingress] Exporter Process Down
absent(haproxy_backend_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

This is the default agent job for this integration:

- job_name: 'haproxy-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (haproxy-ingress);(.{0}$)
    replacement: haproxy-ingress
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "haproxy-ingress"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (haproxy_backend_bytes_in_total|haproxy_backend_bytes_out_total|haproxy_backend_client_aborts_total|haproxy_backend_connect_time_average_seconds|haproxy_backend_current_queue|haproxy_backend_http_requests_total|haproxy_backend_http_responses_total|haproxy_backend_limit_sessions|haproxy_backend_queue_time_average_seconds|haproxy_backend_requests_denied_total|haproxy_backend_response_time_average_seconds|haproxy_backend_responses_denied_total|haproxy_backend_sessions_total|haproxy_backend_status|haproxy_frontend_bytes_in_total|haproxy_frontend_bytes_out_total|haproxy_frontend_connections_total|haproxy_frontend_denied_connections_total|haproxy_frontend_denied_sessions_total|haproxy_frontend_request_errors_total|haproxy_frontend_requests_denied_total|haproxy_frontend_responses_denied_total|haproxy_frontend_status|haproxy_process_active_peers|haproxy_process_current_connection_rate|haproxy_process_current_run_queue|haproxy_process_current_session_rate|haproxy_process_current_tasks|haproxy_process_jobs|haproxy_process_ssl_connections_total|haproxy_process_start_time_seconds)
    action: keep    

10 - HAProxy Ingress OpenShift

Metrics, Dashboards, Alerts and more for HAProxy Ingress OpenShift Integration in Sysdig Monitor.
HAProxy Ingress OpenShift

This integration is enabled by default.

Versions supported: > v3.11

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 28 metrics.

Timeseries generated: The HAProxy ingress router generates ~400 time series per HAProxy router pod.

List of Alerts

Alert | Description | Format
[OpenShift-HAProxy-Router] Router Down | Router HAProxy down. No instances running. | Prometheus
[OpenShift-HAProxy-Router] HAProxy Down | HAProxy down on a pod. | Prometheus
[OpenShift-HAProxy-Router] HAProxy Reload Failure | HAProxy reloads are failing. New configurations will not be applied. | Prometheus
[OpenShift-HAProxy-Router] Percentage of routers low | Less than 75% of the routers are up. | Prometheus
[OpenShift-HAProxy-Router] Route Down | This alert detects if all servers are down in a route. | Prometheus
[OpenShift-HAProxy-Router] High Latency | This alert detects high latency in at least one server of the route. | Prometheus
[OpenShift-HAProxy-Router] Pod Health Check Failure | This alert triggers when there is a recurrent pod health check failure. | Prometheus
[OpenShift-HAProxy-Router] Queue not empty in route | This alert triggers when a queue is not empty in a route. | Prometheus
[OpenShift-HAProxy-Router] High error rate in route | This alert triggers when the error rate in a route is higher than 15%. | Prometheus
[OpenShift-HAProxy-Router] Connection errors in route | This alert triggers when there are recurring connection errors in a route. | Prometheus

List of Dashboards

OpenShift HAProxy Ingress Overview

The dashboard provides information on the OpenShift HAProxy Ingress overview. OpenShift HAProxy Ingress Overview

OpenShift HAProxy Ingress Service Details

The dashboard provides information on the OpenShift HAProxy Ingress Service golden signals. OpenShift HAProxy Ingress Service Details

List of Metrics

Metric name
haproxy_backend_http_average_connect_latency_milliseconds
haproxy_backend_http_average_queue_latency_milliseconds
haproxy_backend_http_average_response_latency_milliseconds
haproxy_backend_up
haproxy_frontend_bytes_in_total
haproxy_frontend_bytes_out_total
haproxy_frontend_connections_total
haproxy_frontend_current_session_rate
haproxy_frontend_http_responses_total
haproxy_process_cpu_seconds_total
haproxy_process_max_fds
haproxy_process_resident_memory_bytes
haproxy_process_start_time_seconds
haproxy_process_virtual_memory_bytes
haproxy_server_bytes_in_total
haproxy_server_bytes_out_total
haproxy_server_check_failures_total
haproxy_server_connection_errors_total
haproxy_server_connections_total
haproxy_server_current_queue
haproxy_server_current_sessions
haproxy_server_downtime_seconds_total
haproxy_server_http_average_response_latency_milliseconds
haproxy_server_http_responses_total
haproxy_server_up
haproxy_up
kube_workload_status_desired
template_router_reload_failure

Preparing the Integration

Openshift 3.11

Once the Sysdig agent is deployed, check if it is running on all nodes (compute, master, and infra):

oc get nodes
oc get pods -n sysdig-agent -o wide

Apply this patch in case the Agent is not running on infra/master.

oc patch namespace sysdig-agent --patch-file='sysdig-agent-namespace-patch.yaml'

sysdig-agent-namespace-patch.yaml file

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: ""

OpenShift integrates security by default. Therefore, if you want Sysdig agent to scrape HAProxy router metrics, provide it with the necessary permissions. To do so:

oc apply -f router-clusterrolebinding-sysdig-agent-oc3.yaml

router-clusterrolebinding-sysdig-agent-oc3.yaml file

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: haproxy-route-monitoring
rules:
- apiGroups:
  - route.openshift.io
  resources:
  - routers/metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: sysdig-agent
  name: sysdig-router-monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: haproxy-route-monitoring
subjects:
- kind: ServiceAccount
  name: sysdig-agent
  namespace: sysdig-agent   # Remember to change to the namespace where you have the Sysdig agents deployed

Openshift 4.X

OpenShift integrates security by default. Therefore, if you want Sysdig agent to scrape HAProxy router metrics, provide it with the necessary permissions. To do so:

oc apply -f router-clusterrolebinding-sysdig-agent-oc4.yaml

router-clusterrolebinding-sysdig-agent-oc4.yaml file

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: router-monitoring-sysdig-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: router-monitoring
subjects:
- kind: ServiceAccount
  name: sysdig-agent
  namespace: sysdig-agent   # Remember to change to the namespace where you have the Sysdig agents deployed

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting HAProxy Ingress OpenShift

This document describes important metrics and queries that you can use to monitor and troubleshoot HAProxy Ingress OpenShift.

Tracking metrics status

You can track the status of HAProxy Ingress OpenShift metrics with the following alerts:

Exporter process is not serving metrics

# [HAProxy Ingress OpenShift] Exporter Process Down
absent(haproxy_process_start_time_seconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Exporter process is not serving metrics

# [HAProxy Ingress OpenShift] Exporter Process Down
absent(haproxy_server_http_average_response_latency_milliseconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

This is the default agent job for this integration:

- job_name: 'haproxy-router'
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:1936
    target_label: __address__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (router);(.{0}$)
    replacement: openshift-haproxy
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "openshift-haproxy"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (haproxy_backend_http_average_connect_latency_milliseconds|haproxy_backend_http_average_queue_latency_milliseconds|haproxy_backend_http_average_response_latency_milliseconds|haproxy_backend_up|haproxy_frontend_bytes_in_total|haproxy_frontend_bytes_out_total|haproxy_frontend_connections_total|haproxy_frontend_current_session_rate|haproxy_frontend_http_responses_total|haproxy_process_cpu_seconds_total|haproxy_process_max_fds|haproxy_process_resident_memory_bytes|haproxy_process_start_time_seconds|haproxy_process_virtual_memory_bytes|haproxy_server_bytes_in_total|haproxy_server_bytes_out_total|haproxy_server_check_failures_total|haproxy_server_connection_errors_total|haproxy_server_connections_total|haproxy_server_current_queue|haproxy_server_current_sessions|haproxy_server_downtime_seconds_total|haproxy_server_http_average_response_latency_milliseconds|haproxy_server_http_responses_total|haproxy_server_up|haproxy_up|template_router_reload_failure)
    action: keep    

11 - Harbor

Metrics, Dashboards, Alerts and more for Harbor Integration in Sysdig Monitor.
Harbor

This integration is enabled by default.

Versions supported: > v2.3

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 44 metrics.

Timeseries generated: 800 timeseries

List of Alerts

Alert | Description | Format
[Harbor] Harbor Core Is Down | Harbor Core Is Down | Prometheus
[Harbor] Harbor Database Is Down | Harbor Database Is Down | Prometheus
[Harbor] Harbor Registry Is Down | Harbor Registry Is Down | Prometheus
[Harbor] Harbor Redis Is Down | Harbor Redis Is Down | Prometheus
[Harbor] Harbor Trivy Is Down | Harbor Trivy Is Down | Prometheus
[Harbor] Harbor JobService Is Down | Harbor JobService Is Down | Prometheus
[Harbor] Project Quota Is Raising The Limit | Project Quota Is Raising The Limit | Prometheus
[Harbor] Harbor p99 latency is higher than 10 seconds | Harbor p99 latency is higher than 10 seconds | Prometheus
[Harbor] Harbor Error Rate is High | Harbor Error Rate is High | Prometheus

List of Dashboards

Harbor

The dashboard provides information on the Harbor instance status, storage usage, projects and tasks. Harbor

List of Metrics

Metric name
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
harbor_artifact_pulled
harbor_core_http_request_duration_seconds
harbor_jobservice_task_process_time_seconds
harbor_project_member_total
harbor_project_quota_byte
harbor_project_quota_usage_byte
harbor_project_repo_total
harbor_project_total
harbor_quotas_size_bytes
harbor_task_concurrency
harbor_task_queue_latency
harbor_task_queue_size
harbor_up
process_cpu_seconds_total
process_max_fds
process_open_fds
registry_http_request_duration_seconds_bucket
registry_http_request_size_bytes_bucket
registry_http_requests_total
registry_http_response_size_bytes_bucket
registry_storage_action_seconds_bucket

Preparing the Integration

Enable Prometheus Metrics

As seen in the Harbor documentation page Configure the Harbor YML File, to make Harbor expose an endpoint for scraping metrics, you need to set the ‘metric.enabled’ configuration to ’true’.

If you install Harbor with Helm, you need to use the following flag:

--set 'metrics.enabled=true'
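
If you configure Harbor through the harbor.yml file instead, the relevant section is roughly the following (a minimal sketch; the port and path shown are the usual defaults and may differ in your installation):

metric:
  enabled: true
  port: 9090
  path: /metrics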

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Harbor

This document describes important metrics and queries that you can use to monitor and troubleshoot Harbor.

Tracking metrics status

You can track the status of Harbor metrics with the following alerts:

Exporter process is not serving metrics

# [Harbor] Exporter Process Down
absent(harbor_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0

Agent Configuration

These are the default agent jobs for this integration:

- job_name: harbor-exporter-default
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_container_port_number
    regex: exporter;8080
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:8001
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

- job_name: harbor-core-default
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_container_port_number
    regex: core;8080
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:8001
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

- job_name: harbor-registry-default
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_container_port_number
    regex: registry;5000
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:8001
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

- job_name: harbor-jobservice-default
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_container_port_number
    regex: jobservice;8080
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:8001
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

12 - Istio

Metrics, Dashboards, Alerts and more for Istio Integration in Sysdig Monitor.
Istio

This integration is enabled by default.

Versions supported: 1.14

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 28 metrics.

Timeseries generated: 15 timeseries

List of Alerts

Alert | Description | Format
[Istio-Citadel] CSR without success | Some of the Certificate Signing Requests (CSRs) were not correctly requested. | Prometheus
[Istio-Pilot] Inbound listener rules conflicts | There are conflicts with the inbound listener rules. | Prometheus
[Istio-Pilot] Endpoint found in unready state | Endpoint found in unready state. | Prometheus
[Istio] Unstable requests for sidecar injections | Sidecar injection requests are failing. | Prometheus
[Istiod] Istiod Uptime issue | Istiod UpTime is taking more time than usual. | Prometheus

List of Dashboards

Istio v1.14 Control Plane

The dashboard provides information on the Istio Control Plane, Pilot, Galley, Mixer and Citadel. Istio v1.14 Control Plane

List of Metrics

Metric name
citadel_server_csr_count
citadel_server_success_cert_issuance_count
galley_validation_failed
galley_validation_passed
istiod_uptime_seconds
pilot_conflict_inbound_listener
pilot_conflict_outbound_listener_http_over_current_tcp
pilot_conflict_outbound_listener_tcp_over_current_http
pilot_conflict_outbound_listener_tcp_over_current_tcp
pilot_endpoint_not_ready
pilot_services
pilot_total_xds_internal_errors
pilot_total_xds_rejects
pilot_virt_services
pilot_xds
pilot_xds_cds_reject
pilot_xds_config_size_bytes_bucket
pilot_xds_eds_reject
pilot_xds_lds_reject
pilot_xds_push_context_errors
pilot_xds_push_time_bucket
pilot_xds_pushes
pilot_xds_rds_reject
pilot_xds_send_time_bucket
pilot_xds_write_timeout
sidecar_injection_failure_total
sidecar_injection_requests_total
sidecar_injection_success_total

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Istio

This document describes the main alarms and dashboards for the Istio service. Istio services are built on network rules, so all the alarms and dashboards monitor problems related to traffic and connections between source and destination.

Alarms

Most of the alarms associated with the Istio configuration notify problems with the Pilot or Citadel servers. These servers are responsible for important parts of the Istio configuration.

Citadel controls authentication and identity management between services, and manages certificates in every workload.

Pilot accepts the traffic behavior rules provided by the control plane and converts them into configurations applied by Envoy, based on how configuration aspects are managed locally. Basically, Pilot is responsible for the iptables configuration in every workload.

CSR Without Success

Alarms are defined to notify you of faulty Certificate Signing Requests (CSRs). In order to collect that information, the following metrics are used:

  • citadel_server_csr_count
  • citadel_server_success_cert_issuance_count

rate(citadel_server_csr_count[5m]) - rate(citadel_server_success_cert_issuance_count[5m]) > 0

What is CSR: A certificate signing request (CSR) is one of the first steps towards getting your own SSL/TLS certificate. Generated on the same server you plan to install the certificate on, the CSR contains information such as common name, organization, and country. The Certificate Authority (CA) will use CSR to create your certificate. CSR also contains the public key that will be included in your certificate and is signed with the corresponding private key.
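
If you also manage alerts as Prometheus rule files, a minimal alerting-rule sketch around the expression above might look like the following (the group name, alert name, duration and severity are illustrative, not part of the integration):

groups:
- name: istio-citadel.rules
  rules:
  - alert: IstioCitadelCSRWithoutSuccess
    # Fires when CSRs are requested faster than certificates are successfully issued
    expr: rate(citadel_server_csr_count[5m]) - rate(citadel_server_success_cert_issuance_count[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Some Certificate Signing Requests were not successfully completed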

Inbound Listener Rules Conflicts

Istio works with networking rules, configuring IP addresses, ports, sockets, and so on to send or receive traffic. The term listeners refers to these configurable values. Be aware of possible errors or conflicts between these rules.

pilot_conflict_inbound_listener > 0

Endpoint Found in Unready State

In order to have a stable platform, you need to verify that all endpoints in your network are working properly. Use the following alarm to collect that information:

pilot_endpoint_not_ready > 0

Unstable Requests for Sidecar Injections

Istio configures a sidecar container in every pod and uses this sidecar as the frontend server for all the requests that go to or from that workload. To check whether sidecar injection is working properly, use the following query:

rate(sidecar_injection_requests_total [5m]) -  rate(sidecar_injection_success_total [5m]) > 0

Dashboards

Traffic

Traffic is the first golden signal to gather. Because Istio itself provides traffic management, the information it exposes is detailed. Istio has three different parts that you can monitor, each exposing different metrics: the control plane, the Envoy proxies, and the services themselves.

The following examples show how to gather information about Istio service traffic.

Use the istio_requests_total metric with the relevant labels to collect a wide range of information across different panels.

Client Request Volume and Server Request Volume

The istio_requests_total metric shows the total request traffic from both sides of the connection, using the reporter label.

The reporter label identifies the reporter of the request. It is set to destination if the report comes from an Istio proxy server, and to source if the report comes from an Istio proxy client or a gateway.

sum (irate(istio_requests_total{reporter="source"}[5m]))
sum (irate(istio_requests_total{reporter="destination"}[5m]))

Incoming Request by Source/Destination and Response Code

This dashboard shows the requests received by both source and destination using the reporter label. The following query segments the HTTP codes with the response_code label.

sum(irate(istio_requests_total{reporter="source"}[5m])) by (source_workload, source_workload_namespace, response_code)
sum(irate(istio_requests_total{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, response_code)

Client/Server Success Rate (non-5xx responses)

The following query builds a dashboard to monitor all the traffic except that related to internal server errors. The reporter label is used to segment by both source and destination.

100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="destination"}[5m])))
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="source",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="source"}[5m])))

Errors

The errors summarized in these dashboards are related to the HTTP traffic managed by the Istio proxies.

4xx Response Code by Source/Destination

The following query builds a dashboard that reports all the bad requests. It uses the reporter label on both source and destination.

 sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"4.*"}) -1
 sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"4.*"}) -1

5xx Response Code by Source/Destination

The following query builds a dashboard to show all internal server error (5xx) requests. The query uses the reporter label on both source and destination.

sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"5.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"5.*"}) -1

Latency and Saturation

Both latency and saturation are reported on these dashboards because both are related to request duration and request/response size.

Client/Server Request Duration

The following query builds a dashboard to show the request duration at a given quantile.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by (le, source_service_name)) / 1000
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000

Incoming Request Size by Source/Destination

The following query builds a dashboard to show the size of incoming requests at a given quantile.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))

Response Size By Source/Destination

The following query builds a dashboard to show the size of responses at a given quantile.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))

Agent Configuration

This is the default agent job for this integration:

- job_name: 'istiod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (discovery);(.{0}$)
    replacement: istiod
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "istiod"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (citadel_server_csr_count|citadel_server_success_cert_issuance_count|galley_validation_failed|galley_validation_passed|istiod_uptime_seconds|pilot_conflict_inbound_listener|pilot_conflict_outbound_listener_http_over_current_tcp|pilot_conflict_outbound_listener_tcp_over_current_http|pilot_conflict_outbound_listener_tcp_over_current_tcp|pilot_endpoint_not_ready|pilot_services|pilot_total_xds_internal_errors|pilot_total_xds_rejects|pilot_virt_services|pilot_xds|pilot_xds_cds_reject|pilot_xds_config_size_bytes_bucket|pilot_xds_eds_reject|pilot_xds_lds_reject|pilot_xds_push_context_errors|pilot_xds_push_time_bucket|pilot_xds_pushes|pilot_xds_rds_reject|pilot_xds_send_time_bucket|pilot_xds_write_timeout|sidecar_injection_failure_total|sidecar_injection_requests_total|sidecar_injection_success_total)
    action: keep

13 - Istio Envoy

Metrics, Dashboards, Alerts and more for Istio Envoy Integration in Sysdig Monitor.
Istio Envoy

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

Versions supported: 1.14

This integration has 16 metrics.

Timeseries generated: 155 timeseries per envoy

List of Alerts

Alert | Description | Format
[Istio-Envoy] High 4xx RequestError Rate | 4xx RequestError Rate is higher than 5% | Prometheus
[Istio-Envoy] High 5xx RequestError Rate | 5xx RequestError Rate is higher than 5% | Prometheus
[Istio-Envoy] High Request Latency | Envoy Request Latency is higher than 100ms | Prometheus

List of Dashboards

Istio v1.14 Workload

The dashboard provides information on the Istio Envoy proxy status. Istio v1.14 Workload

Istio v1.14 Service

The dashboard provides information on the Istio Service, Request rates and duration for Http and TCP connections. Istio v1.14 Service

List of Metrics

Metric name
citadel_server_csr_count
envoy_cluster_membership_change
envoy_cluster_membership_healthy
envoy_cluster_membership_total
envoy_cluster_upstream_cx_active
envoy_cluster_upstream_cx_connect_ms_bucket
envoy_cluster_upstream_rq_active
envoy_cluster_upstream_rq_pending_active
envoy_server_days_until_first_cert_expiring
istio_build
istio_request_bytes_bucket
istio_request_duration_milliseconds_bucket
istio_requests_total
istio_response_bytes_bucket
istio_tcp_received_bytes_total
istio_tcp_sent_bytes_total

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This integration has no default agent job.

14 - Kafka

Metrics, Dashboards, Alerts and more for Kafka Integration in Sysdig Monitor.
Kafka

This integration is enabled by default.

Versions supported: > v2.7.x

This integration uses a standalone exporter that is available in UBI or scratch base image.

This integration has 37 metrics.

Timeseries generated: The JMX-Exporter generates ~270 timeseries and the Kafka-Exporter ~138 timeseries (the number of topics, partitions and consumers increases this number).

List of Alerts

Alert | Description | Format
[Kafka] Broker Down | There are fewer Kafka brokers up than expected. The ‘workload’ label of the Kafka Deployment/StatefulSet must be specified. | Prometheus
[Kafka] No Leader | There is no ActiveController or ’leader’ in the Kafka cluster. | Prometheus
[Kafka] Too Many Leaders | There is more than one ActiveController or ’leader’ in the Kafka cluster. | Prometheus
[Kafka] Offline Partitions | There are one or more Offline Partitions. These partitions don’t have an active leader and are hence not writable or readable. | Prometheus
[Kafka] Under Replicated Partitions | There are one or more Under Replicated Partitions. | Prometheus
[Kafka] Under In-Sync Replicated Partitions | There are one or more Under In-Sync Replicated Partitions. These partitions will be unavailable to producers who use ‘acks=all’. | Prometheus
[Kafka] ConsumerGroup Lag Not Decreasing | The ConsumerGroup lag is not decreasing. The Consumers might be down, failing to process the messages and continuously retrying, or their consumption rate is lower than the production rate of messages. | Prometheus
[Kafka] ConsumerGroup Without Members | The ConsumerGroup doesn’t have any members. | Prometheus
[Kafka] Producer High ThrottleTime By Client-Id | The Producer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. | Prometheus
[Kafka] Producer High ThrottleTime By User | The Producer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. | Prometheus
[Kafka] Producer High ThrottleTime By User And Client-Id | The Producer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. | Prometheus
[Kafka] Consumer High ThrottleTime By Client-Id | The Consumer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. | Prometheus
[Kafka] Consumer High ThrottleTime By User | The Consumer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. | Prometheus
[Kafka] Consumer High ThrottleTime By User And Client-Id | The Consumer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. | Prometheus

List of Dashboards

Kafka

The dashboard provides information on the status of Kafka. Kafka

List of Metrics

Metric name
kafka_brokers
kafka_consumergroup_current_offset
kafka_consumergroup_lag
kafka_consumergroup_members
kafka_controller_active_controller
kafka_controller_offline_partitions
kafka_log_size
kafka_network_consumer_request_time_milliseconds
kafka_network_fetch_follower_time_milliseconds
kafka_network_producer_request_time_milliseconds
kafka_server_bytes_in
kafka_server_bytes_out
kafka_server_consumer_client_byterate
kafka_server_consumer_client_throttle_time
kafka_server_consumer_user_byterate
kafka_server_consumer_user_client_byterate
kafka_server_consumer_user_client_throttle_time
kafka_server_consumer_user_throttle_time
kafka_server_messages_in
kafka_server_partition_leader_count
kafka_server_producer_client_byterate
kafka_server_producer_client_throttle_time
kafka_server_producer_user_byterate
kafka_server_producer_user_client_byterate
kafka_server_producer_user_client_throttle_time
kafka_server_producer_user_throttle_time
kafka_server_under_isr_partitions
kafka_server_under_replicated_partitions
kafka_server_zookeeper_auth_failures
kafka_server_zookeeper_disconnections
kafka_server_zookeeper_expired_sessions
kafka_server_zookeeper_read_only_connections
kafka_server_zookeeper_sasl_authentications
kafka_server_zookeeper_sync_connections
kafka_topic_partition_current_offset
kafka_topic_partition_oldest_offset
kube_workload_status_desired

Preparing the Integration

Installation of the JMX-Exporter as a Sidecar

The JMX-Exporter can be easily installed in two steps.

First deploy the ConfigMap which contains the Kafka JMX configurations. The following example is for a Kafka cluster which exposes the jmx port 9010:

helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts 
helm repo update
helm -n kafka install kafka-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=9010 --set integrationType=kafka --set onlyCreateJMXConfigMap=true

Then generate a patch file and apply it to your workload (your Kafka Deployment/StatefulSet/Daemonset). The following example is for a Kafka cluster which exposes the jmx port 9010, and is deployed as a StatefulSet called ‘kafka-cp-kafka’:

helm template kafka-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=9010 --set integrationType=kafka --set onlyCreateSidecarPatch=true > jmx-exporter-sidecar-patch.yaml
kubectl -n kafka patch sts kafka-cp-kafka --patch-file jmx-exporter-sidecar-patch.yaml

Create Secrets for Authentication for the Kafka-Exporter

Your Kafka cluster’s external endpoints might be secured using authentication for the clients that want to connect to it (TLS, SASL+SCRAM, SASL+Kerberos). If you want the Kafka-Exporter (which will be deployed in the next step) to use these secured external endpoints, you’ll need to create Kubernetes Secrets as described below. If you prefer the Kafka-Exporter to connect to the Kafka cluster through an internal, non-secured (plaintext) endpoint, skip this step.

If using TLS, you’ll need to create a Secret which contains the CA, the client certificate and the client key. The names of these files must be “ca.crt”, “tls.crt” and “tls.key”. The name of the secret can be any name that you want. Example:

kubectl create secret generic kafka-exporter-certs --from-file=./tls.key --from-file=./tls.crt --from-file=./ca.crt --dry-run=true -o yaml | kubectl apply -f -

If using SASL+SCRAM, you’ll need to create a Secret which contains the “username” and “password”. Example:

echo -n 'admin' > username
echo -n '1f2d1e2e67df' > password
kubectl create secret generic kafka-exporter-sasl-scram --from-file=username --from-file=password --dry-run=true -o yaml | kubectl apply -f -

If using SASL+Kerberos, you’ll need to create a Secret which contains the “kerberos.conf”. If the ‘Kerberos Auth Type’ is ‘keytabAuth’, it should also contain the “kerberos.keytab”. Example:

kubectl create secret generic kafka-exporter-sasl-kerberos --from-file=./kerberos.conf --from-file=./kerberos.keytab --dry-run=true -o yaml | kubectl apply -f -

Installing

An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporters directly using the Helm charts shown in the previous section.

Monitoring and Troubleshooting Kafka

Here are some interesting metrics and queries to monitor and troubleshoot Kafka.

Brokers

Broker Down

Let’s get the number of expected Brokers and the actual number of Brokers up and running. If the numbers don’t match, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kafka_brokers)
> 0

Leadership

Let’s get the number of Kafka leaders. There should always be one leader. If not, a Kafka misconfiguration or a networking issue might be the problem.

sum(kafka_controller_active_controller) < 1

If there is more than one leader, that might be a temporary situation while the leadership is changing. If this case doesn’t get fixed by itself over time, a split-brain situation might be happening.

sum(kafka_controller_active_controller) > 1

Offline, Under Replicated and In-Sync Under Replicated Partitions

When a Broker goes down, the other Brokers in the cluster will take leadership of the partitions it was leading. If several brokers go down, or just a few but the topic had a low replication factor, there will be Offline partitions. These partitions don’t have an active leader and are hence not writable or readable, which is most likely dangerous for the business.

Let’s check if there are offline partitions:

sum(kafka_controller_offline_partitions) > 0

If other Brokers had replicas of those partitions, one of them will take leadership and the service won’t be down. In this situation there will be Under Replicated partitions. If there are enough Brokers where these partitions can be replicated, the situation will be fixed by itself over time. If there aren’t enough Brokers, the situation will only be fixed once the Brokers which went down come up again.

The following expression is used to get the under replication partitions:

sum(kafka_server_under_replicated_partitions) > 0

But there is a situation where having no Offline partitions but having Under Replicated partitions might pose a real problem: topics with a ‘Minimum In-Sync Replicas’ setting combined with Kafka Producers configured with ‘acks=all’.

If one of these topics has any partition with fewer replicas than its ‘Minimum In-Sync Replicas’ configuration, and there is a Producer with ‘acks=all’, that Producer won’t be able to produce messages into that partition, since ‘acks=all’ means that it waits for the produced messages to be replicated to at least the minimum number of in-sync replicas in the Kafka cluster.

If the Producers have any configuration other than ‘acks=all’, there won’t be any problem.

This is how Under In-Sync Replicated partitions can be checked:

sum(kafka_server_under_isr_partitions) > 0

Network

Broker Bytes In

Let’s get the amount of bytes produced into each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_in)

Broker Bytes Out

Now the same, but for bytes consumed from each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_out)

Broker Messages In

And similar, but for number of messages produced into each Broker:

sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_messages_in)

Topics

Topic Size

This query returns the size of a topic in the whole Kafka cluster. It also includes the size of all replicas, so increasing the replication factor of a topic will increase the overall size across the Kafka cluster.

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size)

In case of needing the size of a topic in each Broker, use the following query:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_log_size)

In a situation where the Broker disk space is running low, the retention of the topics can be decreased to free up some space. Let’s get the top 10 biggest topics:

topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size))

If this “low disk space” situation happened out of the blue, there might be a problem in a topic with a Producer filling it with unwanted messages. The following query helps find which topics increased their size the most in the past few hours, making it easier to find the cause of the sudden increase in messages. It wouldn’t be the first time an exhausted developer wanted to perform a stress test on a topic in a Staging environment, but accidentally did it in Production.

topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(delta(kafka_log_size[$__interval])))

Topic Messages

Calculating the number of messages inside a topic is as easy as subtracting the offset of the oldest message from the offset of the newest message:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_current_offset) - sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_oldest_offset)

But it’s very important to acknowledge that this is only true for topics with ‘compaction’ disabled, since compacted topics might have deleted messages in the middle. To get the number of messages in a compacted topic, a new Consumer must consume all the messages in that topic to count them.

It’s also quite easy to calculate the rate per second of messages being produced into a topic:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(rate(kafka_topic_partition_current_offset[$__interval]))

ConsumerGroup

ConsumerGroup Lag

Let’s check the ConsumerGroup lag of a Consumer in each partition of a topic:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(kafka_consumergroup_lag)

If the lag of a ConsumerGroup is constantly increasing and never decreases, it might have different causes. The Consumers of the ConsumerGroups might be down, one of them might be failing to process the messages and continuously retrying, or their consumption rate might be lower than the production rate of messages.

A non-stop increasing lag can be detected using the following expression:

(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(kafka_consumergroup_lag) > 0)
and
(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(delta(kafka_consumergroup_lag[2m])) >= 0)

ConsumerGroup Consumption Rate

It might be useful to get the consumption speed of the Consumers of a ConsumerGroup, to detect any issues while processing messages, like internal issues related to the messages, or external issues related to the business. For example, the Consumers might want to send the processed messages to another microservice or another database, but there might be networking issues, or the database performance might be degraded so it slows down the Consumer.

Here you can check the consumption rate:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(rate(kafka_consumergroup_current_offset[$__interval]))

ConsumerGroup Members

It might also help to know the number of Consumers in a ConsumerGroup, in case there are fewer than expected:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup)(kafka_consumergroup_members)

Quotas

Kafka has the option to enforce quotas on requests to control the Broker resources used by clients (Producers and Consumers).

Quotas can be applied to user, client-id or both groups at the same time.

Each client can utilize this quota per Broker before it gets throttled. Throttling means that the client will need to wait some time before being able to produce or consume messages again.

Production/Consumption Rate

Depending on whether the client is a Consumer or a Producer, and whether the quota is applied at the client-id level, the user level, or both at the same time, a different metric will be used:

  • kafka_server_producer_client_byterate
  • kafka_server_producer_user_byterate
  • kafka_server_producer_user_client_byterate
  • kafka_server_consumer_client_byterate
  • kafka_server_consumer_user_byterate
  • kafka_server_consumer_user_client_byterate

Let’s check for example the production rate of a Producer using both user and client-id:

sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name, user, client_id)(kafka_server_producer_user_client_byterate)

Production/Consumption Throttle Time

Similar to the byterate metrics, there are throttle time metrics for the same combinations of clients and quota groups:

  • kafka_server_producer_client_throttle_time
  • kafka_server_producer_user_throttle_time
  • kafka_server_producer_user_client_throttle_time
  • kafka_server_consumer_client_throttle_time
  • kafka_server_consumer_user_throttle_time
  • kafka_server_consumer_user_client_throttle_time

Let’s check in this case whether the throttle time of a Consumer using user and client-id is higher than one second in at least one Broker:

max by(kube_cluster_name, kube_namespace_name, kube_workload_name, user, client_id)(kafka_server_consumer_user_client_throttle_time) > 1000

Agent Configuration

These are the default agent jobs for this integration:

- job_name: 'kafka-exporter-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (kafka-exporter);(.{0}$)
    replacement: kafka
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (kafka-exporter);(kafka)
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
    target_label: kube_workload_type
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
    target_label: kube_workload_name
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: (kafka_brokers|kafka_consumergroup_current_offset|kafka_consumergroup_lag|kafka_consumergroup_members|kafka_topic_partition_current_offset|kafka_topic_partition_oldest_offset|kube_workload_status_desired)
      action: keep
- job_name: 'kafka-jmx-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (kafka-jmx-exporter);(kafka)
    replacement: kafka
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (kafka-jmx-exporter);(kafka)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: (kafka_controller_active_controller|kafka_controller_offline_partitions|kafka_log_size|kafka_network_consumer_request_time_milliseconds|kafka_network_fetch_follower_time_milliseconds|kafka_network_producer_request_time_milliseconds|kafka_server_bytes_in|kafka_server_bytes_out|kafka_server_consumer_client_byterate|kafka_server_consumer_client_throttle_time|kafka_server_consumer_user_byterate|kafka_server_consumer_user_client_byterate|kafka_server_consumer_user_client_throttle_time|kafka_server_consumer_user_throttle_time|kafka_server_messages_in|kafka_server_partition_leader_count|kafka_server_producer_client_byterate|kafka_server_producer_client_throttle_time|kafka_server_producer_user_byterate|kafka_server_producer_user_client_byterate|kafka_server_producer_user_client_throttle_time|kafka_server_producer_user_throttle_time|kafka_server_under_isr_partitions|kafka_server_under_replicated_partitions|kafka_server_zookeeper_auth_failures|kafka_server_zookeeper_disconnections|kafka_server_zookeeper_expired_sessions|kafka_server_zookeeper_read_only_connections|kafka_server_zookeeper_sasl_authentications|kafka_server_zookeeper_sync_connections)
      action: keep

15 - KEDA

Metrics, Dashboards, Alerts and more for KEDA Integration in Sysdig Monitor.
KEDA

This integration is enabled by default.

Versions supported: > v2.0

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 6 metrics.

Timeseries generated: 3 metrics per Keda deployment + 1 metric per API metric timeseries

List of Alerts

Alert | Description | Format
[Keda] Errors in Scaled Object | Errors detected in scaled object | Prometheus

List of Dashboards

Keda

The dashboard provides information on the errors, values of the metrics generated and replicas of the scaled object. Keda

List of Metrics

Metric name
keda_metrics_adapter_scaled_object_errors
keda_metrics_adapter_scaler_metrics_value
kubernetes.hpa.replicas.current
kubernetes.hpa.replicas.desired
kubernetes.hpa.replicas.max
kubernetes.hpa.replicas.min

Preparing the Integration

Enable Prometheus Metrics

KEDA exposes Prometheus metrics and annotates the metrics API server pod with Prometheus annotations.

Make sure that the Prometheus metrics are enabled. If you install KEDA with Helm, use the following flag:

--set prometheus.metricServer.enabled=true
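
For example, here is a minimal sketch of installing KEDA with its upstream Helm chart and enabling the metrics server. The kedacore repository, release name, and namespace are assumptions for illustration, not values taken from this guide:

# Sketch: install KEDA via Helm with Prometheus metrics enabled (names are placeholders)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda --create-namespace \
  --set prometheus.metricServer.enabled=true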

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting KEDA

This document describes important metrics and queries that you can use to monitor and troubleshoot KEDA.

Tracking metrics status

You can track KEDA metrics status with the following alerts:

Exporter process is not serving metrics

# [KEDA] Exporter Process Down
absent(keda_metrics_adapter_scaled_object_errors{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
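
As an additional hedged sketch (not one of the bundled alerts), you can also watch for scaled objects that report errors, using the keda_metrics_adapter_scaled_object_errors metric listed above. The scaledObject grouping label is an assumption; check the labels exposed by your KEDA version:

# Sketch: scaled objects reporting errors over the last 10 minutes
sum by (kube_cluster_name, kube_namespace_name, scaledObject)(rate(keda_metrics_adapter_scaled_object_errors[10m])) > 0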

Agent Configuration

This is the default agent job for this integration:

- job_name: keda-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (keda-operator-metrics-apiserver);(.{0}$)
    replacement: keda
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "keda"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

16 - Kube State Metrics OSS

Metrics, Dashboards, Alerts and more for Kube State Metrics OSS Integration in Sysdig Monitor.
Kube State Metrics OSS

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration has 19 metrics.

This integration refers to the official OSS KSM exporter for Kubernetes.

List of Dashboards

KSM Pod Status & Performance

The dashboard provides information on the Pod Status and Performance. KSM Pod Status & Performance

KSM Workload Status & Performance

The dashboard provides information on the Workload Status and Performance. KSM Workload Status & Performance

KSM Container Resource Usage & Troubleshooting

The dashboard provides information on the Container Resource Usage and Troubleshooting. KSM Container Resource Usage & Troubleshooting

KSM Cluster / Namespace Available Resources

The dashboard provides information on the Cluster and Namespace Available Resources. KSM Cluster / Namespace Available Resources

List of Metrics

Metric name
ksm_container_cpu_cores_used
ksm_container_cpu_quota_used_percent
ksm_container_info
ksm_container_memory_limit_used_percent
ksm_container_memory_used_bytes
ksm_kube_node_status_allocatable
ksm_kube_node_status_capacity
ksm_kube_pod_container_status_restarts_total
ksm_kube_pod_container_status_terminated_reason
ksm_kube_pod_container_status_waiting_reason
ksm_kube_pod_status_ready
ksm_kube_pod_status_reason
ksm_kube_resourcequota
ksm_workload_status_desired
ksm_workload_status_ready
kube_pod_container_cpu_request
kube_pod_container_memory_request
kube_pod_container_resource_limits_cpu_cores
kube_pod_container_resource_limits_memory_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This integration has no default agent job.

17 - Kubernetes

Metrics, Dashboards, Alerts and more for Kubernetes Integration in Sysdig Monitor.
Kubernetes

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration has 70 metrics.

List of Alerts

Alert | Description | Format
[Kubernetes] Container Waiting | Container in waiting status for a long time (CrashLoopBackOff, ImagePullErr…) | Prometheus
[Kubernetes] Container Restarting | Container restarting | Prometheus
[Kubernetes] Pod Not Ready | Pod in not ready status | Prometheus
[Kubernetes] Init Container Waiting For a Long Time | Init container in waiting state (CrashLoopBackOff, ImagePullErr…) | Prometheus
[Kubernetes] Pod Container Creating For a Long Time | Pod is stuck in ContainerCreating state | Prometheus
[Kubernetes] Pod Container Terminated With Error | Pod Container Terminated With Error (OOMKilled, Error…) | Prometheus
[Kubernetes] Init Container Terminated With Error | Init Container Terminated With Error (OOMKilled, Error…) | Prometheus
[Kubernetes] Workload with Pods not Ready | Workload with Pods not Ready (Evicted, NodeLost, UnexpectedAdmissionError) | Prometheus
[Kubernetes] Workload Replicas Mismatch | There are pods in the workload that could not start | Prometheus
[Kubernetes] Pod Not Scheduled For DaemonSet | Pods cannot be scheduled for DaemonSet | Prometheus
[Kubernetes] Pods In DaemonSet Incorrectly Scheduled | There are pods from a DaemonSet that should not be running | Prometheus
[Kubernetes] CPU Overcommit | CPU OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus
[Kubernetes] Memory Overcommit | Memory OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus
[Kubernetes] CPU OverUsage | CPU OverUsage in cluster. If one node fails, the cluster will not have enough CPU to run all the current pods. | Prometheus
[Kubernetes] Memory OverUsage | Memory OverUsage in cluster. If one node fails, the cluster will not have enough memory to run all the current pods. | Prometheus
[Kubernetes] Container CPU Throttling | Container CPU usage next to limit. Possible CPU Throttling. | Prometheus
[Kubernetes] Container Memory Next To Limit | Container memory usage next to limit. Risk of Out Of Memory Kill. | Prometheus
[Kubernetes] Container CPU Unused | Container unused CPU higher than 85% of request for 8 hours. | Prometheus
[Kubernetes] Container Memory Unused | Container unused Memory higher than 85% of request for 8 hours. | Prometheus
[Kubernetes] Node Not Ready | Node in Not-Ready condition | Prometheus
[Kubernetes] Not All Nodes Are Ready | Not all nodes are in Ready condition. | Prometheus
[Kubernetes] Too Many Pods In Node | Node close to its limits of pods. | Prometheus
[Kubernetes] Node Readiness Flapping | Node availability is unstable. | Prometheus
[Kubernetes] Nodes Disappeared | Fewer nodes in the cluster than 30 minutes before. | Prometheus
[Kubernetes] All Nodes Gone In Cluster | All Nodes Gone In Cluster. | Prometheus
[Kubernetes] Node CPU High Usage | High usage of CPU in node. | Prometheus
[Kubernetes] Node Memory High Usage | High usage of memory in node. Risk of pod eviction. | Prometheus
[Kubernetes] Node Root File System Almost Full | Root file system in node almost full. To include other file systems, change the value of the device label from ‘.root.’ to your device name | Prometheus
[Kubernetes] Max Schedulable Pod Less Than 1 CPU Core | The maximum schedulable CPU request in a pod is less than 1 core. | Prometheus
[Kubernetes] Max Schedulable Pod Less Than 512Mb Memory | The maximum schedulable memory request in a pod is less than 512Mb. | Prometheus
[Kubernetes] HPA Desired Scale Up Replicas Unreached | HPA could not reach the desired scaled-up replicas for a long time. | Prometheus
[Kubernetes] HPA Desired Scale Down Replicas Unreached | HPA could not reach the desired scaled-down replicas for a long time. | Prometheus
[Kubernetes] Job failed to complete | Job failed to complete | Prometheus
[Kubernetes] Cluster is reaching maximum pod capacity (95%) | Review cluster pod capacity to ensure pods can be scheduled. | Prometheus

List of Dashboards

Workload Status & Performance

The dashboard provides information on the Workload Status and Performance. Workload Status & Performance

Pod Status & Performance

The dashboard provides information on the Pod Status and Performance. Pod Status & Performance

Cluster / Namespace Available Resources

The dashboard provides information on the Cluster and Namespace Available Resources. Cluster / Namespace Available Resources

Cluster Capacity Planning

Dashboard used for Cluster Capacity Planning. Cluster Capacity Planning

Container Resource Usage & Troubleshooting

The dashboard provides information on the Container Resource Usage and Troubleshooting. Container Resource Usage & Troubleshooting

Node Status & Performance

The dashboard provides information on the Node Status and Performance. Node Status & Performance

Pod Rightsizing & Workload Capacity Optimization

Dashboard used for Pod Rightsizing and Workload Capacity Optimization. Pod Rightsizing & Workload Capacity Optimization

Pod Scheduling Troubleshooting

Dashboard used for Pod Scheduling Troubleshooting. Pod Scheduling Troubleshooting

Horizontal Pod Autoscaler

The dashboard provides information on the Horizontal Pod Autoscalers. Horizontal Pod Autoscaler

Kubernetes Jobs

The dashboard provides information on the Kubernetes Jobs. Kubernetes Jobs

List of Metrics

Metric name
container.image
container.image.tag
kube_cronjob_next_schedule_time
kube_cronjob_status_active
kube_cronjob_status_last_schedule_time
kube_daemonset_status_current_number_scheduled
kube_daemonset_status_desired_number_scheduled
kube_daemonset_status_number_misscheduled
kube_daemonset_status_number_ready
kube_hpa_status_current_replicas
kube_hpa_status_desired_replicas
kube_job_complete
kube_job_failed
kube_job_spec_completions
kube_job_status_active
kube_namespace_labels
kube_node_info
kube_node_status_allocatable
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
kube_node_status_capacity
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_capacity_pods
kube_node_status_condition
kube_node_sysdig_host
kube_pod_container_info
kube_pod_container_resource_limits
kube_pod_container_resource_requests
kube_pod_container_status_restarts_total
kube_pod_container_status_terminated_reason
kube_pod_container_status_waiting_reason
kube_pod_info
kube_pod_init_container_status_terminated_reason
kube_pod_init_container_status_waiting_reason
kube_pod_status_ready
kube_resourcequota
kube_workload_pods_status_reason
kube_workload_status_desired
kube_workload_status_ready
kubernetes.hpa.replicas.current
kubernetes.hpa.replicas.desired
kubernetes.hpa.replicas.max
kubernetes.hpa.replicas.min
sysdig_container_cpu_cores_used
sysdig_container_cpu_quota_used_percent
sysdig_container_info
sysdig_container_memory_limit_used_percent
sysdig_container_memory_used_bytes
sysdig_container_net_connection_in_count
sysdig_container_net_connection_out_count
sysdig_container_net_connection_total_count
sysdig_container_net_error_count
sysdig_container_net_http_error_count
sysdig_container_net_http_request_time
sysdig_container_net_http_statuscode_request_count
sysdig_container_net_in_bytes
sysdig_container_net_out_bytes
sysdig_container_net_request_count
sysdig_container_net_request_time
sysdig_fs_free_bytes
sysdig_fs_inodes_used_percent
sysdig_fs_total_bytes
sysdig_fs_used_bytes
sysdig_fs_used_percent
sysdig_program_cpu_cores_used
sysdig_program_cpu_used_percent
sysdig_program_memory_used_bytes
sysdig_program_net_connection_total_count
sysdig_program_net_total_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This integration has no default agent job.

18 - Kubernetes API server

Metrics, Dashboards, Alerts and more for Kubernetes API server Integration in Sysdig Monitor.
Kubernetes API server

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 41 metrics.

List of Alerts

Alert | Description | Format
[Kubernetes API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus
[Kubernetes API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus
[Kubernetes API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus
[Kubernetes API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus
[Kubernetes API Server] High 4xx RequestError Rate | API-Server High 4xx Request Error Rate | Prometheus
[Kubernetes API Server] High 5xx RequestError Rate | API-Server High 5xx Request Error Rate | Prometheus
[Kubernetes API Server] High Request Latency | API-Server High Request Latency | Prometheus

List of Dashboards

Kubernetes API Server

The dashboard provides information on the Kubernetes API Server. Kubernetes API Server

List of Metrics

Metric name
apiserver_admission_controller_admission_duration_seconds_count
apiserver_admission_controller_admission_duration_seconds_sum
apiserver_admission_webhook_admission_duration_seconds_count
apiserver_admission_webhook_admission_duration_seconds_sum
apiserver_client_certificate_expiration_seconds_bucket
apiserver_client_certificate_expiration_seconds_count
apiserver_request_duration_seconds_count
apiserver_request_duration_seconds_sum
apiserver_request_total
apiserver_requested_deprecated_apis
apiserver_response_sizes_count
apiserver_response_sizes_sum
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds
process_resident_memory_bytes
workqueue_adds_total
workqueue_depth

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: kubernetes-apiservers-default
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    regex: kube-system;kube-apiserver
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_container_name
  - source_labels:
    - __address__
    action: replace
    target_label: __address__
    regex: (.+)(:\d.+)
    replacement: $1:443
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - action: replace
    source_labels: 
    - __name__
    - resource
    target_label: k8sresource
    regex: (apiserver_requested_deprecated_apis);(.+)
    replacement: $2
  - action: labeldrop
    regex: "^(resource|resourcescope|subresource)$"
  - source_labels: [__name__]
    regex: (apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request_duration_seconds_count|apiserver_request_duration_seconds_sum|apiserver_request_total|apiserver_requested_deprecated_apis|apiserver_response_sizes_count|apiserver_response_sizes_sum|go_build_info|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|process_cpu_seconds_total|process_max_fds|process_open_fds|process_resident_memory_bytes|workqueue_adds_total|workqueue_depth)
    action: keep

19 - Kubernetes controller manager

Metrics, Dashboards, Alerts and more for Kubernetes controller manager Integration in Sysdig Monitor.
Kubernetes controller manager

This integration is enabled by default.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 42 metrics.

List of Alerts

Alert | Description | Format
[Kubernetes controller manager] High 4xx RequestError Rate | Kubernetes Controller Manager High 4xx Request Error Rate | Prometheus
[Kubernetes controller manager] High 5xx RequestError Rate | Kubernetes Controller Manager High 5xx Request Error Rate | Prometheus

List of Dashboards

Kubernetes Controller Manager

The dashboard provides information on the Kubernetes Controller Manager. Kubernetes Controller Manager

List of Metrics

Metric name
cloudprovider_aws_api_request_duration_seconds_count
cloudprovider_aws_api_request_duration_seconds_sum
cloudprovider_aws_api_request_errors
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds
rest_client_request_duration_seconds_count
rest_client_request_duration_seconds_sum
rest_client_requests_total
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes
workqueue_adds_total
workqueue_depth
workqueue_queue_duration_seconds_count
workqueue_queue_duration_seconds_sum
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds_count
workqueue_work_duration_seconds_sum

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: kube-controller-manager-default
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/kube-controller-manager.+'
  - source_labels:
    - __address__
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d)?
    replacement: $1:10257
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (cloudprovider_aws_api_request_duration_seconds_count|cloudprovider_aws_api_request_duration_seconds_sum|cloudprovider_aws_api_request_errors|go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
    action: keep

20 - Kubernetes CoreDNS

Metrics, Dashboards, Alerts and more for Kubernetes CoreDNS Integration in Sysdig Monitor.
Kubernetes CoreDNS

This integration is enabled by default.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 37 metrics.

List of Alerts

Alert | Description | Format
[CoreDNS] Error High | High Request Duration | Prometheus
[CoreDNS] Latency High | Latency High | Prometheus

List of Dashboards

Kubernetes CoreDNS

The dashboard provides information on the Kubernetes CoreDNS. Kubernetes CoreDNS

List of Metrics

Metric name
coredns_cache_hits_total
coredns_cache_misses_total
coredns_dns_request_duration_seconds_bucket
coredns_dns_request_size_bytes_bucket
coredns_dns_requests_total
coredns_dns_response_size_bytes_bucket
coredns_dns_responses_total
coredns_forward_request_duration_seconds_bucket
coredns_panics_total
coredns_plugin_enabled
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds
process_resident_memory_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: kube-dns-default
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/coredns.+'
  - source_labels:
    - __address__
    action: keep
    regex: (.*:9153)
  - source_labels:
    - __meta_kubernetes_pod_name
    action: replace
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

21 - Kubernetes etcd

Metrics, Dashboards, Alerts and more for Kubernetes etcd Integration in Sysdig Monitor.
Kubernetes etcd

This integration is enabled by default.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 54 metrics.

List of Alerts

Alert | Description | Format
[Etcd] Etcd Members Down | There are members down. | Prometheus
[Etcd] Etcd Insufficient Members | Etcd cluster has insufficient members | Prometheus
[Etcd] Etcd No Leader | Member has no leader. | Prometheus
[Etcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus
[Etcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests | Prometheus
[Etcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time | Prometheus
[Etcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance | Prometheus
[Etcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high | Prometheus
[Etcd] Etcd High Commit Durations | 99th percentile commit durations are too high | Prometheus
[Etcd] Etcd High Number Of Failed HTTP Requests | High number of failed HTTP requests | Prometheus
[Etcd] Etcd HTTP Requests Slow | HTTP requests are slow | Prometheus

List of Dashboards

Kubernetes Etcd

The dashboard provides information on the Kubernetes Etcd. Kubernetes Etcd

List of Metrics

Metric name
etcd_debugging_mvcc_db_total_size_in_bytes
etcd_disk_backend_commit_duration_seconds_bucket
etcd_disk_wal_fsync_duration_seconds_bucket
etcd_grpc_proxy_cache_hits_total
etcd_grpc_proxy_cache_misses_total
etcd_http_failed_total
etcd_http_received_total
etcd_http_successful_duration_seconds_bucket
etcd_mvcc_db_total_size_in_bytes
etcd_network_client_grpc_received_bytes_total
etcd_network_client_grpc_sent_bytes_total
etcd_network_peer_received_bytes_total
etcd_network_peer_received_failures_total
etcd_network_peer_round_trip_time_seconds_bucket
etcd_network_peer_sent_bytes_total
etcd_network_peer_sent_failures_total
etcd_server_has_leader
etcd_server_id
etcd_server_leader_changes_seen_total
etcd_server_proposals_applied_total
etcd_server_proposals_committed_total
etcd_server_proposals_failed_total
etcd_server_proposals_pending
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
grpc_server_handled_total
grpc_server_handling_seconds_bucket
grpc_server_started_total
process_cpu_seconds_total
process_max_fds
process_open_fds
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes

Preparing the Integration

Add Certificate for Sysdig Agent

Disclaimer: This patch only works in vanilla Kubernetes

kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"volumes":[{"hostPath":{"path":"/etc/kubernetes/pki/etcd-manager-main","type":"DirectoryOrCreate"},"name":"etcd-certificates"}]}}}}'
kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"containers":[{"name":"sysdig","volumeMounts": [{"mountPath": "/etc/kubernetes/pki/etcd-manager","name": "etcd-certificates"}]}]}}}}'

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: etcd-default
  scheme: https
  tls_config:
    insecure_skip_verify: true
    cert_file: /etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt
    key_file: /etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/etcd-manager-main.+'
  - source_labels:
    - __address__
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d)?
    replacement: $1:4001
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_failed_total|go_build_info|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|process_cpu_seconds_total|grpc_server_started_total|grpc_server_started_total|grpc_server_started_total|grpc_server_handled_total|etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_disk_backend_commit_duration_seconds_bucket|sysdig_container_memory_used_bytes|etcd_server_proposals_committed_total|etcd_server_proposals_applied_total|sysdig_container_cpu_cores_used|go_goroutines|grpc_server_handled_total|grpc_server_handled_total|etcd_server_id|etcd_disk_backend_commit_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|process_max_fds|process_open_fds|etcd_server_proposals_pending|etcd_network_peer_sent_failures_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_client_grpc_sent_bytes_total|etcd_network_client_grpc_received_bytes_total|etcd_network_peer_sent_bytes_total|etcd_network_peer_received_bytes_total|grpc_server_handling_seconds_bucket|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count|etcd_mvcc_db_total_size_in_bytes)
    action: keep

22 - Kubernetes kube-proxy

Metrics, Dashboards, Alerts and more for Kubernetes kube-proxy Integration in Sysdig Monitor.
Kubernetes kube-proxy

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 10 metrics.

List of Alerts

Alert | Description | Format
[KubeProxy] Kube Proxy Down | KubeProxy detected down | Prometheus
[KubeProxy] High Rest Client Latency | High Rest Client Latency detected | Prometheus
[KubeProxy] High Rule Sync Latency | High Rule Sync Latency detected | Prometheus
[KubeProxy] Too Many 500 Code | Too Many 500 Code detected | Prometheus

List of Dashboards

Kubernetes Proxy

The dashboard provides information on the Kubernetes Proxy. Kubernetes Proxy

List of Metrics

Metric name
go_goroutines
kube_node_info
kubeproxy_network_programming_duration_seconds_bucket
kubeproxy_network_programming_duration_seconds_count
kubeproxy_sync_proxy_rules_duration_seconds_bucket
kubeproxy_sync_proxy_rules_duration_seconds_count
process_cpu_seconds_total
process_resident_memory_bytes
rest_client_request_duration_seconds_bucket
rest_client_requests_total

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: kubernetes-kube-proxy-default
  honor_labels: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/kube-proxy.+'
  - source_labels:
    - __address__
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d+)?
    replacement: $1:10249
  - source_labels:
    - __meta_kubernetes_pod_name
    action: replace
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (up|kubeproxy_sync_proxy_rules_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_network_programming_duration_seconds_bucket|rest_client_requests_total|rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_bucket|process_resident_memory_bytes|process_cpu_seconds_total|go_goroutines|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
    action: keep

23 - Kubernetes kubelet

Metrics, Dashboards, Alerts and more for Kubernetes kubelet Integration in Sysdig Monitor.
Kubernetes kubelet

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 25 metrics.

List of Alerts

Alert | Description | Format
[k8s-kubelet] Kubelet Too Many Pods | Kubelet Too Many Pods | Prometheus
[k8s-kubelet] Kubelet Pod Lifecycle Event Generator Duration High | Kubelet Pod Lifecycle Event Generator Duration High | Prometheus
[k8s-kubelet] Kubelet Pod StartUp Latency High | Kubelet Pod StartUp Latency High | Prometheus
[k8s-kubelet] Kubelet Down | Kubelet Down | Prometheus

List of Dashboards

Kubernetes Kubelet

The dashboard provides information on the Kubernetes Kubelet. Kubernetes Kubelet

List of Metrics

Metric name
go_goroutines
kube_node_status_capacity_pods
kube_node_status_condition
kubelet_cgroup_manager_duration_seconds_bucket
kubelet_cgroup_manager_duration_seconds_count
kubelet_node_config_error
kubelet_pleg_relist_duration_seconds_bucket
kubelet_pleg_relist_interval_seconds_bucket
kubelet_pod_start_duration_seconds_bucket
kubelet_pod_start_duration_seconds_count
kubelet_pod_worker_duration_seconds_bucket
kubelet_pod_worker_duration_seconds_count
kubelet_running_containers
kubelet_running_pod_count
kubelet_running_pods
kubelet_runtime_operations_duration_seconds_bucket
kubelet_runtime_operations_errors_total
kubelet_runtime_operations_total
process_cpu_seconds_total
process_resident_memory_bytes
rest_client_request_duration_seconds_bucket
rest_client_requests_total
storage_operation_duration_seconds_bucket
storage_operation_duration_seconds_count
volume_manager_total_volumes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: k8s-kubelet-default
  scrape_interval: 60s
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: __HOSTIPS__
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
    replacement: kube_node_label_$1
  - replacement: localhost:10250
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: kube_node_name
  - action: replace
    source_labels: [__meta_kubernetes_namespace]
    target_label: kube_namespace_name
  metric_relabel_configs:
  # - source_labels: [__name__]
  #   regex: "kubelet_volume(.+)|storage(.+)"
  #   action: drop
  - source_labels: [__name__]
    regex: (go_goroutines|kube_node_status_capacity_pods|kube_node_status_condition|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_containers|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubernetes_build_info|process_cpu_seconds_total|process_resident_memory_bytes|rest_client_request_duration_seconds_bucket|rest_client_requests_total|volume_manager_total_volumes)
    action: keep

24 - Kubernetes PVC

Metrics, Dashboards, Alerts and more for Kubernetes PVC Integration in Sysdig Monitor.
Kubernetes PVC

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 9 metrics.

List of Alerts

Alert | Description | Format
[k8s-pvc] PV Not Available | Persistent Volume not available | Prometheus
[k8s-pvc] PVC Pending For a Long Time | Persistent Volume Claim not available | Prometheus
[k8s-pvc] PVC Lost | Persistent Volume Claim lost | Prometheus
[k8s-pvc] PVC Storage Usage Is Reaching The Limit | Persistent Volume Claim storage at 95% | Prometheus
[k8s-pvc] PVC Inodes Usage Is Reaching The Limit | PVC inodes usage is reaching the limit | Prometheus
[k8s-pvc] PV Full In Four Days | Persistent Volume full in four days | Prometheus

List of Dashboards

PVC and Storage

The dashboard provides information on the Kubernetes PVC and Storage. PVC and Storage

List of Metrics

Metric name
kube_persistentvolume_status_phase
kube_persistentvolumeclaim_status_phase
kubelet_volume_stats_available_bytes
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_used
kubelet_volume_stats_used_bytes
storage_operation_duration_seconds_bucket
storage_operation_duration_seconds_count

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: k8s-pvc-default
  scrape_interval: 60s
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: __HOSTIPS__
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
    replacement: kube_node_label_$1
  - replacement: localhost:10250
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: kube_node_name
  metric_relabel_configs:
  # - source_labels: [__name__]
  #   regex: "kubelet_volume(.+)"
  #   action: keep
  - source_labels: [__name__]
    regex: (kube_persistentvolume_status_phase|kube_persistentvolumeclaim_status_phase|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name

25 - Kubernetes Scheduler

Metrics, Dashboards, Alerts and more for Kubernetes Scheduler Integration in Sysdig Monitor.
Kubernetes Scheduler

This integration is enabled by default.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 45 metrics.

List of Alerts

Alert | Description | Format
[Kubernetes Scheduler] Failed Attempts to Schedule Pods | The error rate of attempts to schedule pods is high. | Prometheus

List of Dashboards

Kubernetes Scheduler

The dashboard provides information on the Kubernetes Scheduler. Kubernetes Scheduler

List of Metrics

Metric name
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
process_cpu_seconds_total
process_max_fds
process_open_fds
rest_client_request_duration_seconds_count
rest_client_request_duration_seconds_sum
rest_client_requests_total
scheduler_e2e_scheduling_duration_seconds_count
scheduler_e2e_scheduling_duration_seconds_sum
scheduler_pending_pods
scheduler_pod_scheduling_attempts_count
scheduler_pod_scheduling_attempts_sum
scheduler_schedule_attempts_total
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes
workqueue_adds_total
workqueue_depth
workqueue_queue_duration_seconds_count
workqueue_queue_duration_seconds_sum
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds_count
workqueue_work_duration_seconds_sum

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: kube-scheduler-default
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'kube-system/kube-scheduler.+'
  - source_labels:
    - __address__
    action: replace
    target_label: __address__
    regex: (.+?)(\\:\\d)?
    replacement: $1:10259
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
    action: keep

26 - Kubernetes storage

Metrics, Dashboards, Alerts and more for Kubernetes storage Integration in Sysdig Monitor.
Kubernetes storage

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 8 metrics.

List of Alerts

Alert | Description | Format
[k8s-storage] High Storage Error Rate | High Storage Error Rate | Prometheus
[k8s-storage] High Storage Latency | High Storage Latency | Prometheus

List of Metrics

Metric name
kube_persistentvolume_status_phase
kube_persistentvolumeclaim_status_phase
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_inodes
kubelet_volume_stats_inodes_used
kubelet_volume_stats_used_bytes
storage_operation_duration_seconds_bucket
storage_operation_duration_seconds_count

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This is the default agent job for this integration:

- job_name: k8s-storage-default
  scrape_interval: 60s
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_node_address_InternalIP]
    regex: __HOSTIPS__
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
    replacement: kube_node_label_$1
  - replacement: localhost:10250
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_node_name]
    target_label: kube_node_name
  metric_relabel_configs:
  # - source_labels: [__name__]
  #   regex: "storage(.+)"
  #   action: keep
  - source_labels: [__name__]
    regex: (storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name

27 - Linux

Metrics, Dashboards, Alerts and more for Linux Integration in Sysdig Monitor.
Linux

This integration is enabled by default.

This integration has 19 metrics.

List of Alerts

Alert | Description | Format
[Linux] High CPU Usage | The CPU of the Linux instance reached 95% usage | Prometheus
[Linux] High Disk Usage | Disk full over 95% in host | Prometheus
[Linux] Disk Will Fill In 12 Hours | Disk full in 12h in host | Prometheus
[Linux] High Physical Memory Usage | High physical memory usage in instance | Prometheus

List of Dashboards

Linux Host Overview

The dashboard provides a general overview for a regular Linux host. Linux Host Overview

List of Metrics

Metric name
sysdig_fs_free_percent
sysdig_fs_used_percent
sysdig_host_cpu_cores_used_percent
sysdig_host_cpu_system_percent
sysdig_host_file_open_count
sysdig_host_file_total_bytes
sysdig_host_fs_free_bytes
sysdig_host_fs_used_percent
sysdig_host_memory_available_bytes
sysdig_host_memory_used_percent
sysdig_host_net_connection_in_count
sysdig_host_net_connection_out_count
sysdig_host_net_total_bytes
sysdig_program_cpu_used_percent
sysdig_program_file_open_count
sysdig_program_memory_used_percent
sysdig_program_net_connection_total_count
sysdig_program_net_request_in_count
sysdig_program_net_total_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Linux

The Linux integration uses the out-of-the-box sysdig_host_* and sysdig_program_* metrics in the dashboards and alerts.
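
As hedged sketches (not necessarily the exact bundled rules), the disk alerts listed above can be expressed with these metrics; the 95% threshold comes from the alert description, while the 1-hour lookbehind window is an assumption:

# Sketch: [Linux] High Disk Usage - disk over 95% full on a host
sysdig_fs_used_percent > 95

# Sketch: [Linux] Disk Will Fill In 12 Hours - linear projection of free space (43200 seconds = 12 hours)
predict_linear(sysdig_fs_free_percent[1h], 43200) < 0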

Agent Configuration

This integration has no default agent job.

28 - Memcached

Metrics, Dashboards, Alerts and more for Memcached Integration in Sysdig Monitor.
Memcached

This integration is enabled by default.

Versions supported: > v1.5

This integration uses a sidecar exporter that is available in UBI or scratch base image.

This integration has 13 metrics.

Timeseries generated: 20 series per instance

List of Alerts

Alert | Description | Format
[Memcached] Instance Down | Instance is not reachable | Prometheus
[Memcached] Low UpTime | Uptime of less than 1 hour in a Memcached instance | Prometheus
[Memcached] Connection Throttled | Connection throttled because the max number of requests per event process was reached | Prometheus
[Memcached] Connections Close To The Limit 85% | The number of connections is close to the limit | Prometheus
[Memcached] Connections Limit Reached | Reached the maximum number of connections and caused a connection error | Prometheus

List of Dashboards

Memcached

The dashboard provides information on the status and performance of the Memcached instance. Memcached

List of Metrics

Metric name
memcached_commands_total
memcached_connections_listener_disabled_total
memcached_connections_yielded_total
memcached_current_bytes
memcached_current_connections
memcached_current_items
memcached_items_evicted_total
memcached_items_reclaimed_total
memcached_items_total
memcached_limit_bytes
memcached_max_connections
memcached_up
memcached_uptime_seconds

Preparing the Integration

No preparations are required for this integration.

Installing

An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, you can also use this Helm chart for expert users: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/memcached-exporter
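
As a hedged sketch of installing that chart manually with Helm, assuming the chart is also published in the sysdiglabs Helm repository used elsewhere in this guide (release name and namespace are placeholders; the chart’s README is the authoritative source for its values):

helm install -n Your-Application-Namespace memcached-exporter \
  --repo https://sysdiglabs.github.io/integrations-charts memcached-exporter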

Agent Configuration

This is the default agent job for this integration:

- job_name: memcached-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "memcached"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (memcached_commands_total|memcached_connections_listener_disabled_total|memcached_connections_yielded_total|memcached_current_bytes|memcached_current_connections|memcached_current_items|memcached_items_evicted_total|memcached_items_reclaimed_total|memcached_items_total|memcached_limit_bytes|memcached_max_connections|memcached_up|memcached_uptime_seconds)
    action: keep

29 - MongoDB

Metrics, Dashboards, Alerts and more for MongoDB Integration in Sysdig Monitor.
MongoDB

This integration is enabled by default.

Versions supported: > v4.2

This integration uses a standalone exporter that is available in UBI or scratch base image.

This integration has 28 metrics.

Timeseries generated: 500 series per instance

List of Alerts

Alert | Description | Format
[MongoDB] Instance Down | Mongo server detected down by instance | Prometheus
[MongoDB] Uptime less than one hour | Uptime of less than one hour in the instance | Prometheus
[MongoDB] Asserts detected | Asserts detected in the instance | Prometheus
[MongoDB] High Latency | High latency in instance | Prometheus
[MongoDB] High Ticket Utilization | Ticket usage over 75% in instance | Prometheus
[MongoDB] Recurrent Cursor Timeout | Recurrent cursors timeout in instance | Prometheus
[MongoDB] Recurrent Memory Page Faults | Recurrent memory page faults in instance | Prometheus

List of Dashboards

MongoDB Instance Health

The dashboard provides information on the connections, cache hit rate, error rate, latency and traffic of one of the databases of the MongoDB instance. MongoDB Instance Health

MongoDB Database Details

The dashboard provides information on the status, error rate and resource usage of a MongoDB instance. MongoDB Database Details

List of Metrics

Metric name
mongodb_asserts_total
mongodb_connections
mongodb_extra_info_page_faults_total
mongodb_instance_uptime_seconds
mongodb_memory
mongodb_mongod_db_collections_total
mongodb_mongod_db_data_size_bytes
mongodb_mongod_db_index_size_bytes
mongodb_mongod_db_indexes_total
mongodb_mongod_db_objects_total
mongodb_mongod_global_lock_client
mongodb_mongod_global_lock_current_queue
mongodb_mongod_global_lock_ratio
mongodb_mongod_metrics_cursor_open
mongodb_mongod_metrics_cursor_timed_out_total
mongodb_mongod_op_latencies_latency_total
mongodb_mongod_op_latencies_ops_total
mongodb_mongod_wiredtiger_cache_bytes
mongodb_mongod_wiredtiger_cache_bytes_total
mongodb_mongod_wiredtiger_cache_evicted_total
mongodb_mongod_wiredtiger_cache_pages
mongodb_mongod_wiredtiger_concurrent_transactions_out_tickets
mongodb_mongod_wiredtiger_concurrent_transactions_total_tickets
mongodb_network_bytes_total
mongodb_network_metrics_num_requests_total
mongodb_op_counters_total
mongodb_up
net.error.count

Preparing the Integration

Create Credentials for MongoDB Exporter

If you want to use a non-admin user for the exporter, create a user and grant it the roles required to scrape statistics.

In the mongo shell:

use admin
db.auth("<YOUR-ADMIN-USER>", "<YOUR-ADMIN-PASSWORD>")
db.createUser(
   {
      user: "<YOUR-EXPORTER-USER>",
      pwd: "<YOUR-EXPORTER-PASSWORD>",
      roles: [
        { role: "clusterMonitor", db: "admin" },
        { role: "read", db: "admin" },
        { role: "read", db: "local" }
      ]
   }
)

Create Kubernetes Secret for Connection and Authentication

To configure authentication, do the following:

  1. Create a text file with the connection string (mongodb-uri) for your MongoDB by using these examples:
# Basic authentication
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>

# TLS
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>/admin?tls=true&tlsCertificateKeyFile=/etc/mongodb/mongodb-exporter-key.pem&tlsAllowInvalidCertificates=true&tlsCAFile=/etc/mongodb/mongodb-exporter-ca.pem

# SSL
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>/admin?ssl=true&sslclientcertificatekeyfile=/etc/mongodb/mongodb-exporter-key.pem&sslinsecure=true&sslcertificateauthorityfile=/etc/mongodb/mongodb-exporter-ca.pem
  2. Create the secret for the connection string:
kubectl create secret -n Your-Exporter-Namespace generic Your-Mongodb-Uri-Secret-Name \
  --from-file=mongodb-uri=<route-to-file-with-mongodb-uri.txt>
  3. In case of TLS or SSL authentication, create the secret with the private key and the certificate authority (CA). If you do not have a CA file, you can use an empty file instead:
kubectl create secret -n Your-Exporter-Namespace generic mongodb-exporter-auth \
  --from-file=mongodb-key=<route-to-your-private-key.pem> \
  --from-file=mongodb-ca=<route-to-your-ca.pem>
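
Optionally, verify that both secrets exist before installing the exporter. This is a quick check using the secret names created above:

kubectl -n Your-Exporter-Namespace get secret Your-Mongodb-Uri-Secret-Name mongodb-exporter-auth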

Installing

An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/mongodb-exporter
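
For reference, a minimal installation sketch with Helm, assuming the mongodb-exporter chart from the repository above; the values used to reference the connection-string secret depend on the chart, so check its values.yaml before installing:

helm install mongodb-exporter mongodb-exporter \
  --repo https://sysdiglabs.github.io/integrations-charts \
  -n Your-Exporter-Namespace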

Monitoring and Troubleshooting MongoDB

This document describes important metrics and queries that you can use to monitor and troubleshoot MongoDB.

Tracking metrics status

You can track MongoDB metrics status with the following alert: Exporter process is not serving metrics

# [MongoDB] Exporter Process Down
absent(mongodb_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
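
As an additional example, a minimal sketch of a query matching the [MongoDB] High Ticket Utilization alert description (ticket usage over 75%), built from the listed WiredTiger ticket metrics. It is an illustration, not the shipped alert definition:

# [MongoDB] High Ticket Utilization (sketch)
mongodb_mongod_wiredtiger_concurrent_transactions_out_tickets / mongodb_mongod_wiredtiger_concurrent_transactions_total_tickets > 0.75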

Agent Configuration

This is the default agent job for this integration:

- job_name: mongodb-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "mongodb"
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
    target_label: kube_workload_type
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
    target_label: kube_workload_name
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

30 - MySQL

Metrics, Dashboards, Alerts and more for MySQL Integration in Sysdig Monitor.
MySQL

This integration is enabled by default.

Versions supported: > v5.7

This integration uses a standalone exporter that is available in UBI or scratch base image.

This integration has 47 metrics.

Timeseries generated: 1005 series per instance

List of Alerts

Alert | Description | Format
[MySQL] Mysql Down | MySQL instance is down | Prometheus
[MySQL] Mysql Restarted | MySQL has just been restarted, less than one minute ago | Prometheus
[MySQL] Mysql Too many Connections (>80%) | More than 80% of MySQL connections are in use | Prometheus
[MySQL] Mysql High Threads Running | More than 60% of MySQL connections are in running state | Prometheus
[MySQL] Mysql High Open Files | More than 80% of MySQL files open | Prometheus
[MySQL] Mysql Slow Queries | MySQL server has new slow queries | Prometheus
[MySQL] Mysql Innodb Log Waits | MySQL InnoDB log writes stalling | Prometheus
[MySQL] Mysql Slave Io Thread Not Running | MySQL Slave IO thread not running | Prometheus
[MySQL] Mysql Slave Sql Thread Not Running | MySQL Slave SQL thread not running | Prometheus
[MySQL] Mysql Slave Replication Lag | MySQL Slave replication lag | Prometheus

List of Dashboards

MySQL

The dashboard provides information on the status, error rate and resource usage of a MySQL instance. MySQL

List of Metrics

Metric name
mysql_global_status_aborted_clients
mysql_global_status_aborted_connects
mysql_global_status_buffer_pool_pages
mysql_global_status_bytes_received
mysql_global_status_bytes_sent
mysql_global_status_commands_total
mysql_global_status_connection_errors_total
mysql_global_status_innodb_buffer_pool_read_requests
mysql_global_status_innodb_buffer_pool_reads
mysql_global_status_innodb_log_waits
mysql_global_status_innodb_mem_adaptive_hash
mysql_global_status_innodb_mem_dictionary
mysql_global_status_innodb_page_size
mysql_global_status_questions
mysql_global_status_select_full_join
mysql_global_status_select_full_range_join
mysql_global_status_select_range_check
mysql_global_status_select_scan
mysql_global_status_slow_queries
mysql_global_status_sort_merge_passes
mysql_global_status_sort_range
mysql_global_status_sort_rows
mysql_global_status_sort_scan
mysql_global_status_table_locks_immediate
mysql_global_status_table_locks_waited
mysql_global_status_table_open_cache_hits
mysql_global_status_table_open_cache_misses
mysql_global_status_threads_cached
mysql_global_status_threads_connected
mysql_global_status_threads_created
mysql_global_status_threads_running
mysql_global_status_uptime
mysql_global_variables_innodb_additional_mem_pool_size
mysql_global_variables_innodb_log_buffer_size
mysql_global_variables_innodb_open_files
mysql_global_variables_key_buffer_size
mysql_global_variables_max_connections
mysql_global_variables_open_files_limit
mysql_global_variables_query_cache_size
mysql_global_variables_thread_cache_size
mysql_global_variables_tokudb_cache_size
mysql_slave_status_master_server_id
mysql_slave_status_seconds_behind_master
mysql_slave_status_slave_io_running
mysql_slave_status_slave_sql_running
mysql_slave_status_sql_delay
mysql_up

Preparing the Integration

Create Credentials for MySQL Exporter

  1. Create the user and password for the exporter in the database:
CREATE USER 'exporter' IDENTIFIED BY 'YOUR-PASSWORD' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter';

Replace the user name and the password in the SQL statement with your own values.

  2. Create a mysql-exporter.cnf file with the credentials of the exporter:
[client]
user = exporter
password = "YOUR-PASSWORD"
host=YOUR-DB-IP
  3. In your cluster, create the secret with the mysql-exporter.cnf file. This file will be mounted in the exporter to authenticate with the database:
kubectl create secret -n Your-Application-Namespace generic mysql-exporter \
  --from-file=.my.cnf=./mysql-exporter.cnf

Using SSL Authentication

If your database requires SSL authentication, you need to create secrets with the certificates. To do so, create the secret with SSL certificates for the exporter:

kubectl create secret -n Your-Application-Namespace generic mysql-exporter \
  --from-file=.my.cnf=./mysql-exporter.cnf \
  --from-file=ca.pem=./certs/ca.pem \
  --from-file=client-key.pem=./certs/client-key.pem \
  --from-file=client-cert.pem=./certs/client-cert.pem

In the mysql-exporter.cnf file, include the following lines to point the exporter to the certificates:

[client]
user = exporter
password = "YOUR-PASSWORD"
host=YOUR-DB-IP
ssl-ca=/lib/cert/ca.pem
ssl-key=/lib/cert/client-key.pem
ssl-cert=/lib/cert/client-cert.pem

Installing

An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/mysql-exporter

Monitoring and Troubleshooting MySQL

This document describes important metrics and queries that you can use to monitor and troubleshoot MySQL.

Tracking metrics status

You can track MySQL metrics status with the following alert: Exporter process is not serving metrics

# [MySQL] Exporter Process Down
absent(mysql_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
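
As an additional example, a minimal sketch of a query matching the [MySQL] Mysql Too many Connections (>80%) alert description, built from the listed connection metrics. It is an illustration, not the shipped alert definition:

# [MySQL] Too many Connections (sketch)
mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8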

Agent Configuration

This is the default agent job for this integration:

- job_name: mysql-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "mysql"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
    target_label: kube_workload_type
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
    target_label: kube_workload_name
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (mysql_global_status_aborted_clients|mysql_global_status_aborted_connects|mysql_global_status_buffer_pool_pages|mysql_global_status_bytes_received|mysql_global_status_bytes_sent|mysql_global_status_commands_total|mysql_global_status_connection_errors_total|mysql_global_status_innodb_buffer_pool_read_requests|mysql_global_status_innodb_buffer_pool_reads|mysql_global_status_innodb_log_waits|mysql_global_status_innodb_mem_adaptive_hash|mysql_global_status_innodb_mem_dictionary|mysql_global_status_innodb_page_size|mysql_global_status_questions|mysql_global_status_select_full_join|mysql_global_status_select_full_range_join|mysql_global_status_select_range_check|mysql_global_status_select_scan|mysql_global_status_slow_queries|mysql_global_status_sort_merge_passes|mysql_global_status_sort_range|mysql_global_status_sort_rows|mysql_global_status_sort_scan|mysql_global_status_table_locks_immediate|mysql_global_status_table_locks_waited|mysql_global_status_table_open_cache_hits|mysql_global_status_table_open_cache_misses|mysql_global_status_threads_cached|mysql_global_status_threads_connected|mysql_global_status_threads_created|mysql_global_status_threads_running|mysql_global_status_uptime|mysql_global_variables_innodb_additional_mem_pool_size|mysql_global_variables_innodb_log_buffer_size|mysql_global_variables_innodb_open_files|mysql_global_variables_key_buffer_size|mysql_global_variables_max_connections|mysql_global_variables_open_files_limit|mysql_global_variables_query_cache_size|mysql_global_variables_thread_cache_size|mysql_global_variables_tokudb_cache_size|mysql_slave_status_master_server_id|mysql_slave_status_seconds_behind_master|mysql_slave_status_slave_io_running|mysql_slave_status_slave_sql_running|mysql_slave_status_sql_delay|mysql_up)
    action: keep

31 - NGINX

Metrics, Dashboards, Alerts and more for NGINX Integration in Sysdig Monitor.
NGINX

This integration is enabled by default.

Versions supported: > v12

This integration uses a sidecar exporter that is available in UBI or scratch base image.

This integration has 12 metrics.

Timeseries generated: 8 series per nginx container

List of Alerts

Alert | Description | Format
[Nginx] No Instances Up | No NGINX instances up | Prometheus

List of Dashboards

Nginx

The dashboard provides information on the status of the NGINX server and Golden Signals. Nginx

List of Metrics

Metric name
net.bytes.in
net.bytes.out
net.http.error.count
net.http.request.count
net.http.request.time
nginx_connections_accepted
nginx_connections_active
nginx_connections_handled
nginx_connections_reading
nginx_connections_waiting
nginx_connections_writing
nginx_up

Preparing the Integration

Enable Nginx stub_status Module

The exporter can be installed as a sidecar of the pod with the Nginx server. To make Nginx expose an endpoint for scraping metrics, enable the stub_status module. If your Nginx configuration is defined inside a Kubernetes ConfigMap, add the following snippet to enable the stub_status module:

server {
  listen       80;
  server_name  localhost;
  location /nginx_status {
    stub_status on;
    access_log  on;
    allow all;  # REPLACE with your access policy
  }
}

This is how the ConfigMap would look after adding this snippet:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    server {
      listen       80;
      server_name  localhost;
      location /nginx_status {
        stub_status on;
        access_log  on;
        allow all;  # REPLACE with your access policy
      }
    }    
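
Once the configuration is applied, you can optionally verify that the stub_status endpoint responds. This is an illustrative check: YOUR-NGINX-DEPLOYMENT is a placeholder, and it assumes curl is available in the NGINX container.

kubectl -n Your-Application-Namespace exec deploy/YOUR-NGINX-DEPLOYMENT -- curl -s http://localhost/nginx_status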

Installing

An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/nginx-exporter

Monitoring and Troubleshooting NGINX

This document describes important metrics and queries that you can use to monitor and troubleshoot NGINX.

Tracking metrics status

You can track NGINX metrics status with the following alert: Exporter process is not serving metrics

# [NGINX] Exporter Process Down
absent(nginx_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
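
As an additional example, an illustrative sketch built from the listed connection counters; a persistently positive difference suggests that NGINX is dropping connections:

# Dropped connections (sketch)
rate(nginx_connections_accepted[5m]) - rate(nginx_connections_handled[5m])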

Agent Configuration

This is the default agent job for this integration:

- job_name: nginx-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "nginx"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

32 - NGINX Ingress

Metrics, Dashboards, Alerts and more for NGINX Ingress Integration in Sysdig Monitor.
NGINX Ingress

This integration is enabled by default.

Versions supported: > v1.9.0

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 42 metrics.

Timeseries generated: 1500

This integration specifically supports kubernetes/ingress-nginx, and not other NGINX-Ingress versions like nginxinc/kubernetes-ingress.

List of Alerts

Alert | Description | Format
[Nginx-Ingress] High Http 4xx Error Rate | Too many HTTP requests with status 4xx (> 5%) | Prometheus
[Nginx-Ingress] High Http 5xx Error Rate | Too many HTTP requests with status 5xx (> 5%) | Prometheus
[Nginx-Ingress] High Latency | Nginx p99 latency is higher than 10 seconds | Prometheus
[Nginx-Ingress] Ingress Certificate Expiry | Nginx Ingress certificate will expire in less than 14 days | Prometheus

List of Dashboards

Nginx Ingress

The dashboard provides information on the error rate, resource usage, traffic and certificate expiration of the NGINX ingress. Nginx Ingress

List of Metrics

Metric name
go_build_info
go_gc_duration_seconds
go_gc_duration_seconds_count
go_gc_duration_seconds_sum
go_goroutines
go_memstats_buck_hash_sys_bytes
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
nginx_ingress_controller_config_last_reload_successful
nginx_ingress_controller_config_last_reload_successful_timestamp_seconds
nginx_ingress_controller_ingress_upstream_latency_seconds_count
nginx_ingress_controller_ingress_upstream_latency_seconds_sum
nginx_ingress_controller_nginx_process_connections
nginx_ingress_controller_nginx_process_cpu_seconds_total
nginx_ingress_controller_nginx_process_resident_memory_bytes
nginx_ingress_controller_request_duration_seconds_bucket
nginx_ingress_controller_request_duration_seconds_count
nginx_ingress_controller_request_duration_seconds_sum
nginx_ingress_controller_request_size_sum
nginx_ingress_controller_requests
nginx_ingress_controller_response_duration_seconds_count
nginx_ingress_controller_response_duration_seconds_sum
nginx_ingress_controller_response_size_sum
nginx_ingress_controller_ssl_expire_time_seconds
process_cpu_seconds_total
process_max_fds
process_open_fds

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting NGINX Ingress

This document describes important metrics and queries that you can use to monitor and troubleshoot NGINX Ingress.

Tracking metrics status

You can track NGINX Ingress metrics status with the following alert: Exporter process is not serving metrics

# [NGINX Ingress] Exporter Process Down
absent(nginx_ingress_controller_nginx_process_cpu_seconds_total{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
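
As an additional example, a minimal sketch of a query matching the [Nginx-Ingress] High Http 5xx Error Rate alert description (> 5%), assuming the status label exposed on nginx_ingress_controller_requests. It is an illustration, not the shipped alert definition:

# [Nginx-Ingress] High Http 5xx Error Rate (sketch)
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m])) > 0.05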

Agent Configuration

This is the default agent job for this integration:

- job_name: nginx-ingress-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    regex: true
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (controller|nginx-ingress-controller);(.{0}$)
    replacement: nginx-ingress
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "nginx-ingress"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (go_build_info|nginx_ingress_controller_config_last_reload_successful|nginx_ingress_controller_config_last_reload_successful_timestamp_seconds|nginx_ingress_controller_ingress_upstream_latency_seconds_count|nginx_ingress_controller_ingress_upstream_latency_seconds_sum|nginx_ingress_controller_nginx_process_connections|nginx_ingress_controller_nginx_process_cpu_seconds_total|process_max_fds|process_open_fds|nginx_ingress_controller_nginx_process_resident_memory_bytes|nginx_ingress_controller_request_duration_seconds_bucket|nginx_ingress_controller_request_duration_seconds_count|nginx_ingress_controller_request_duration_seconds_sum|nginx_ingress_controller_request_size_sum|nginx_ingress_controller_requests|nginx_ingress_controller_response_duration_seconds_count|nginx_ingress_controller_response_duration_seconds_sum|nginx_ingress_controller_response_size_sum|nginx_ingress_controller_ssl_expire_time_seconds|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads)
    action: keep

33 - NTP

Metrics, Dashboards, Alerts and more for NTP Integration in Sysdig Monitor.
NTP

This integration is enabled by default.

Versions supported: > v2

This integration uses a standalone exporter that is available in UBI or scratch base image.

This integration has 1 metric.

Timeseries generated: 4 series per node

List of Alerts

Alert | Description | Format
[Ntp] Drift is too high | Drift is too high | Prometheus

List of Dashboards

NTP

The dashboard provides information on the drift of each node. NTP

List of Metrics

Metric name
ntp_drift_seconds

Preparing the Integration

No preparations are required for this integration.

Installing

An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/ntp-exporter

Monitoring and Troubleshooting NTP

This document describes important metrics and queries that you can use to monitor and troubleshoot NTP.

Tracking metrics status

You can track NTP metrics status with the following alert: Exporter process is not serving metrics

# [NTP] Exporter Process Down
absent(ntp_drift_seconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
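
As an additional example, a minimal sketch of a query matching the [Ntp] Drift is too high alert description; the 0.1-second threshold is an assumption and should be tuned to your own tolerance:

# [Ntp] Drift is too high (sketch)
abs(ntp_drift_seconds) > 0.1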

Agent Configuration

This is the default agent job for this integration:

- job_name: ntp-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "ntp"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
    target_label: kube_workload_type
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
    target_label: kube_workload_name
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__meta_kubernetes_pod_node_name]
    target_label: kube_node_name
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

34 - OPA

Metrics, Dashboards, Alerts and more for OPA Integration in Sysdig Monitor.
OPA

This integration is enabled by default.

Versions supported: > v3.5.1

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 10 metrics.

Timeseries generated: 150 series for each Gatekeeper

List of Alerts

Alert | Description | Format
[Opa gatekeeper] Too much time since the last audit | More than 120 seconds since the last audit | Prometheus
[Opa gatekeeper] Spike of violations | More than 30 violations detected | Prometheus

List of Dashboards

OPA Gatekeeper

The dashboard provides information on the requests rate, latency, violations rate per constraint. OPA Gatekeeper

List of Metrics

Metric name
gatekeeper_audit_duration_seconds_bucket
gatekeeper_audit_last_run_time
gatekeeper_constraint_template_ingestion_count
gatekeeper_constraint_template_ingestion_duration_seconds_bucket
gatekeeper_constraint_templates
gatekeeper_constraints
gatekeeper_request_count
gatekeeper_request_duration_seconds_bucket
gatekeeper_request_duration_seconds_count
gatekeeper_violations

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OPA

This document describes important metrics and queries that you can use to monitor and troubleshoot OPA.

Tracking metrics status

You can track OPA metrics status with the following alert: Exporter process is not serving metrics

# [OPA] Exporter Process Down
absent(gatekeeper_request_count{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
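
As an additional example, a minimal sketch of a query matching the [Opa gatekeeper] Too much time since the last audit alert description, assuming gatekeeper_audit_last_run_time is a Unix timestamp in seconds. It is an illustration, not the shipped alert definition:

# [Opa gatekeeper] Too much time since the last audit (sketch)
time() - gatekeeper_audit_last_run_time > 120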

Agent Configuration

This is the default agent job for this integration:

- job_name: opa-default
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (manager);(.{0}$)
    replacement: opa-gatekeeper
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "opa-gatekeeper"
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_container_port_name
    regex: "metrics"
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels: [__address__,__meta_kubernetes_pod_container_port_name]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name

35 - OpenShift API-Server

Metrics, Dashboards, Alerts and more for OpenShift API-Server Integration in Sysdig Monitor.
OpenShift API-Server

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

Versions supported: > v4.8

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 17 metrics.

Timeseries generated: API Server generates ~5k timeseries

List of Alerts

Alert | Description | Format
[OpenShift API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus
[OpenShift API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus
[OpenShift API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus
[OpenShift API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus
[OpenShift API Server] High 4xx Request Error Rate | API-Server High 4xx Request Error Rate | Prometheus
[OpenShift API Server] High 5xx Request Error Rate | API-Server High 5xx Request Error Rate | Prometheus
[OpenShift API Server] High Request Latency | API-Server High Request Latency | Prometheus

List of Dashboards

OpenShift v4 API Server

The dashboard provides information on the K8s API Server and OpenShift API Server. OpenShift v4 API Server

List of Metrics

Metric name
apiserver_admission_controller_admission_duration_seconds_count
apiserver_admission_controller_admission_duration_seconds_sum
apiserver_admission_webhook_admission_duration_seconds_count
apiserver_admission_webhook_admission_duration_seconds_sum
apiserver_client_certificate_expiration_seconds_bucket
apiserver_client_certificate_expiration_seconds_count
apiserver_request_duration_seconds_count
apiserver_request_duration_seconds_sum
apiserver_request_total
apiserver_requested_deprecated_apis
apiserver_response_sizes_count
apiserver_response_sizes_sum
go_goroutines
process_cpu_seconds_total
process_resident_memory_bytes
workqueue_adds_total
workqueue_depth

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift API Server

Because OpenShift 4.X comes with both Prometheus and API servers ready to use, no additional installation is required. The OpenShift API server metrics are exposed using the /federate endpoint.

Learning how to monitor Kubernetes API server is vital when running Kubernetes in production. Monitoring kube-apiserver will help you detect and troubleshoot latency and errors, and validate whether the service performs as expected.

Here are some interesting queries to run and metrics to monitor for troubleshooting the OpenShift API Server.

Deprecated APIs

To check if deprecated API versions are used, use the following query:

sum by (kube_cluster_name, resource, removed_release,version)(apiserver_requested_deprecated_apis)

Certificate Expiration

Certificates are used to authenticate to the API server. Use the following query to check whether a certificate expires within the next week:

apiserver_client_certificate_expiration_seconds_count > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 7*24*60*60

API Server Latency

A latency spike is typically a sign of overload in the API server: the cluster may be under heavy load and the API server may need to be scaled out. Use the following query to check for latency spikes in the last 10 minutes.

sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_sum{verb!="WATCH"}[10m]))/sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_count{verb!="WATCH"}[10m]))

Request Error Rate

Request error rate means that the API server is responding with 5xx errors. If it rises, check the CPU and memory of your api-server pods.

sum by(kube_cluster_name)(rate(apiserver_request_total{code=~"5..",kube_cluster_name=~$cluster}[5m])) / sum by(kube_cluster_name)(rate(apiserver_request_total{kube_cluster_name=~$cluster}[5m])) > 0.05

Agent Configuration

This is the default agent job for this integration:

- job_name: openshift-apiserver-default
  honor_labels: true
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"apiserver_request_total|apiserver_request_duration_seconds_sum|apiserver_request_duration_seconds_count|workqueue_adds_total|workqueue_depth|apiserver_response_sizes_sum|apiserver_response_sizes_count|apiserver_requested_deprecated_apis|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_tls_handshake_errors_total|go_goroutines|process_resident_memory_bytes|process_cpu_seconds_total",code!="0"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:     
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-0'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (apiserver_request_total|apiserver_request_duration_seconds_sum|apiserver_request_duration_seconds_count|workqueue_adds_total|workqueue_depth|apiserver_response_sizes_sum|apiserver_response_sizes_count|apiserver_requested_deprecated_apis|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_tls_handshake_errors_total|go_goroutines|process_resident_memory_bytes|process_cpu_seconds_total)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name

36 - OpenShift Controller Manager

Metrics, Dashboards, Alerts and more for OpenShift Controller Manager Integration in Sysdig Monitor.
OpenShift Controller Manager

This integration is enabled by default.

Versions supported: > v4.8

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 12 metrics.

Timeseries generated: Controller Manager generates ~650 timeseries

List of Alerts

Alert | Description | Format
[OpenShift Controller Manager] Process Down | Controller Manager has disappeared from target discovery. | Prometheus
[OpenShift Controller Manager] High 4xx Request Error Rate | OpenShift Controller Manager High 4xx Request Error Rate | Prometheus
[OpenShift Controller Manager] High 5xx Request Error Rate | OpenShift Controller Manager High 5xx Request Error Rate | Prometheus

List of Dashboards

OpenShift v4 Controller Manager

The dashboard provides information on the K8s and OpenShift Controller Manager. OpenShift v4 Controller Manager

List of Metrics

Metric name
go_goroutines
rest_client_requests_total
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes
workqueue_adds_total
workqueue_depth
workqueue_queue_duration_seconds_count
workqueue_queue_duration_seconds_sum
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds_count
workqueue_work_duration_seconds_sum

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift Controller Manager

Because OpenShift 4.X comes with both Prometheus and Controller Manager ready to use, no additional installation is required. The OpenShift Controller Manager metrics are exposed using the /federate endpoint.

Here are some interesting queries to run and metrics to monitor for troubleshooting the OpenShift Controller Manager.

Work Queue

Work Queue Retries

The total number of retries that have been handled by the work queue. This value should be near 0.

topk(30,rate(workqueue_retries_total{job="openshift-controller-default"}[10m]))

Work Queue Latency

Queue latency is the time tasks spend in the queue before being processed.

topk(30,rate(workqueue_queue_duration_seconds_sum{job="openshift-controller-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job="openshift-controller-default"}[10m]))

Work Queue Depth

This query checks the depth of the queue. High values can indicate the saturation of the controller manager.

topk(30,rate(workqueue_depth{job="openshift-controller-default"}[10m]))

Controller Manager API Requests

Kube API Requests By Code

Check that there are no 5xx or 4xx error codes in the controller manager requests.

sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{job="openshift-controller-default",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{job="openshift-controller-default",code=~"5.."}[10m]))

Agent Configuration

This is the default agent job for this integration:

- job_name: openshift-controller-manager-default
  honor_labels: true
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"kube-controller-manager|controller-manager",__name__=~"workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_count|workqueue_work_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_work_duration_seconds_sum|workqueue_depth|workqueue_adds_total|rest_client_requests_total|go_goroutines"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:     
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-1'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
    # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (go_goroutines|rest_client_requests_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - source_labels: [job]
    target_label: controller
  - source_labels: [job]
    action: replace
    regex: (.*)
    target_label: job
    replacement: 'openshift-controller-default'
  - action: replace
    source_labels: [controller]
    regex: '(controller-manager)'
    target_label: controller
    replacement: 'openshift-$1'

37 - OpenShift CoreDNS

Metrics, Dashboards, Alerts and more for OpenShift CoreDNS Integration in Sysdig Monitor.
OpenShift CoreDNS

This integration is enabled by default.

Versions supported: > v4.8

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 13 metrics.

Timeseries generated: CoreDNS generates ~230 timeseries per dns-default pod

List of Alerts

Alert | Description | Format
[OpenShift CoreDNS] Process Down | CoreDNS has disappeared from target discovery. | Prometheus
[OpenShift CoreDNS] High Failed Responses | CoreDNS is returning failed responses. | Prometheus
[OpenShift CoreDNS] High Latency | CoreDNS response latency is higher than 60ms. | Prometheus
[OpenShift CoreDNS] Panics Observed | CoreDNS panics observed. | Prometheus

List of Dashboards

OpenShift v4 CoreDNS

The dashboard provides information on the OpenShift CoreDNS. OpenShift v4 CoreDNS

List of Metrics

Metric name
coredns_cache_hits_total
coredns_cache_misses_total
coredns_dns_request_duration_seconds_bucket
coredns_dns_request_size_bytes_bucket
coredns_dns_requests_total
coredns_dns_response_size_bytes_bucket
coredns_dns_responses_total
coredns_forward_request_duration_seconds_bucket
coredns_panics_total
coredns_plugin_enabled
go_goroutines
process_cpu_seconds_total
process_resident_memory_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift CoreDNS

Because OpenShift 4.X comes with both Prometheus and CoreDNS ready to use, no additional installation is required. OpenShift CoreDNS metrics are exposed on the SSL port 9154.

Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift 4.

CoreDNS Panics

Number of Panics

To check the CoreDNS number of panics, use the following query:

sum(coredns_panics_total)

Check the CoreDNS pod logs when you see this number growing.

DNS Requests

By Type

To filter DNS request types, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval])) by (type,kube_cluster_name,kube_pod_name))

By Protocol

To filter DNS request types by protocol, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval]) ) by (proto,kube_cluster_name,kube_pod_name))

By Zone

To filter DNS request types by zone, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval]) ) by (zone,kube_cluster_name,kube_pod_name))

By Latency

This metric detects degradation in the service. With the following query, you can check the 99th percentile of request latency.

histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by(server, zone, le))

Error Rate

Watch this metric carefully, filtering by the response status code (for example 200, 400, 404, or 500).

sum by (server, status)(coredns_dns_https_responses_total)

Cache

Cache Hit

To check the cache hit rate, use the following query:

sum(rate(coredns_cache_hits_total[$__interval])) by (type,kube_cluster_name,kube_pod_name)

Cache Miss

To check the cache miss rate, use the following query:

sum(rate(coredns_cache_misses_total[$__interval])) by(server,kube_cluster_name,kube_pod_name)

Agent Configuration

This is the default agent job for this integration:

- job_name: openshift-dns-default
  honor_labels: true
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https        
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-dns/dns-default.+'
  - source_labels:
    - __address__
    action: keep
    regex: (.*:9154)
  - source_labels:
    - __meta_kubernetes_pod_name
    action: replace
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_requests_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|coredns_forward_request_duration_seconds_bucket|coredns_panics_total|coredns_plugin_enabled|go_goroutines|process_cpu_seconds_total|process_resident_memory_bytes)
    action: keep

38 - OpenShift Etcd

Metrics, Dashboards, Alerts and more for OpenShift Etcd Integration in Sysdig Monitor.
OpenShift Etcd

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

Versions supported: > v4.8

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 32 metrics.

Timeseries generated: Etcd generates ~1200 timeseries per etcd-ip pod

List of Alerts

Alert | Description | Format
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members. | Prometheus
[OpenShiftEtcd] Etcd No Leader | Member has no leader. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests | Prometheus
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance. | Prometheus
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high. | Prometheus
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed HTTP Requests | High number of failed HTTP requests | Prometheus
[OpenShiftEtcd] Etcd HTTP Requests Slow | There are slow HTTP requests. | Prometheus
[OpenShiftEtcd] Etcd Excessive Database Growth | Etcd cluster database is growing very fast. | Prometheus

List of Dashboards

OpenShift v4 Etcd

The dashboard provides information on the OpenShift Etcd. OpenShift v4 Etcd

List of Metrics

Metric name
etcd_debugging_mvcc_db_total_size_in_bytes
etcd_disk_backend_commit_duration_seconds_bucket
etcd_disk_wal_fsync_duration_seconds_bucket
etcd_grpc_proxy_cache_hits_total
etcd_grpc_proxy_cache_misses_total
etcd_http_failed_total
etcd_http_received_total
etcd_http_successful_duration_seconds_bucket
etcd_mvcc_db_total_size_in_bytes
etcd_network_client_grpc_received_bytes_total
etcd_network_client_grpc_sent_bytes_total
etcd_network_peer_received_bytes_total
etcd_network_peer_received_failures_total
etcd_network_peer_round_trip_time_seconds_bucket
etcd_network_peer_sent_bytes_total
etcd_network_peer_sent_failures_total
etcd_server_has_leader
etcd_server_id
etcd_server_leader_changes_seen_total
etcd_server_proposals_applied_total
etcd_server_proposals_committed_total
etcd_server_proposals_failed_total
etcd_server_proposals_pending
etcd_server_quota_backend_bytes
go_goroutines
grpc_server_handled_total
grpc_server_handling_seconds_bucket
grpc_server_started_total
process_max_fds
process_open_fds
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift Etcd

Because OpenShift 4.X comes with both Prometheus and etcd ready to use, no additional installation is required. OpenShift Etcd metrics are exposed using the /federate endpoint.

Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift Etcd.

Etcd Consensus & Leader

Problems in the leader and consensus of the etcd cluster can cause outages in the cluster.

Etcd Leader

  • If a member does not have a leader, it is totally unavailable.
  • If all the members in a cluster do not have any leader, the entire cluster is totally unavailable.

Check the leader using this query:

count(etcd_server_id) % 2

If the query returns 1, etcd has a leader.

Leader Changes

Rapid leadership changes significantly impact etcd performance and can also mean that the leader is unstable, perhaps due to network connectivity issues or excessive load on the etcd cluster.

Check for leader changes in the last hour:

max(increase(etcd_server_leader_changes_seen_total[60m]))

Failed Proposals

Check if etcd has failed proposals. Failing proposals are caused by two issues:

  • Temporary failures related to a leader election
  • Longer downtime caused by a loss of quorum in the cluster
max(rate(etcd_server_proposals_failed_total[60m]))

Pending Proposals

A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.

sum(etcd_server_proposals_pending)

Total Number of Consensus Proposals Committed

The etcd server applies every committed proposal asynchronously.

Check that the difference between proposals committed and proposals applied remains small (within a few thousand), even under high load:

  • If the difference between them continues to rise, the etcd server is overloaded.
  • This might happen when applying expensive queries like heavy range queries or large txn operations.

Proposals committed:

sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)

Proposals applied:

sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)

gRPC

Error Rate

Check the gRPC error rate. These errors are most likely related to networking issues.

sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name,kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name,kube_pod_name)

gRPC Traffic

Check for unusual spikes in the traffic. They could be related to networking issues.

rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])

Disk

Disk Sync

Check if the fsync and commit latencies are below limits:

  • High disk operation latencies often indicate disk issues.
  • They may cause high request latency or make the cluster unstable.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))

DB Size

Check whether the DB size keeps increasing. If it does, defragment etcd to decrease it.

etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}

Networking Between Peers

This is only applicable to multi-master.

Errors from / to Peer

Check the total number of failures sent from peers:

rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])

Check the total number of failures received by peers:

rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])

Agent Configuration

This is the default agent job for this integration:

- job_name: openshift-etcd-default
  honor_labels: true
  scheme: https
  bearer_token_file: /run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"etcd"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:     
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-1'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
    # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_network_peer_sent_failures_total|etcd_server_has_leader|etcd_server_id|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|process_max_fds|process_open_fds|etcd_mvcc_db_total_size_in_bytes|etcd_server_quota_backend_bytes)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - action: replace
    source_labels: [endpoint]
    target_label: container_name

39 - OpenShift Kubelet

Metrics, Dashboards, Alerts and more for OpenShift Kubelet Integration in Sysdig Monitor.
OpenShift Kubelet

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

Versions supported: > v4.7

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 25 metrics.

Timeseries generated: Kubelet generates ~1200 timeseries per node

List of Alerts

Alert | Description | Format
[openshift-kubelet] Kubelet Too Many Pods | Kubelet Too Many Pods | Prometheus
[openshift-kubelet] Kubelet Pod Lifecycle Event Generator Duration High | Kubelet Pod Lifecycle Event Generator Duration High | Prometheus
[openshift-kubelet] Kubelet Pod StartUp Latency High | Kubelet Pod StartUp Latency High | Prometheus
[openshift-kubelet] Kubelet Down | Kubelet Down | Prometheus

List of Dashboards

OpenShift v4 Kubelet

The dashboard provides information on the OpenShift Kubelet. OpenShift v4 Kubelet

List of Metrics

Metric name
go_goroutines
kube_node_status_capacity_pods
kube_node_status_condition
kubelet_cgroup_manager_duration_seconds_bucket
kubelet_cgroup_manager_duration_seconds_count
kubelet_node_config_error
kubelet_pleg_relist_duration_seconds_bucket
kubelet_pleg_relist_interval_seconds_bucket
kubelet_pod_start_duration_seconds_bucket
kubelet_pod_start_duration_seconds_count
kubelet_pod_worker_duration_seconds_bucket
kubelet_pod_worker_duration_seconds_count
kubelet_running_containers
kubelet_running_pod_count
kubelet_running_pods
kubelet_runtime_operations_duration_seconds_bucket
kubelet_runtime_operations_errors_total
kubelet_runtime_operations_total
process_cpu_seconds_total
process_resident_memory_bytes
rest_client_request_duration_seconds_bucket
rest_client_requests_total
storage_operation_duration_seconds_bucket
storage_operation_duration_seconds_count
volume_manager_total_volumes

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Agent Configuration

This integration has no default agent job.

40 - OpenShift Scheduler

Metrics, Dashboards, Alerts and more for OpenShift Scheduler Integration in Sysdig Monitor.
OpenShift Scheduler

This integration is enabled by default.

Versions supported: > v4.7

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 20 metrics.

Timeseries generated: Scheduler generates ~300 timeseries

List of Alerts

Alert | Description | Format
[OpenShift Scheduler] Process Down | Scheduler has disappeared from target discovery. | Prometheus
[OpenShift Scheduler] Failed Attempts to Schedule Pods | Scheduler Failed Attempts to Schedule Pods. | Prometheus
[OpenShift Scheduler] High 4xx Request Error Rate | Scheduler High 4xx Request Error Rate. | Prometheus
[OpenShift Scheduler] High 5xx Request Error Rate | Scheduler High 5xx Request Error Rate. | Prometheus

List of Dashboards

OpenShift v4 Scheduler

The dashboard provides information on the OpenShift Scheduler. OpenShift v4 Scheduler

List of Metrics

Metric name
go_goroutines
rest_client_request_duration_seconds_count
rest_client_request_duration_seconds_sum
rest_client_requests_total
scheduler_e2e_scheduling_duration_seconds_count
scheduler_e2e_scheduling_duration_seconds_sum
scheduler_pending_pods
scheduler_pod_scheduling_attempts_count
scheduler_pod_scheduling_attempts_sum
scheduler_schedule_attempts_total
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes
workqueue_adds_total
workqueue_depth
workqueue_queue_duration_seconds_count
workqueue_queue_duration_seconds_sum
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds_count
workqueue_work_duration_seconds_sum

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

How to monitor OpenShift Scheduler with Sysdig agent

No further installation is needed, since OpenShift 4.X comes with both Prometheus and the Scheduler ready to use. OpenShift Scheduler metrics are exposed through the /federate endpoint.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Scheduler.

Scheduling

Failed attempts to Schedule pods

Unschedulable pods are pods that could not be scheduled. Use this query to check the ratio of failed scheduling attempts:

sum by (kube_cluster_name,kube_pod_name,result) (rate(scheduler_schedule_attempts_total{result!~"scheduled"}[10m])) / ignoring(result) group_left sum by (kube_cluster_name,kube_pod_name)(rate(scheduler_schedule_attempts_total[10m]))

Pending pods

Check that there are no pods in pending queues with this query:

topk(30,rate(scheduler_pending_pods[10m]))

Work Queue

Work Queue Retries

The total number of retries that have been handled by the work queue. This value should be near 0.

topk(30,rate(workqueue_retries_total{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))

Work Queue Latency

Queue latency is the time that tasks spend in the queue before being processed:

topk(30,rate(workqueue_queue_duration_seconds_sum{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))

Work Queue Depth

Check the depth of the queue. High values can indicate saturation of the scheduler.

topk(30,rate(workqueue_depth{container_name=~".*kube-scheduler.*"}[10m]))

Scheduler API Requests

Kube API Requests by code

Check that there are no 5xx or 4xx error codes in the scheduler requests.

sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"5.."}[10m]))

Agent Configuration

This is the default agent job for this integration:

- job_name: openshift-scheduler-default
  honor_labels: true
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"scheduler",__name__=~"scheduler_schedule_attempts_total|scheduler_pod_scheduling_attempts_sum|scheduler_pod_scheduling_attempts_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_e2e_scheduling_duration_seconds_count|scheduler_pending_pods|workqueue_retries_total|workqueue_work_duration_seconds_sum|workqueue_work_duration_seconds_count|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_sum|workqueue_queue_duration_seconds_count|workqueue_depth|workqueue_adds_total|rest_client_requests_total|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count|go_goroutines"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:     
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-0'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
    # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - action: replace
    source_labels: [container]
    target_label: container_name
  - action: replace
    source_labels: [job]
    regex: '(.*)'
    target_label: job
    replacement: 'openshift-$1-default'

41 - OpenShift State Metrics

Metrics, Dashboards, Alerts and more for OpenShift State Metrics Integration in Sysdig Monitor.
OpenShift State Metrics

This integration is enabled by default.

Versions supported: > v4.7

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 4 metrics.

Timeseries generated: 30 timeseries + 4 series per route

List of Alerts

Alert | Description | Format
[OpenShift-state-metrics] CPU Resource Request Quota Usage | Resource request CPU usage is over 90% of the resource quota. | Prometheus
[OpenShift-state-metrics] CPU Resource Limit Quota Usage | Resource limit CPU usage is over 90% of the resource limit quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Request Quota Usage | Resource request memory usage is over 90% of the resource quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Limit Quota Usage | Resource limit memory usage is over 90% of the resource limit quota. | Prometheus
[OpenShift-state-metrics] Routes with issues | A route status is in error and is having issues. | Prometheus
[OpenShift-state-metrics] Build Processes with issues | A build process is in error or failed status. | Prometheus

List of Dashboards

OpenShift v4 State Metrics

The dashboard provides information on the special OpenShift-state-metrics. OpenShift v4 State Metrics

List of Metrics

Metric name
openshift_build_created_timestamp_seconds
openshift_build_status_phase_total
openshift_clusterresourcequota_usage
openshift_route_status

Preparing the Integration

No preparations are required for this integration.

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift State Metrics

No further installation is needed, since OKD4 comes with both Prometheus and openshift-state-metrics (OSM) ready to use.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.

Resource Quotas

Resource Quotas Requests

% CPU Used vs Request Quota

Let’s get the % of CPU used vs the request quota:

sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0

% Memory Used vs Request Quota

Now, the same but for the memory.

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Please note that if your requests are near 100%, you can use the Pod Rightsizing & Workload Capacity Optimization dashboard to adjust them. You can also ask your cluster administrator to review your resource quota. Conversely, if your requests are too low, the resource quota could be rightsized.
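
As a reference point, a threshold check in the spirit of the request-quota alerts listed above (the 90% threshold here is illustrative, not the exact alert definition) could look like this for CPU:

sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0.9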

Resource Quotas Limits

% CPU Used vs Limit Quota

Let’s get the % of CPU used vs the limit quota:

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="hard"}) > 0

% Memory Used vs Limit Quota

Now, the same but for the memory.

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Please note that quota limits are normally higher than quota requests. If your limit usage is too close to 100%, you might face scheduling issues. The Pod Scheduling Troubleshooting dashboard might help you troubleshoot this scenario. Also, if limit usage is too low, the resource quota could be rightsized.
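
Similarly, a hedged sketch of a limit-quota threshold check (again, the 90% value is illustrative) for memory:

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0.9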

Routes

List the Routes

Let’s get a list of all the routes present in the cluster, aggregated by host and namespace:

sum by (route, host, namespace) (openshift_route_info)

Duplicated Routes

Now, let’s find our duplicated routes:

sum by (host) (openshift_route_info) > 1

This query will return the duplicated hosts. If you want the full information for the duplicated routes, try this one:

openshift_route_info * on (host) group_left(host_name) label_replace((sum by (host) (openshift_route_info) > 1), "host_name", "$0", "host", ".+")

Why the label_replace? To get the full information, the openshift_route_info metric has to be joined with itself, but because both sides of the join share the same labels, there is no extra label to join on.

Performing a label_replace creates a new host_name label with the content of the host label, and the join then works.

Routes with Issues

Let’s get the routes with issues (that is, routes with a False status):

openshift_route_status{status="False"} > 0

Builds

New Builds, by Processing Time

Let’s list the new builds by how long they have been processing. This query can be useful to detect slow build processes.

time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)

Builds with Errors

Use this query to get builds that are in failed or error state.

sum by (build, buildconfig, kube_namespace_name, kube_cluster_name) (openshift_build_status_phase_total{build_phase=~"failed|error"}) > 0

Agent Configuration

This is the default agent job for this integration:

- job_name: 'openshift-state-metrics'
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (openshift-state-metrics);(.{0}$)
    replacement: openshift-state-metrics
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "openshift-state-metrics"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:8443
    target_label: __address__
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (openshift_build_created_timestamp_seconds|openshift_build_status_phase_total|openshift_clusterresourcequota_usage|openshift_route_status)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name

42 - PHP-FPM

Metrics, Dashboards, Alerts and more for PHP-FPM Integration in Sysdig Monitor.
PHP-FPM

This integration is enabled by default.

Versions supported: > 7.2

This integration uses a sidecar exporter that is available in UBI or scratch base image.

This integration has 12 metrics.

Timeseries generated: 167 timeseries

List of Alerts

Alert | Description | Format
[Php-Fpm] Percentage of instances low | Less than 75% of instances are up | Prometheus
[Php-Fpm] Recently rebooted | Instances have been recently rebooted | Prometheus
[Php-Fpm] Limit of child processes exceeded | The number of child processes has been exceeded | Prometheus
[Php-Fpm] Reaching limit of queue process | The buffer of queued requests is reaching its limit | Prometheus
[Php-Fpm] Too slow requ