Application Integrations
- 1: Apache
- 2: Calico
- 3: Cassandra
- 4: Ceph
- 5: Consul
- 6: Elasticsearch
- 7: Fluentd
- 8: Go
- 9: HAProxy Ingress
- 10: HAProxy Ingress OpenShift
- 11: Harbor
- 12: Istio
- 13: Istio Envoy
- 14: Kafka
- 15: KEDA
- 16: Kube State Metrics OSS
- 17: Kubernetes
- 18: Kubernetes API server
- 19: Kubernetes controller manager
- 20: Kubernetes CoreDNS
- 21: Kubernetes etcd
- 22: Kubernetes kube-proxy
- 23: Kubernetes kubelet
- 24: Kubernetes PVC
- 25: Kubernetes Scheduler
- 26: Kubernetes storage
- 27: Linux
- 28: Memcached
- 29: MongoDB
- 30: MySQL
- 31: NGINX
- 32: NGINX Ingress
- 33: NTP
- 34: OPA
- 35: OpenShift API-Server
- 36: OpenShift Controller Manager
- 37: OpenShift CoreDNS
- 38: OpenShift Etcd
- 39: OpenShift Kubelet
- 40: OpenShift Scheduler
- 41: OpenShift State Metrics
- 42: PHP-FPM
- 43: Portworx
- 44: PostgreSQL
- 45: RabbitMQ
- 46: Redis
- 47: Sysdig Admission Controller
- 48: Sysdig Monitor
- 49: Windows
1 - Apache
This integration is enabled by default.
Versions supported: 2.4
This integration uses a sidecar exporter that is available in UBI or scratch base image.
This integration has 11 metrics.
Timeseries generated: 100 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Apache] No Instance Up | No instances up | Prometheus |
[Apache] Up Time Less Than One Hour | Instance with UpTime less than one hour | Prometheus |
[Apache] Time Since Last OK Request More Than One Hour | Time since last OK request higher than one hour | Prometheus |
[Apache] High Error Rate | High error rate | Prometheus |
[Apache] High Rate Of Busy Workers In Instance | Low workers in open_slot state | Prometheus |
List of Dashboards
Apache App Overview
The dashboard provides information on the status of the Apache resources.
List of Metrics
Metric name |
---|
apache_accesses_total |
apache_connections |
apache_cpuload |
apache_duration_ms_total |
apache_http_last_request_seconds |
apache_http_response_codes_total |
apache_scoreboard |
apache_sent_kilobytes_total |
apache_up |
apache_uptime_seconds_total |
apache_workers |
Preparing the Integration
Create Grok Configuration
You need to add the Grok configuration in order to parse Apache logs and get metrics from them.
Install It Directly In Your Cluster
helm install -n Your-Application-Namespace apache-exporter --repo https://sysdiglabs.github.io/integrations-charts --set configmap=true
Download and Apply
You can download the file and run the following command:
kubectl -n Your-Application-Namespace apply -f grok-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grok-config
data:
config.yml: |
global:
config_version: 3
input:
type: file
path: /tmp/logs/accesss.log
fail_on_missing_logfile: false
readall: true
imports:
- type: grok_patterns
dir: ./patterns
metrics:
- type: counter
name: apache_http_response_codes_total
help: HTTP requests to Apache
match: '%{COMMONAPACHELOG}'
labels:
code: '{{.response}}'
method: '{{.verb}}'
- type: gauge
name: apache_http_response_bytes_total
help: Size of HTTP responses
match: '%{COMMONAPACHELOG}'
value: '{{.bytes}}'
cumulative: true
labels:
code: '{{.response}}'
method: '{{.verb}}'
- type: gauge
name: apache_http_last_request_seconds
help: Timestamp of the last HTTP request
match: '%{COMMONAPACHELOG}'
value: '{{timestamp "02/Jan/2006:15:04:05 -0700" .timestamp}}'
labels:
code: '{{.response}}'
method: '{{.verb}}'
server:
protocol: http
Check Apache Configuration
Apache provides metrics in its own format via its mod_status module. To enable this module, include (or uncomment) the following lines in your Apache configuration file:
LoadModule status_module modules/mod_status.so
<Location "/server-status">
SetHandler server-status
</Location>
To configure Apache server to produce common logs, include (or uncomment) the following in your Apache configuration file:
<IfModule log_config_module>
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /usr/local/apache2/logs/accesss.log common
</IfModule>
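Once the access log is being written in this format and parsed by the Grok exporter, the metrics defined in the ConfigMap above become available for queries. For example, a minimal sketch of the Time Since Last OK Request condition from the alert list, assuming responses with a 2xx code count as OK:
# [Apache] Time since the last OK request higher than one hour (sketch)
time() - max(apache_http_last_request_seconds{code=~"2.."}) > 3600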
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporter with this Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/apache-exporter
Monitoring and Troubleshooting Apache
This document describes important metrics and queries that you can use to monitor and troubleshoot Apache.
Tracking metrics status
You can track the status of Apache metrics with the following alert: Exporter process is not serving metrics
# [Apache] Exporter Process Down
absent(apache_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
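You can also keep an eye on the ratio of error responses. This is a sketch of the High Error Rate condition built from the apache_http_response_codes_total metric generated by the Grok exporter; the 15% threshold is only an example:
# [Apache] High error rate (sketch)
sum(rate(apache_http_response_codes_total{code=~"5.."}[5m])) / sum(rate(apache_http_response_codes_total[5m])) > 0.15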
Agent Configuration
These are the default agent jobs for this integration:
- job_name: apache-exporter-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "apache"
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:9117
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
- job_name: apache-grok-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "apache"
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:9144
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
2 - Calico
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
Versions supported: 3.23.3
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 22 metrics.
Timeseries generated: 838 Timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Calico-Node] Dataplane Updates Are Failing and Retrying | The update actions for dataplane are failing and retrying several times | Prometheus |
[Calico-Node] IP Set Command Failures | Encountered a number of ipset command failures | Prometheus |
[Calico-Node] IP Tables Restore Failures | Encountered a number of iptables restore failures | Prometheus |
[Calico-Node] IP Tables Save Failures | Encountered a number of iptables save failures | Prometheus |
[Calico-Node] Errors While Logging | Encountered a number of errors while logging | Prometheus |
[Calico-Node] Latency Increase in Datastore OnUpdate Call | The duration of datastore OnUpdate calls are increasing | Prometheus |
[Calico-Node] Latency Increase in Dataplane Update | Increased response time for dataplane updates | Prometheus |
[Calico-Node] Latency Increase in Acquire Iptables Lock | Increased response time acquiring the iptables lock | Prometheus |
[Calico-Node] Latency Increase While Listing All the Interfaces during a Resync | Increased response time for interface listing during a resync | Prometheus |
[Calico-Node] Latency Increase in Interface Resync | Increased response time for interface resync | Prometheus |
[Calico-Node] Fork/Exec Child Processes Results in High Latency | Increased response time for Fork/Exec child processes | Prometheus |
List of Dashboards
Calico
The dashboard provides information on the Calico integration.
List of Metrics
Metric name |
---|
felix_calc_graph_update_time_seconds |
felix_cluster_num_hosts |
felix_cluster_num_policies |
felix_cluster_num_profiles |
felix_exec_time_micros |
felix_int_dataplane_addr_msg_batch_size |
felix_int_dataplane_apply_time_seconds |
felix_int_dataplane_failures |
felix_int_dataplane_iface_msg_batch_size |
felix_int_dataplane_msg_batch_size |
felix_ipset_calls |
felix_ipset_errors |
felix_ipset_lines_executed |
felix_iptables_lines_executed |
felix_iptables_lock_acquire_secs |
felix_iptables_restore_calls |
felix_iptables_restore_errors |
felix_iptables_save_calls |
felix_iptables_save_errors |
felix_log_errors |
felix_route_table_list_seconds |
felix_route_table_per_iface_sync_seconds |
Preparing the Integration
Enable Calico Prometheus Metrics
Calico can expose Prometheus metrics natively; however, this option is not always enabled.
You can use the following command to turn Prometheus metrics on:
kubectl patch felixconfiguration default --type merge --patch '{"spec":{"prometheusMetricsEnabled": true}}'
You should see output similar to the following:
felixconfiguration.projectcalico.org/default patched
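Once the module is enabled and the agent jobs below are scraping the calico-node pods, you can verify that Felix metrics are being collected with a simple query. For example, a sketch using one of the metrics listed above:
# Felix metrics are being scraped
count(felix_cluster_num_hosts) > 0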
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Calico
Here are some interesting metrics and queries to monitor and troubleshoot Calico.
About the Calico User
Hosts
A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.
A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.
A host endpoint for one specific real interface is configured by interfaceName, for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.
Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.
Profiles
Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.
Each Calico endpoint or host endpoint can be assigned to zero or more profiles.
Policies
If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.
Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift left DevOps environment.
Errors
Dataplane Updates Failures and Retries
The dataplane is the foundation of Calico’s operation. Calico supports three dataplane types (Linux eBPF, standard Linux, and Windows HNS). The dataplane is responsible for Calico’s core capabilities: base networking, network policy, and IP address management. Being aware of possible dataplane errors is therefore key to monitoring Calico.
rate(felix_int_dataplane_failures[5m])
Ipset Command Failures
IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.
For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.
For that reason, you need to monitor failures of this kind of command in Calico.
rate(felix_ipset_errors[5m])
Iptables Save Failures and Iptables Restore Failures
The actual iptables rules are created and customized on the command line with the iptables command for IPv4 and the ip6tables command for IPv6.
For IPv4, these rules can be saved to a file with the iptables-save command:
Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables
These files can be loaded again with the iptables-restore command:
Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables
This is essentially the core of what Calico does, so monitoring failures of these features is very important.
rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])
Latency
The most useful way to report on latency is to alert on quantiles.
Calico metrics do not provide histogram buckets; instead, they summarize this information with specific labels. For latency metrics, Calico provides the quantile labels 0.5, 0.9, and 0.99.
Latency in Datastore OnUpdate Call
# Latency of datastore OnUpdate calls (calculation graph update time)
felix_calc_graph_update_time_seconds{quantile="0.99"}
# Latency of dataplane updates
felix_int_dataplane_apply_time_seconds{quantile="0.99"}
# Latency acquiring the iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}
Saturation
The way to monitor saturation in Calico is through batch sizes. There are three kinds of batches, which can also be analyzed by quantiles.
# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}
# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}
# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}
Traffic
Traffic is one of the four golden signals to monitor. In the case of Calico, you need to monitor the core network requests.
The ipset and iptables commands are the lowest-level interactions in Calico; to generate that traffic, Calico needs to create, destroy, and update network policies.
# Number of ipset commands executed.
rate(felix_ipset_calls[5m])
# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])
# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])
# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])
# Number of iptables-save calls.
rate(felix_iptables_save_calls[$__interval])
Agent Configuration
These are the default agent jobs for this integration:
- job_name: 'calico-node-default'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (calico-node);(.{0}$)
replacement: calico
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "calico"
- action: replace
source_labels: [__address__]
regex: ([^:]+)(?::\d+)?
replacement: $1:9091
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (felix_calc_graph_update_time_seconds|felix_cluster_num_hosts|felix_cluster_num_policies|felix_cluster_num_profiles|felix_exec_time_micros|felix_int_dataplane_addr_msg_batch_size|felix_int_dataplane_apply_time_seconds|felix_int_dataplane_failures|felix_int_dataplane_iface_msg_batch_size|felix_int_dataplane_msg_batch_size|felix_ipset_calls|felix_ipset_errors|felix_ipset_lines_executed|felix_iptables_lines_executed|felix_iptables_lock_acquire_secs|felix_iptables_restore_calls|felix_iptables_restore_errors|felix_iptables_save_calls|felix_iptables_save_errors|felix_log_errors|felix_route_table_list_seconds|felix_route_table_per_iface_sync_seconds)
action: keep
- job_name: 'calico-controller-default'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
separator: ;
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (calico-kube-controllers);(.{0}$)
replacement: calico-controller
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "calico-controller"
- action: replace
source_labels: [__address__]
regex: ([^:]+)(?::\d+)?
replacement: $1:9094
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
3 - Cassandra
This integration is enabled by default.
Versions supported: > v3.x
This integration uses a sidecar exporter that is available in UBI or scratch base image.
This integration has 30 metrics.
Timeseries generated: The JMX exporter generates ~850 timeseries (depending on the number of keyspaces and tables).
List of Alerts
Alert | Description | Format |
---|---|---|
[Cassandra] Compaction Task Pending | There are many Cassandra compaction tasks pending. | Prometheus |
[Cassandra] Commitlog Pending Tasks | There are many Cassandra Commitlog tasks pending. | Prometheus |
[Cassandra] Compaction Executor Blocked Tasks | There are many Cassandra compaction executor blocked tasks. | Prometheus |
[Cassandra] Flush Writer Blocked Tasks | There are many Cassandra flush writer blocked tasks. | Prometheus |
[Cassandra] Storage Exceptions | There are storage exceptions in Cassandra node. | Prometheus |
[Cassandra] High Tombstones Scanned | There is a high number of tombstones scanned. | Prometheus |
[Cassandra] JVM Heap Memory | High JVM Heap Memory. | Prometheus |
List of Dashboards
Cassandra
The dashboard provides information on the status of Cassandra.
List of Metrics
Metric name |
---|
cassandra_bufferpool_misses_total |
cassandra_bufferpool_size_total |
cassandra_client_connected_clients |
cassandra_client_request_read_latency |
cassandra_client_request_read_timeouts |
cassandra_client_request_read_unavailables |
cassandra_client_request_write_latency |
cassandra_client_request_write_timeouts |
cassandra_client_request_write_unavailables |
cassandra_commitlog_completed_tasks |
cassandra_commitlog_pending_tasks |
cassandra_commitlog_total_size |
cassandra_compaction_compacted_bytes_total |
cassandra_compaction_completed_tasks |
cassandra_compaction_pending_tasks |
cassandra_cql_prepared_statements_executed_total |
cassandra_cql_regular_statements_executed_total |
cassandra_dropped_messages_mutation |
cassandra_dropped_messages_read |
cassandra_jvm_gc_collection_count |
cassandra_jvm_gc_duration_seconds |
cassandra_jvm_memory_usage_max_bytes |
cassandra_jvm_memory_usage_used_bytes |
cassandra_storage_internal_exceptions_total |
cassandra_storage_load_bytes_total |
cassandra_table_read_requests_per_second |
cassandra_table_tombstoned_scanned |
cassandra_table_total_disk_space_used |
cassandra_table_write_requests_per_second |
cassandra_threadpool_blocked_tasks_total |
Preparing the Integration
Create ConfigMap for the JMX-Exporter
The JMX exporter requires a ConfigMap with the Cassandra JMX configuration, which can be installed with a single command. The following example is for a Cassandra cluster that exposes JMX port 7199 and is deployed in the ‘cassandra’ namespace (modify the JMX port and the namespace as needed):
helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts
helm repo update
helm -n cassandra install cassandra-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=7199 --set integrationType=cassandra --set onlyCreateJMXConfigMap=true
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporter with this Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/jmx-exporter
Monitoring and Troubleshooting Cassandra
Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.
General Stats
Node Down
Compare the expected number of nodes with the actual number of nodes up and running. If the numbers differ, there might be a problem.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0
Dropped Messages
Dropped Messages Mutation
If there are dropped mutation messages then we probably have write/read failures due to timeouts.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)
Dropped Messages Read
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)
Buffer Pool
Buffer Pool Size
This buffer is allocated off-heap, in addition to the memory allocated for the heap. Memory is allocated when needed. Check whether the miss rate is high.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)
Buffer Pool Misses
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)
CQL Statements
CQL Prepared Statements
Use prepared statements (query with bound variables) as they are more secure and can be cached.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))
CQL Regular Statements
This value should be as low as possible if you are looking for good performance.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
Connected Clients
The number of current client connections in each node.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)
Client Request Latency
Write Latency
95th percentile client request write latency.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})
Read Latency
95th percentile client request read latency.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})
Unavailable Exceptions
Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.
Read Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))
Write Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))
Client Request Timeouts
Write / read request timeouts in Cassandra nodes. If there are timeouts, check the following:
1. The ‘read_request_timeout_in_ms’ value in cassandra.yaml, in case it is too low.
2. Tombstones, which can degrade performance. You can use the tombstones query below:
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
Client Request Read Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))
Client Request Write Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))
Threadpool Blocked Tasks
Compaction Blocked Tasks
Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))
Flush Writer Blocked Tasks
The flush writer defines the number of parallel flush writes to disk. This value should be near 0. Check that your "memtable_flush_writers" value matches your number of cores if you are using SSD disks.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))
Compactions
Pending Compactions
Compactions that are queued. This value should be as low as possible. If it reaches more than 50 you can start having CPU and Memory pressure.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)
Total Size Compacted
Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))
Commit Log
Commit Log Pending Tasks
This value should be under 15-20 for performance purposes.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)
Storage
Storage Exceptions
Look carefully at this value as any storage error over 0 is critical for Cassandra.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)
JVM and GC
JVM Heap Usage
If you want to tune your Heap memory you can use this query.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})
If you want to know the maximum heap memory you can use this query.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})
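Combining the two queries above, a sketch of the heap usage ratio behind the JVM Heap Memory alert could look like this (the 80% threshold is only an example):
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})
/
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"}) > 0.8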
JVM NonHeap Usage
Use this query for NonHeap memory.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})
GC Info
If there is memory pressure the max GC duration will start increasing.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)
Keyspaces and Tables
Keyspace Size
This query gives you information about all keyspaces.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_table_total_disk_space_used)
Table Size
This query gives you information about all tables:
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_total_disk_space_used)
Table Highest Increase Size
Very useful to know what tables are growing too fast.
topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))
Tombstones Scanned
Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.
A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
Agent Configuration
This is the default agent job for this integration:
- job_name: 'cassandra-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (cassandra-exporter);(.{0}$)
replacement: cassandra
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "cassandra"
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (cassandra_bufferpool_misses_total|cassandra_bufferpool_size_total|cassandra_client_connected_clients|cassandra_client_request_read_latency|cassandra_client_request_read_timeouts|cassandra_client_request_read_unavailables|cassandra_client_request_write_latency|cassandra_client_request_write_timeouts|cassandra_client_request_write_unavailables|cassandra_commitlog_completed_tasks|cassandra_commitlog_pending_tasks|cassandra_commitlog_total_size|cassandra_compaction_compacted_bytes_total|cassandra_compaction_completed_tasks|cassandra_compaction_pending_tasks|cassandra_cql_prepared_statements_executed_total|cassandra_cql_regular_statements_executed_total|cassandra_dropped_messages_mutation|cassandra_dropped_messages_read|cassandra_jvm_gc_collection_count|cassandra_jvm_gc_duration_seconds|cassandra_jvm_memory_usage_max_bytes|cassandra_jvm_memory_usage_used_bytes|cassandra_storage_internal_exceptions_total|cassandra_storage_load_bytes_total|cassandra_table_read_requests_per_second|cassandra_table_tombstoned_scanned|cassandra_table_total_disk_space_used|cassandra_table_write_requests_per_second|cassandra_threadpool_blocked_tasks_total)
action: keep
4 - Ceph
This integration is enabled by default.
Versions supported: > v15.2.12
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 24 metrics.
Timeseries generated: 600 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Ceph] Ceph Manager is absent | Ceph Manager has disappeared from Prometheus target discovery. | Prometheus |
[Ceph] Ceph Manager is missing replicas | Ceph Manager is missing replicas. | Prometheus |
[Ceph] Ceph quorum at risk | Storage cluster quorum is low. Contact Support. | Prometheus |
[Ceph] High number of leader changes | Ceph Monitor has seen a lot of leader changes per minute recently. | Prometheus |
List of Dashboards
Ceph
The dashboard provides information on the status, capacity, latency and throughput of Ceph.
List of Metrics
Metric name |
---|
ceph_cluster_total_bytes |
ceph_cluster_total_used_bytes |
ceph_health_status |
ceph_mgr_status |
ceph_mon_metadata |
ceph_mon_num_elections |
ceph_mon_quorum_status |
ceph_osd_apply_latency_ms |
ceph_osd_commit_latency_ms |
ceph_osd_in |
ceph_osd_metadata |
ceph_osd_numpg |
ceph_osd_op_r |
ceph_osd_op_r_latency_count |
ceph_osd_op_r_latency_sum |
ceph_osd_op_r_out_bytes |
ceph_osd_op_w |
ceph_osd_op_w_in_bytes |
ceph_osd_op_w_latency_count |
ceph_osd_op_w_latency_sum |
ceph_osd_recovery_bytes |
ceph_osd_recovery_ops |
ceph_osd_up |
ceph_pool_max_avail |
Preparing the Integration
Enable Prometheus Module
Ceph exposes Prometheus metrics natively and annotates the manager pod with Prometheus annotations.
Make sure that the Prometheus module is activated in the Ceph cluster by running the following command:
ceph mgr module enable prometheus
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Ceph
This document describes important metrics and queries that you can use to monitor and troubleshoot Ceph.
Tracking metrics status
You can track the status of Ceph metrics with the following alert: Exporter process is not serving metrics
# [Ceph] Exporter Process Down
absent(ceph_health_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
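Beyond checking that the exporter is up, the capacity metrics listed above can be combined to watch overall cluster usage. For example, a sketch of the fraction of total cluster capacity in use:
# Fraction of total cluster capacity in use (sketch)
ceph_cluster_total_used_bytes / ceph_cluster_total_bytes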
Agent Configuration
This is the default agent job for this integration:
- job_name: ceph-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_prometheus_io_port
regex: mgr;9283
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
5 - Consul
This integration is enabled by default.
Versions supported: > 1.11.1
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 64 metrics.
Timeseries generated: 1800 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Consul] KV Store update time anomaly | KV Store update time anomaly | Prometheus |
[Consul] Transaction time anomaly | Transaction time anomaly | Prometheus |
[Consul] Raft transactions count anomaly | Raft transactions count anomaly | Prometheus |
[Consul] Raft commit time anomaly | Raft commit time anomaly | Prometheus |
[Consul] Leader time to contact followers too high | Leader time to contact followers too high | Prometheus |
[Consul] Flapping leadership | Flapping leadership | Prometheus |
[Consul] Too many elections | Too many elections | Prometheus |
[Consul] Server cluster unhealthy | Server cluster unhealthy | Prometheus |
[Consul] Zero failure tolerance | Zero failure tolerance | Prometheus |
[Consul] Client RPC requests anomaly | Consul client RPC requests anomaly | Prometheus |
[Consul] Client RPC requests rate limit exceeded | Consul client RPC requests rate limit exceeded | Prometheus |
[Consul] Client RPC requests failed | Consul client RPC requests failed | Prometheus |
[Consul] License Expiry | Consul License Expiry | Prometheus |
[Consul] Garbage Collection pause high | Consul Garbage Collection pause high | Prometheus |
[Consul] Garbage Collection pause too high | Consul Garbage Collection pause too high | Prometheus |
[Consul] Raft restore duration too high | Consul Raft restore duration too high | Prometheus |
[Consul] RPC requests error rate is high | Consul RPC requests error rate is high | Prometheus |
[Consul] Cache hit rate is low | Consul Cache hit rate is low | Prometheus |
[Consul] High 4xx RequestError Rate | High 4xx RequestError Rate | Prometheus |
[Consul] High Request Latency | Envoy High Request Latency | Prometheus |
[Consul] High Response Latency | Envoy High Response Latency | Prometheus |
[Consul] Certificate close to expire | Certificate close to expire | Prometheus |
List of Dashboards
Consul
The dashboard provides information on the status and latency of Consul.
Consul Envoy
The dashboard provides information on the Consul Envoy proxies.
List of Metrics
Metric name |
---|
consul_autopilot_failure_tolerance |
consul_autopilot_healthy |
consul_client_rpc |
consul_client_rpc_exceeded |
consul_client_rpc_failed |
consul_consul_cache_bypass |
consul_consul_cache_entries_count |
consul_consul_cache_evict_expired |
consul_consul_cache_fetch_error |
consul_consul_cache_fetch_success |
consul_kvs_apply_sum |
consul_raft_apply |
consul_raft_commitTime_sum |
consul_raft_fsm_lastRestoreDuration |
consul_raft_leader_lastContact |
consul_raft_leader_oldestLogAge |
consul_raft_rpc_installSnapshot |
consul_raft_state_candidate |
consul_raft_state_leader |
consul_rpc_cross_dc |
consul_rpc_queries_blocking |
consul_rpc_query |
consul_rpc_request |
consul_rpc_request_error |
consul_runtime_gc_pause_ns |
consul_runtime_gc_pause_ns_sum |
consul_system_licenseExpiration |
consul_txn_apply_sum |
envoy_cluster_membership_change |
envoy_cluster_membership_healthy |
envoy_cluster_membership_total |
envoy_cluster_upstream_cx_active |
envoy_cluster_upstream_cx_connect_ms_bucket |
envoy_cluster_upstream_rq_active |
envoy_cluster_upstream_rq_pending_active |
envoy_cluster_upstream_rq_time_bucket |
envoy_cluster_upstream_rq_xx |
envoy_server_days_until_first_cert_expiring |
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
Preparing the Integration
Enable Prometheus Metrics and Disable Hostname in Metrics
As described in the Consul documentation pages Helm Global Metrics and Prometheus Retention Time, to make Consul expose an endpoint for scraping metrics, you need to enable a few global.metrics configurations. You also need to set telemetry.disable_hostname through the “extra configurations” of the Consul server and client, so that the metrics do not contain the instance names.
If you install Consul with Helm, you need to use the following flags:
--set 'global.metrics.enabled=true'
--set 'global.metrics.enableAgentMetrics=true'
--set 'server.extraConfig="{"telemetry": {"disable_hostname": true}}"'
--set 'client.extraConfig="{"telemetry": {"disable_hostname": true}}"'
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Consul
This document describes important metrics and queries that you can use to monitor and troubleshoot Consul.
Tracking metrics status
You can track the status of Consul metrics with the following alerts: Exporter process is not serving metrics
# [Consul] Exporter Process Down
absent(consul_autopilot_healthy{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Exporter process is not serving metrics
# [Consul] Exporter Process Down
absent(envoy_cluster_upstream_cx_active{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
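In addition, the autopilot metrics listed above give a quick view of cluster health. These are sketches of the Server cluster unhealthy and Zero failure tolerance conditions:
# [Consul] Server cluster unhealthy (sketch)
consul_autopilot_healthy == 0
# [Consul] Zero failure tolerance (sketch)
consul_autopilot_failure_tolerance == 0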
Agent Configuration
These are the default agent jobs for this integration:
- job_name: 'consul-server-default'
metrics_path: '/v1/agent/metrics'
params:
format: ['prometheus']
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (consul);(.{0}$)
replacement: consul
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "consul"
- action: keep
source_labels: [__address__]
regex: (.*:8500)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
- job_name: 'consul-envoy-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (envoy-sidecar);(.{0}$)
replacement: consul
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "consul"
- action: replace
source_labels: [__address__]
regex: (.+?)(\\:\\d)?
replacement: $1:20200
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (envoy_cluster_upstream_cx_active|envoy_cluster_upstream_rq_active|envoy_cluster_upstream_rq_pending_active|envoy_cluster_membership_total|envoy_cluster_membership_healthy|envoy_cluster_membership_change|envoy_cluster_upstream_rq_xx|envoy_cluster_upstream_cx_connect_ms_bucket|envoy_server_days_until_first_cert_expiring|envoy_cluster_upstream_rq_time_bucket)
action: keep
6 - Elasticsearch
This integration is enabled by default.
Versions supported: > v6.8
This integration uses a standalone exporter that is available in UBI or scratch base image.
This integration has 28 metrics.
Timeseries generated: 400 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Elasticsearch] Heap Usage Too High | The heap usage is over 90% | Prometheus |
[Elasticsearch] Heap Usage Warning | The heap usage is over 80% | Prometheus |
[Elasticsearch] Disk Space Low | Disk available less than 20% | Prometheus |
[Elasticsearch] Disk Out Of Space | Disk available less than 10% | Prometheus |
[Elasticsearch] Cluster Red | Cluster in Red status | Prometheus |
[Elasticsearch] Cluster Yellow | Cluster in Yellow status | Prometheus |
[Elasticsearch] Relocation Shards | Relocating shards for too long | Prometheus |
[Elasticsearch] Initializing Shards | Initializing shards takes too long | Prometheus |
[Elasticsearch] Unassigned Shards | Unassigned shards for long time | Prometheus |
[Elasticsearch] Pending Tasks | Elasticsearch has a high number of pending tasks | Prometheus |
[Elasticsearch] No New Documents | Elasticsearch has no new documents for a period of time | Prometheus |
List of Dashboards
ElasticSearch Cluster
The dashboard provides information on the status of the ElasticSearch cluster health and its usage of resources.
ElasticSearch Infra
The dashboard provides information on the usage of CPU, memory, disk and networking of ElasticSearch.
List of Metrics
Metric name |
---|
elasticsearch_cluster_health_active_primary_shards |
elasticsearch_cluster_health_active_shards |
elasticsearch_cluster_health_initializing_shards |
elasticsearch_cluster_health_number_of_data_nodes |
elasticsearch_cluster_health_number_of_nodes |
elasticsearch_cluster_health_number_of_pending_tasks |
elasticsearch_cluster_health_relocating_shards |
elasticsearch_cluster_health_status |
elasticsearch_cluster_health_unassigned_shards |
elasticsearch_filesystem_data_available_bytes |
elasticsearch_filesystem_data_size_bytes |
elasticsearch_indices_docs |
elasticsearch_indices_indexing_index_time_seconds_total |
elasticsearch_indices_indexing_index_total |
elasticsearch_indices_merges_total_time_seconds_total |
elasticsearch_indices_search_query_time_seconds |
elasticsearch_indices_store_throttle_time_seconds_total |
elasticsearch_jvm_gc_collection_seconds_count |
elasticsearch_jvm_gc_collection_seconds_sum |
elasticsearch_jvm_memory_committed_bytes |
elasticsearch_jvm_memory_max_bytes |
elasticsearch_jvm_memory_used_bytes |
elasticsearch_os_load1 |
elasticsearch_os_load15 |
elasticsearch_os_load5 |
elasticsearch_process_cpu_percent |
elasticsearch_transport_rx_size_bytes_total |
elasticsearch_transport_tx_size_bytes_total |
Preparing the Integration
Create the Secrets
Keep in mind:
- If your Elasticsearch cluster uses basic authentication, the secret that contains the URL must include the user name and password.
- The secrets need to be created in the same namespace where the exporter will be deployed.
- Use the same user name and password that you used for the API.
- You can change the name of the secret. If you do this, you will need to select it in the next steps of the integration.
Create the Secret for the URL
Without Authentication
kubectl -n Your-Application-Namespace create secret generic elastic-url-secret \
--from-literal=url='http://SERVICE:PORT'
With Basic Auth
kubectl -n Your-Application-Namespace create secret generic elastic-url-secret \
--from-literal=url='https://USERNAME:PASSWORD@SERVICE:PORT'
NOTE: You can use either http or https in the URL.
Create the Secret for the TLS Certs
If you are using HTTPS with custom certificates, follow the instructions given below.
kubectl create -n Your-Application-Namespace secret generic elastic-tls-secret \
--from-file=root-ca.crt=/path/to/tls/ca-cert \
--from-file=root-ca.key=/path/to/tls/ca-key \
--from-file=root-ca.pem=/path/to/tls/ca-pem
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporter with this Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/elasticsearch-exporter
Monitoring and Troubleshooting Elasticsearch
This document describes important metrics and queries that you can use to monitor and troubleshoot Elasticsearch.
Tracking metrics status
You can track the status of Elasticsearch metrics with the following alerts: Exporter process is not serving metrics
# [Elasticsearch] Exporter Process Down
absent(elasticsearch_cluster_health_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Exporter process is not serving metrics
# [Elasticsearch] Exporter Process Down
absent(elasticsearch_process_cpu_percent{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
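You can also watch disk headroom on the data nodes, which is what the Disk Space Low alert is based on. This is a sketch of that condition; the 20% threshold is only an example:
# [Elasticsearch] Disk available less than 20% (sketch)
elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.2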
Agent Configuration
This is the default agent job for this integration:
- job_name: elasticsearch-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "elasticsearch"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (elasticsearch_cluster_health_active_primary_shards|elasticsearch_cluster_health_active_shards|elasticsearch_cluster_health_initializing_shards|elasticsearch_cluster_health_number_of_data_nodes|elasticsearch_cluster_health_number_of_nodes|elasticsearch_cluster_health_number_of_pending_tasks|elasticsearch_cluster_health_relocating_shards|elasticsearch_cluster_health_status|elasticsearch_cluster_health_unassigned_shards|elasticsearch_filesystem_data_available_bytes|elasticsearch_filesystem_data_size_bytes|elasticsearch_indices_docs|elasticsearch_indices_indexing_index_time_seconds_total|elasticsearch_indices_indexing_index_total|elasticsearch_indices_merges_total_time_seconds_total|elasticsearch_indices_search_query_time_seconds|elasticsearch_indices_store_throttle_time_seconds_total|elasticsearch_jvm_gc_collection_seconds_count|elasticsearch_jvm_gc_collection_seconds_sum|elasticsearch_jvm_memory_committed_bytes|elasticsearch_jvm_memory_max_bytes|elasticsearch_jvm_memory_pool_peak_used_bytes|elasticsearch_jvm_memory_used_bytes|elasticsearch_os_load1|elasticsearch_os_load15|elasticsearch_os_load5|elasticsearch_process_cpu_percent|elasticsearch_transport_rx_size_bytes_total|elasticsearch_transport_tx_size_bytes_total)
action: keep
7 - Fluentd
This integration is enabled by default.
Versions supported: > v1.12.4
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 12 metrics.
Timeseries generated: 640 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Fluentd] No Input From Container | No Input From Container. This alert does not work in OpenShift. | Prometheus |
[Fluentd] High Error Ratio | High Error Ratio. | Prometheus |
[Fluentd] High Retry Ratio | High Retry Ratio. | Prometheus |
[Fluentd] High Retry Wait | High Retry Wait. | Prometheus |
[Fluentd] Low Buffer Available Space | Low Buffer Available Space. | Prometheus |
[Fluentd] Buffer Queue Length Increasing | Buffer Queue Length Increasing. | Prometheus |
[Fluentd] Buffer Total Bytes Increasing | Buffer Total Bytes Increasing. | Prometheus |
[Fluentd] High Slow Flush Ratio | High Slow Flush Ratio. | Prometheus |
[Fluentd] No Output Records From Plugin | No Output Records From Plugin. | Prometheus |
List of Dashboards
Fluentd
The dashboard provides information on the status of Fluentd.
List of Metrics
Metric name |
---|
fluentd_input_status_num_records_total |
fluentd_output_status_buffer_available_space_ratio |
fluentd_output_status_buffer_queue_length |
fluentd_output_status_buffer_total_bytes |
fluentd_output_status_emit_count |
fluentd_output_status_emit_records |
fluentd_output_status_flush_time_count |
fluentd_output_status_num_errors |
fluentd_output_status_retry_count |
fluentd_output_status_retry_wait |
fluentd_output_status_rollback_count |
fluentd_output_status_slow_flush_count |
Preparing the Integration
OpenShift
If you have installed Fluentd using the OpenShift Logging Operator, no further action is required to enable monitoring.
Kubernetes
Enable Prometheus Metrics
For Fluentd to expose Prometheus metrics, enable the following plugins:
- ‘prometheus’ input plugin
- ‘prometheus_monitor’ input plugin
- ‘prometheus_output_monitor’ input plugin
As seen in the official plugin documentation, you can enable them with the following configurations:
<source>
@type prometheus
@id in_prometheus
bind "0.0.0.0"
port 24231
metrics_path "/metrics"
</source>
<source>
@type prometheus_monitor
@id in_prometheus_monitor
</source>
<source>
@type prometheus_output_monitor
@id in_prometheus_output_monitor
</source>
If you are deploying Fluentd using the official Helm chart, it already has these plugins enabled by default in its configuration, so no additional actions are needed.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Fluentd
This document describes important metrics and queries that you can use to monitor and troubleshoot Fluentd.
Tracking metrics status
You can track the status of Fluentd metrics with the following alert: Exporter process is not serving metrics
# [Fluentd] Exporter Process Down
absent(fluentd_output_status_buffer_available_space_ratio{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
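You can also track the output error ratio, which is what the High Error Ratio alert is based on. This is a sketch of that condition; the 10% threshold is only an example:
# [Fluentd] High error ratio (sketch)
rate(fluentd_output_status_num_errors[5m]) / rate(fluentd_output_status_emit_count[5m]) > 0.1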
Agent Configuration
These are the default agent jobs for this integration:
- job_name: 'fluentd-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (fluentd);(.{0}$)
replacement: fluentd
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "fluentd"
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- action: replace
source_labels:
- __name__
- tag
regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
target_label: input_pod
replacement: $1
- action: replace
source_labels:
- __name__
- tag
regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
target_label: input_namespace
replacement: $2
- action: replace
source_labels:
- __name__
- tag
regex: fluentd_input_status_num_records_total;kubernetes.var.log.containers.([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)_([a-zA-Z0-9 \d\.-]+)-[a-zA-Z0-9]+.log
target_label: input_container
replacement: $3
- job_name: openshift-fluentd-default
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (collector);(.{0}$)
replacement: collector
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "collector"
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (fluentd_output_status_buffer_available_space_ratio|fluentd_output_status_buffer_queue_length|fluentd_output_status_buffer_total_bytes|fluentd_output_status_emit_count|fluentd_output_status_emit_records|fluentd_output_status_flush_time_count|fluentd_output_status_num_errors|fluentd_output_status_retry_count|fluentd_output_status_retry_wait|fluentd_output_status_rollback_count|fluentd_output_status_slow_flush_count)
action: keep
8 - Go
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration has 26 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Go] Slow Garbage Collector | Garbage collector took too long. | Prometheus |
[Go] Few Free File Descriptors | Few free file descriptors. | Prometheus |
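As a reference, these are sketches of the two alert conditions above, built from the metrics listed below; the thresholds are only examples:
# [Go] Slow garbage collector (sketch): worst-case GC pause above 100ms
go_gc_duration_seconds{quantile="1"} > 0.1
# [Go] Few free file descriptors (sketch): more than 90% of descriptors in use
process_open_fds / process_max_fds > 0.9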
List of Dashboards
Go Internals
The dashboard provides information on the Go integration.
List of Metrics
Metric name |
---|
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This integration has no default agent job.
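Since there is no default agent job, Go application metrics are typically collected through standard Prometheus pod annotations. The following manifest is a minimal sketch, not an official configuration: it assumes your Go service already exposes Prometheus metrics (for example via promhttp) on port 8080 at /metrics and that annotation-based scraping is enabled in your agent; the workload name and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-go-app                        # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-go-app
  template:
    metadata:
      labels:
        app: my-go-app
      annotations:
        prometheus.io/scrape: "true"     # mark the pod for annotation-based scraping
        prometheus.io/port: "8080"       # port where the Go process serves metrics
        prometheus.io/path: "/metrics"   # default promhttp path
    spec:
      containers:
        - name: my-go-app
          image: registry.example.com/my-go-app:latest   # hypothetical image
          ports:
            - containerPort: 8080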
9 - HAProxy Ingress
This integration is enabled by default.
Versions supported: > v0.13
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 31 metrics.
Timeseries generated: 150x number of ingress pods, 50x number of ingress pods x ingress resources
List of Alerts
Alert | Description | Format |
---|---|---|
[Haproxy-Ingress] Uptime less than 1 hour | This alert detects when all of the instances of the ingress controller have an uptime of less than 1 hour. | Prometheus |
[Haproxy-Ingress] Frontend Down | This alert detects when a frontend has all of its instances down for more than 10 minutes. | Prometheus |
[Haproxy-Ingress] Backend Down | This alert detects when a backend has all of its instances down for more than 10 minutes. | Prometheus |
[Haproxy-Ingress] High Sessions Usage | This alert triggers when the backend sessions exceed 85% of the session capacity for 10 minutes. | Prometheus |
[Haproxy-Ingress] High Error Rate | This alert triggers when there is an error rate over 15% for over 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Request Denied Rate | This alert detects when the denied request rate is over 10% for more than 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Response Denied Rate | This alert detects when the denied response rate is over 10% for more than 10 minutes in a proxy. | Prometheus |
[Haproxy-Ingress] High Response Rate | This alert triggers when a proxy has a mean response time higher than 250ms for over 10 minutes. | Prometheus |
List of Dashboards
HAProxy Ingress Overview
The dashboard provides information on the HAProxy Ingress Overview.
HAProxy Ingress Service Details
The dashboard provides information on the HAProxy Ingress Service Details.
List of Metrics
Metric name |
---|
haproxy_backend_bytes_in_total |
haproxy_backend_bytes_out_total |
haproxy_backend_client_aborts_total |
haproxy_backend_connect_time_average_seconds |
haproxy_backend_current_queue |
haproxy_backend_http_requests_total |
haproxy_backend_http_responses_total |
haproxy_backend_limit_sessions |
haproxy_backend_queue_time_average_seconds |
haproxy_backend_requests_denied_total |
haproxy_backend_response_time_average_seconds |
haproxy_backend_responses_denied_total |
haproxy_backend_sessions_total |
haproxy_backend_status |
haproxy_frontend_bytes_in_total |
haproxy_frontend_bytes_out_total |
haproxy_frontend_connections_total |
haproxy_frontend_denied_connections_total |
haproxy_frontend_denied_sessions_total |
haproxy_frontend_request_errors_total |
haproxy_frontend_requests_denied_total |
haproxy_frontend_responses_denied_total |
haproxy_frontend_status |
haproxy_process_active_peers |
haproxy_process_current_connection_rate |
haproxy_process_current_run_queue |
haproxy_process_current_session_rate |
haproxy_process_current_tasks |
haproxy_process_jobs |
haproxy_process_ssl_connections_total |
haproxy_process_start_time_seconds |
Preparing the Integration
Enable Prometheus Metrics
For HAProxy to expose Prometheus metrics, the following options must be enabled:
- controller.metrics.enabled = true
- controller.stats.enabled = true
You can check all the properties on the official web page.
If you are deploying HAProxy using the official Helm chart, they can be enabled with the following configurations:
helm install haproxy-ingress haproxy-ingress/haproxy-ingress \
--set-string "controller.stats.enabled = true" \
--set-string "controller.metrics.enabled = true"
This configuration creates the following section in the haproxy.cfg file:
frontend prometheus
mode http
bind :9101
http-request use-service prometheus-exporter if { path /metrics }
http-request use-service lua.send-prometheus-root if { path / }
http-request use-service lua.send-404
no log
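As an optional sanity check (not part of the official procedure), you can port-forward to one of the ingress controller pods and confirm that the Prometheus endpoint defined above responds. The namespace and pod name below are placeholders for your environment.
# Forward the metrics port defined in the frontend above and fetch a sample of metrics
kubectl -n ingress-controller port-forward pod/<haproxy-ingress-pod> 9101:9101 &
curl -s http://localhost:9101/metrics | head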
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting HAProxy Ingress
This document describes important metrics and queries that you can use to monitor and troubleshoot HAProxy Ingress.
Tracking metrics status
You can track the status of HAProxy Ingress metrics with the following alerts:
Exporter process is not serving metrics
# [HAProxy Ingress] Exporter Process Down
absent(haproxy_frontend_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Exporter process is not serving metrics
# [HAProxy Ingress] Exporter Process Down
absent(haproxy_backend_status{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: 'haproxy-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (haproxy-ingress);(.{0}$)
replacement: haproxy-ingress
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "haproxy-ingress"
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (haproxy_backend_bytes_in_total|haproxy_backend_bytes_out_total|haproxy_backend_client_aborts_total|haproxy_backend_connect_time_average_seconds|haproxy_backend_current_queue|haproxy_backend_http_requests_total|haproxy_backend_http_responses_total|haproxy_backend_limit_sessions|haproxy_backend_queue_time_average_seconds|haproxy_backend_requests_denied_total|haproxy_backend_response_time_average_seconds|haproxy_backend_responses_denied_total|haproxy_backend_sessions_total|haproxy_backend_status|haproxy_frontend_bytes_in_total|haproxy_frontend_bytes_out_total|haproxy_frontend_connections_total|haproxy_frontend_denied_connections_total|haproxy_frontend_denied_sessions_total|haproxy_frontend_request_errors_total|haproxy_frontend_requests_denied_total|haproxy_frontend_responses_denied_total|haproxy_frontend_status|haproxy_process_active_peers|haproxy_process_current_connection_rate|haproxy_process_current_run_queue|haproxy_process_current_session_rate|haproxy_process_current_tasks|haproxy_process_jobs|haproxy_process_ssl_connections_total|haproxy_process_start_time_seconds)
action: keep
10 - HAProxy Ingress OpenShift
This integration is enabled by default.
Versions supported: > v3.11
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 28 metrics.
Timeseries generated: The HAProxy ingress router generates ~400 time series per HAProxy router pod.
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift-HAProxy-Router] Router Down | Router HAProxy down. No instances running. | Prometheus |
[OpenShift-HAProxy-Router] HAProxy Down | HAProxy down on a pod. | Prometheus |
[OpenShift-HAProxy-Router] HAProxy Reload Failure | HAProxy reloads are failing. New configurations will not be applied. | Prometheus |
[OpenShift-HAProxy-Router] Percentage of routers low | Less than 75% of the routers are up. | Prometheus |
[OpenShift-HAProxy-Router] Route Down | This alert detects when all servers in a route are down. | Prometheus |
[OpenShift-HAProxy-Router] High Latency | This alert detects high latency in at least one server of the route. | Prometheus |
[OpenShift-HAProxy-Router] Pod Health Check Failure | This alert triggers when there is a recurrent pod health check failure. | Prometheus |
[OpenShift-HAProxy-Router] Queue not empty in route | This alert triggers when a queue is not empty in a route. | Prometheus |
[OpenShift-HAProxy-Router] High error rate in route | This alert triggers when the error rate in a route is higher than 15%. | Prometheus |
[OpenShift-HAProxy-Router] Connection errors in route | This alert triggers when there are recurring connection errors in a route. | Prometheus |
List of Dashboards
OpenShift HAProxy Ingress Overview
The dashboard provides information on the OpenShift HAProxy Ingress overview.
OpenShift HAProxy Ingress Service Details
The dashboard provides information on the OpenShift HAProxy Ingress Service golden signals.
List of Metrics
Metric name |
---|
haproxy_backend_http_average_connect_latency_milliseconds |
haproxy_backend_http_average_queue_latency_milliseconds |
haproxy_backend_http_average_response_latency_milliseconds |
haproxy_backend_up |
haproxy_frontend_bytes_in_total |
haproxy_frontend_bytes_out_total |
haproxy_frontend_connections_total |
haproxy_frontend_current_session_rate |
haproxy_frontend_http_responses_total |
haproxy_process_cpu_seconds_total |
haproxy_process_max_fds |
haproxy_process_resident_memory_bytes |
haproxy_process_start_time_seconds |
haproxy_process_virtual_memory_bytes |
haproxy_server_bytes_in_total |
haproxy_server_bytes_out_total |
haproxy_server_check_failures_total |
haproxy_server_connection_errors_total |
haproxy_server_connections_total |
haproxy_server_current_queue |
haproxy_server_current_sessions |
haproxy_server_downtime_seconds_total |
haproxy_server_http_average_response_latency_milliseconds |
haproxy_server_http_responses_total |
haproxy_server_up |
haproxy_up |
kube_workload_status_desired |
template_router_reload_failure |
Preparing the Integration
OpenShift 3.11
Once the Sysdig agent is deployed, check if it is running on all nodes (compute, master, and infra):
oc get nodes
oc get pods -n sysdig-agent -o wide
If the agent is not running on the infra or master nodes, apply the following patch:
oc patch namespace sysdig-agent --patch-file='sysdig-agent-namespace-patch.yaml'
sysdig-agent-namespace-patch.yaml file
apiVersion: v1
kind: Namespace
metadata:
annotations:
openshift.io/node-selector: ""
OpenShift enforces security by default. Therefore, if you want the Sysdig agent to scrape HAProxy router metrics, you must grant it the necessary permissions:
oc apply -f router-clusterrolebinding-sysdig-agent-oc3.yaml
router-clusterrolebinding-sysdig-agent-oc3.yaml file
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: haproxy-route-monitoring
rules:
- apiGroups:
- route.openshift.io
resources:
- routers/metrics
verbs:
- get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app: sysdig-agent
name: sysdig-router-monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: haproxy-route-monitoring
subjects:
- kind: ServiceAccount
name: sysdig-agent
namespace: sysdig-agent # Remember to change to the namespace where you have the Sysdig agents deployed
OpenShift 4.x
OpenShift enforces security by default. Therefore, if you want the Sysdig agent to scrape HAProxy router metrics, you must grant it the necessary permissions:
oc apply -f router-clusterrolebinding-sysdig-agent-oc4.yaml
router-clusterrolebinding-sysdig-agent-oc4.yaml file
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: router-monitoring-sysdig-agent
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: router-monitoring
subjects:
- kind: ServiceAccount
name: sysdig-agent
namespace: sysdig-agent # Remember to change to the namespace where you have the Sysdig agents deployed
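Optionally, you can confirm that the binding was created and that it references the expected role:
# Inspect the ClusterRoleBinding applied above
oc get clusterrolebinding router-monitoring-sysdig-agent -o yaml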
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting HAProxy Ingress OpenShift
This document describes important metrics and queries that you can use to monitor and troubleshoot HAProxy Ingress OpenShift.
Tracking metrics status
You can track the status of HAProxy Ingress OpenShift metrics with the following alerts:
Exporter process is not serving metrics
# [HAProxy Ingress OpenShift] Exporter Process Down
absent(haproxy_process_start_time_seconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Exporter process is not serving metrics
# [HAProxy Ingress OpenShift] Exporter Process Down
absent(haproxy_server_http_average_response_latency_milliseconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: 'haproxy-router'
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: replace
source_labels: [__address__]
regex: ([^:]+)(?::\d+)?
replacement: $1:1936
target_label: __address__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (router);(.{0}$)
replacement: openshift-haproxy
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "openshift-haproxy"
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (haproxy_backend_http_average_connect_latency_milliseconds|haproxy_backend_http_average_queue_latency_milliseconds|haproxy_backend_http_average_response_latency_milliseconds|haproxy_backend_up|haproxy_frontend_bytes_in_total|haproxy_frontend_bytes_out_total|haproxy_frontend_connections_total|haproxy_frontend_current_session_rate|haproxy_frontend_http_responses_total|haproxy_process_cpu_seconds_total|haproxy_process_max_fds|haproxy_process_resident_memory_bytes|haproxy_process_start_time_seconds|haproxy_process_virtual_memory_bytes|haproxy_server_bytes_in_total|haproxy_server_bytes_out_total|haproxy_server_check_failures_total|haproxy_server_connection_errors_total|haproxy_server_connections_total|haproxy_server_current_queue|haproxy_server_current_sessions|haproxy_server_downtime_seconds_total|haproxy_server_http_average_response_latency_milliseconds|haproxy_server_http_responses_total|haproxy_server_up|haproxy_up|template_router_reload_failure)
action: keep
11 - Harbor
This integration is enabled by default.
Versions supported: > v2.3
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 44 metrics.
Timeseries generated: 800 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Harbor] Harbor Core Is Down | Harbor Core Is Down | Prometheus |
[Harbor] Harbor Database Is Down | Harbor Database Is Down | Prometheus |
[Harbor] Harbor Registry Is Down | Harbor Registry Is Down | Prometheus |
[Harbor] Harbor Redis Is Down | Harbor Redis Is Down | Prometheus |
[Harbor] Harbor Trivy Is Down | Harbor Trivy Is Down | Prometheus |
[Harbor] Harbor JobService Is Down | Harbor JobService Is Down | Prometheus |
[Harbor] Project Quota Is Raising The Limit | Project Quota Is Raising The Limit | Prometheus |
[Harbor] Harbor p99 latency is higher than 10 seconds | Harbor p99 latency is higher than 10 seconds | Prometheus |
[Harbor] Harbor Error Rate is High | Harbor Error Rate is High | Prometheus |
List of Dashboards
Harbor
The dashboard provides information on the Harbor instance status, storage usage, projects, and tasks.
List of Metrics
Metric name |
---|
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
harbor_artifact_pulled |
harbor_core_http_request_duration_seconds |
harbor_jobservice_task_process_time_seconds |
harbor_project_member_total |
harbor_project_quota_byte |
harbor_project_quota_usage_byte |
harbor_project_repo_total |
harbor_project_total |
harbor_quotas_size_bytes |
harbor_task_concurrency |
harbor_task_queue_latency |
harbor_task_queue_size |
harbor_up |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
registry_http_request_duration_seconds_bucket |
registry_http_request_size_bytes_bucket |
registry_http_requests_total |
registry_http_response_size_bytes_bucket |
registry_storage_action_seconds_bucket |
Preparing the Integration
Enable Prometheus Metrics
As described in the Harbor documentation page Configure the Harbor YML File, to make Harbor expose an endpoint for scraping metrics, you need to set the ‘metric.enabled’ configuration to ‘true’.
If you install Harbor with Helm, you need to use the following flag:
--set 'metrics.enabled=true'
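If you deploy Harbor with the installer instead of Helm, the equivalent setting lives in the harbor.yml file. The following fragment is a sketch based on the defaults documented by Harbor; adjust the port and path to your deployment.
metric:
  enabled: true       # expose the Prometheus metrics endpoint
  port: 9090          # default metrics port
  path: /metrics      # default metrics path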
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Harbor
This document describes important metrics and queries that you can use to monitor and troubleshoot Harbor.
Tracking metrics status
You can track the status of Harbor metrics with the following alerts:
Exporter process is not serving metrics
# [Harbor] Exporter Process Down
absent(harbor_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
These are the default agent jobs for this integration:
- job_name: harbor-exporter-default
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_container_port_number
regex: exporter;8080
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:8001
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
- job_name: harbor-core-default
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_container_port_number
regex: core;8080
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:8001
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
- job_name: harbor-registry-default
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_container_port_number
regex: registry;5000
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:8001
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
- job_name: harbor-jobservice-default
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_container_port_number
regex: jobservice;8080
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_number]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:8001
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
12 - Istio
This integration is enabled by default.
Versions supported: 1.14
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 28 metrics.
Timeseries generated: 15 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Istio-Citadel] CSR without success | Some Certificate Signing Requests (CSRs) were not completed successfully | Prometheus |
[Istio-Pilot] Inbound listener rules conflicts | There are conflicts with inbound listener rules | Prometheus |
[Istio-Pilot] Endpoint found in unready state | Endpoint found in unready state | Prometheus |
[Istio] Unstable requests for sidecar injections | Sidecar injection requests are failing | Prometheus |
[Istiod] Istiod Uptime issue | Istiod UpTime is taking more time than usual | Prometheus |
List of Dashboards
Istio v1.14 Control Plane
The dashboard provides information on the Istio Control Plane, Pilot, Galley, Mixer and Citadel.
List of Metrics
Metric name |
---|
citadel_server_csr_count |
citadel_server_success_cert_issuance_count |
galley_validation_failed |
galley_validation_passed |
istiod_uptime_seconds |
pilot_conflict_inbound_listener |
pilot_conflict_outbound_listener_http_over_current_tcp |
pilot_conflict_outbound_listener_tcp_over_current_http |
pilot_conflict_outbound_listener_tcp_over_current_tcp |
pilot_endpoint_not_ready |
pilot_services |
pilot_total_xds_internal_errors |
pilot_total_xds_rejects |
pilot_virt_services |
pilot_xds |
pilot_xds_cds_reject |
pilot_xds_config_size_bytes_bucket |
pilot_xds_eds_reject |
pilot_xds_lds_reject |
pilot_xds_push_context_errors |
pilot_xds_push_time_bucket |
pilot_xds_pushes |
pilot_xds_rds_reject |
pilot_xds_send_time_bucket |
pilot_xds_write_timeout |
sidecar_injection_failure_total |
sidecar_injection_requests_total |
sidecar_injection_success_total |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Istio
This document summarizes the alarms and dashboards for the Istio service. Istio services are built on network rules, so all the alarms and dashboards monitor problems related to traffic and connections between source and destination.
Alarms
Most of the alarms associated with Istio configuration report problems with the Pilot or Citadel servers. These servers are responsible for important Istio configuration.
Citadel controls authentication and identity management between services, and manages certificates in every workload.
Pilot accepts the rules created for traffic behavior provided by the control plane, and converts them into configurations applied by Envoy, based on how configuration aspects are managed locally. Basically, Pilot is responsible for the iptables configuration in every workload.
CSR Without Success
Alarms are defined to notify you of faulty Certificate Signing Requests (CSRs). In order to collect that information, the following metrics are used:
citadel_server_csr_count
citadel_server_success_cert_issuance_count
rate(citadel_server_csr_count[5m]) - rate(citadel_server_success_cert_issuance_count[5m]) > 0
What is a CSR: A certificate signing request (CSR) is one of the first steps towards getting your own SSL/TLS certificate. Generated on the same server you plan to install the certificate on, the CSR contains information such as common name, organization, and country. The Certificate Authority (CA) will use the CSR to create your certificate. The CSR also contains the public key that will be included in your certificate and is signed with the corresponding private key.
Inbound Listener Rules Conflicts
Istio works with networking rules, configuring IP addresses, ports, sockets, and so on to send or receive traffic. The term listeners refers to these configurable values. Be aware of possible errors or conflicts with these rules.
pilot_conflict_inbound_listener > 0
Endpoint Found in Unready State
In order to have a stable platform, you need to verify that all endpoints in your network are working correctly. Use the following alarm to collect that information:
pilot_endpoint_not_ready > 0
Unstable Requests for Sidecar Injections
Istio configures sidecar containers in every pod, and uses this sidecar as the frontend server for all requests that go to or from that workload. To check whether sidecar injection is working properly, use the following query:
rate(sidecar_injection_requests_total [5m]) - rate(sidecar_injection_success_total [5m]) > 0
Dashboards
Traffic
Traffic is the first golden signal to gather. Because Istio provides traffic management itself, the information it exposes is detailed. Istio has three different parts that you can monitor, each with its own metrics: the control plane, Envoy, and the service itself.
This example shows how to gather information about Istio service traffic.
Use the istio_requests_total metric with the relevant labels to collect a broad range of information on different panels.
Client Request Volume and Server Request Volume
The istio_requests_total metric shows the total request traffic from both sides of the connection, using the reporter label.
The reporter label identifies the reporter of the request. It is set to destination if the report is from an Istio proxy server, and to source if the report is from an Istio proxy client or a gateway.
sum (irate(istio_requests_total{reporter="source"}[5m]))
sum (irate(istio_requests_total{reporter="destination"}[5m]))
Incoming Request by Source/Destination and Response Code
This dashboard shows the requests received by both source and destination using the reporter label. The following query segments the HTTP codes with the response_code label.
sum(irate(istio_requests_total{reporter="source"}[5m])) by (source_workload, source_workload_namespace, response_code)
sum(irate(istio_requests_total{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, response_code)
Client/Server Success Rate (non-5xx responses)
The following query builds a dashboard to monitor all the traffic except that related to internal server errors. The reporter label is used to segment on both source and destination.
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="destination"}[5m])))
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="source",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="source"}[5m])))
Errors
The errors summarized in these dashboards are related to the HTTP traffic managed by Istio proxies.
4xx Response Code by Source/Destination
The following query builds a dashboard that reports all the bad requests. It uses the reporter label on both source and destination.
sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"4.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"4.*"}) -1
5xx Response Code by Source/Destination
The following query builds a dashboard to show all the internal server error requests. The query uses the reporter label on both source and destination.
sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"5.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"5.*"}) -1
Latency and Saturation
Both latency and saturation are reported on these dashboards because both are related to request duration and package size.
Client/Server Request Duration
The following query builds a dashboard that shows the request duration at a given quantile.
Note: quantiles can be modified.
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by (le, source_service_name)) / 1000
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000
Incoming Request Size by Source/Destination
The following query builds a dashboard that shows the request size at a given quantile.
Note: quantiles can be modified.
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))
Response Size By Source/Destination
The following query builds a dashboard that shows the response size at a given quantile.
Note: quantiles can be modified.
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))
Agent Configuration
This is the default agent job for this integration:
- job_name: 'istiod'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (discovery);(.{0}$)
replacement: istiod
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "istiod"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (citadel_server_csr_count|citadel_server_success_cert_issuance_count|galley_validation_failed|galley_validation_passed|istiod_uptime_seconds|pilot_conflict_inbound_listener|pilot_conflict_outbound_listener_http_over_current_tcp|pilot_conflict_outbound_listener_tcp_over_current_http|pilot_conflict_outbound_listener_tcp_over_current_tcp|pilot_endpoint_not_ready|pilot_services|pilot_total_xds_internal_errors|pilot_total_xds_rejects|pilot_virt_services|pilot_xds|pilot_xds_cds_reject|pilot_xds_config_size_bytes_bucket|pilot_xds_eds_reject|pilot_xds_lds_reject|pilot_xds_push_context_errors|pilot_xds_push_time_bucket|pilot_xds_pushes|pilot_xds_rds_reject|pilot_xds_send_time_bucket|pilot_xds_write_timeout|sidecar_injection_failure_total|sidecar_injection_requests_total|sidecar_injection_success_total)
action: keep
13 - Istio Envoy
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
Versions supported: 1.14
This integration has 16 metrics.
Timeseries generated: 155 timeseries per envoy
List of Alerts
Alert | Description | Format |
---|---|---|
[Istio-Envoy] High 4xx RequestError Rate | 4xx RequestError Rate is higher than 5% | Prometheus |
[Istio-Envoy] High 5xx RequestError Rate | 5xx RequestError Rate is higher than 5% | Prometheus |
[Istio-Envoy] High Request Latency | Envoy Request Latency is higher than 100ms | Prometheus |
List of Dashboards
Istio v1.14 Workload
The dashboard provides information on the Istio Envoy proxy status.
Istio v1.14 Service
The dashboard provides information on the Istio Service, and request rates and durations for HTTP and TCP connections.
List of Metrics
Metric name |
---|
citadel_server_csr_count |
envoy_cluster_membership_change |
envoy_cluster_membership_healthy |
envoy_cluster_membership_total |
envoy_cluster_upstream_cx_active |
envoy_cluster_upstream_cx_connect_ms_bucket |
envoy_cluster_upstream_rq_active |
envoy_cluster_upstream_rq_pending_active |
envoy_server_days_until_first_cert_expiring |
istio_build |
istio_request_bytes_bucket |
istio_request_duration_milliseconds_bucket |
istio_requests_total |
istio_response_bytes_bucket |
istio_tcp_received_bytes_total |
istio_tcp_sent_bytes_total |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This integration has no default agent job.
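Once Sysdig Support enables the integration and a scrape job for the Envoy sidecars is in place, a quick way to confirm that Envoy metrics are arriving is to run a simple query against one of the metrics listed above, for example:
# Per-namespace request rate as reported by the Envoy sidecars
sum by (kube_namespace_name) (rate(istio_requests_total[5m]))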
14 - Kafka
This integration is enabled by default.
Versions supported: > v2.7.x
This integration uses a standalone exporter that is available in UBI or scratch base image.
This integration has 37 metrics.
Timeseries generated: The JMX-Exporter generates ~270 timeseries and the Kafka-Exporter ~138 timeseries (the number of topics, partitions and consumers increases this number).
List of Alerts
Alert | Description | Format |
---|---|---|
[Kafka] Broker Down | There are fewer Kafka brokers up than expected. The ‘workload’ label of the Kafka Deployment/StatefulSet must be specified. | Prometheus |
[Kafka] No Leader | There is no ActiveController or ’leader’ in the Kafka cluster. | Prometheus |
[Kafka] Too Many Leaders | There is more than one ActiveController or ’leader’ in the Kafka cluster. | Prometheus |
[Kafka] Offline Partitions | There are one or more Offline Partitions. These partitions don’t have an active leader and are hence not writable or readable. | Prometheus |
[Kafka] Under Replicated Partitions | There are one or more Under Replicated Partitions. | Prometheus |
[Kafka] Under In-Sync Replicated Partitions | There are one or more Under In-Sync Replicated Partitions. These partitions will be unavailable to producers who use ‘acks=all’. | Prometheus |
[Kafka] ConsumerGroup Lag Not Decreasing | The ConsumerGroup lag is not decreasing. The Consumers might be down, failing to process the messages and continuously retrying, or their consumption rate is lower than the production rate of messages. | Prometheus |
[Kafka] ConsumerGroup Without Members | The ConsumerGroup doesn’t have any members. | Prometheus |
[Kafka] Producer High ThrottleTime By Client-Id | The Producer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. | Prometheus |
[Kafka] Producer High ThrottleTime By User | The Producer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. | Prometheus |
[Kafka] Producer High ThrottleTime By User And Client-Id | The Producer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. | Prometheus |
[Kafka] Consumer High ThrottleTime By Client-Id | The Consumer has reached its quota and has high throttle time. Applicable when Client-Id-only quotas are being used. | Prometheus |
[Kafka] Consumer High ThrottleTime By User | The Consumer has reached its quota and has high throttle time. Applicable when User-only quotas are being used. | Prometheus |
[Kafka] Consumer High ThrottleTime By User And Client-Id | The Consumer has reached its quota and has high throttle time. Applicable when Client-Id + User quotas are being used. | Prometheus |
List of Dashboards
Kafka
The dashboard provides information on the status of Kafka.
List of Metrics
Metric name |
---|
kafka_brokers |
kafka_consumergroup_current_offset |
kafka_consumergroup_lag |
kafka_consumergroup_members |
kafka_controller_active_controller |
kafka_controller_offline_partitions |
kafka_log_size |
kafka_network_consumer_request_time_milliseconds |
kafka_network_fetch_follower_time_milliseconds |
kafka_network_producer_request_time_milliseconds |
kafka_server_bytes_in |
kafka_server_bytes_out |
kafka_server_consumer_client_byterate |
kafka_server_consumer_client_throttle_time |
kafka_server_consumer_user_byterate |
kafka_server_consumer_user_client_byterate |
kafka_server_consumer_user_client_throttle_time |
kafka_server_consumer_user_throttle_time |
kafka_server_messages_in |
kafka_server_partition_leader_count |
kafka_server_producer_client_byterate |
kafka_server_producer_client_throttle_time |
kafka_server_producer_user_byterate |
kafka_server_producer_user_client_byterate |
kafka_server_producer_user_client_throttle_time |
kafka_server_producer_user_throttle_time |
kafka_server_under_isr_partitions |
kafka_server_under_replicated_partitions |
kafka_server_zookeeper_auth_failures |
kafka_server_zookeeper_disconnections |
kafka_server_zookeeper_expired_sessions |
kafka_server_zookeeper_read_only_connections |
kafka_server_zookeeper_sasl_authentications |
kafka_server_zookeeper_sync_connections |
kafka_topic_partition_current_offset |
kafka_topic_partition_oldest_offset |
kube_workload_status_desired |
Preparing the Integration
Installation of the JMX-Exporter as a Sidecar
The JMX-Exporter can be easily installed in two steps.
First, deploy the ConfigMap that contains the Kafka JMX configuration. The following example is for a Kafka cluster that exposes JMX on port 9010:
helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts
helm repo update
helm -n kafka install kafka-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=9010 --set integrationType=kafka --set onlyCreateJMXConfigMap=true
Then generate a patch file and apply it to your workload (your Kafka Deployment/StatefulSet/DaemonSet). The following example is for a Kafka cluster that exposes JMX on port 9010 and is deployed as a StatefulSet called ‘kafka-cp-kafka’:
helm template kafka-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=9010 --set integrationType=kafka --set onlyCreateSidecarPatch=true > jmx-exporter-sidecar-patch.yaml
kubectl -n kafka patch sts kafka-cp-kafka --patch-file jmx-exporter-sidecar-patch.yaml
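Optionally, you can verify that the patch added the exporter sidecar to the StatefulSet by listing its container names (example for the ‘kafka-cp-kafka’ StatefulSet used above):
# The output should include the JMX exporter container alongside the Kafka container
kubectl -n kafka get sts kafka-cp-kafka -o jsonpath='{.spec.template.spec.containers[*].name}'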
Create Secrets for Authentication for the Kafka-Exporter
Your Kafka cluster external endpoints might be secured by requiring authentication from the clients that connect to them (TLS, SASL+SCRAM, SASL+Kerberos). If the Kafka-Exporter (deployed in the next step) will use these secured external endpoints, you need to create the Kubernetes Secrets described below. If you prefer the Kafka-Exporter to connect to the Kafka cluster through an internal, non-secured (plaintext) endpoint, skip this step.
If using TLS, you’ll need to create a Secret which contains the CA, the client certificate and the client key. The names of these files must be “ca.crt”, “tls.crt” and “tls.key”. The name of the secret can be any name that you want. Example:
kubectl create secret generic kafka-exporter-certs --from-file=./tls.key --from-file=./tls.crt --from-file=./ca.crt --dry-run=true -o yaml | kubectl apply -f -
If using SASL+SCRAM, you’ll need to create a Secret which contains the “username” and “password”. Example:
echo -n 'admin' > username
echo -n '1f2d1e2e67df' > password
kubectl create secret generic kafka-exporter-sasl-scram --from-file=username --from-file=password --dry-run=true -o yaml | kubectl apply -f -
If using SASL+Kerberos, you’ll need to create a Secret which contains the “kerberos.conf”. If the ‘Kerberos Auth Type’ is ‘keytabAuth’, it should also contain the “kerberos.keytab”. Example:
kubectl create secret generic kafka-exporter-sasl-kerberos --from-file=./kerberos.conf --from-file=./kerberos.keytab --dry-run=true -o yaml | kubectl apply -f -
Installing
An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. However, you can also use these Helm charts for expert users:
- https://github.com/sysdiglabs/integrations-charts/tree/main/charts/jmx-exporter
- https://github.com/sysdiglabs/integrations-charts/tree/main/charts/kafka-exporter
Monitoring and Troubleshooting Kafka
Here are some interesting metrics and queries to monitor and troubleshoot Kafka.
Brokers
Broker Down
Let’s get the number of expected Brokers, and the actual number of Brokers up and running. If the numbers are not the same, there might be a problem.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kafka_brokers)
> 0
Leadership
Let’s get the number of Kafka leaders. There should always be one leader. If not, a Kafka misconfiguration or a networking issue might be the problem.
sum(kafka_controller_active_controller) < 1
If there is more than one leader, it might be a temporary situation while the leadership is changing. If this case doesn’t get fixed by itself over time, a split-brain situation might be happening.
sum(kafka_controller_active_controller) > 1
Offline, Under Replicated and In-Sync Under Replicated Partitions
When a Broker goes down, the other Brokers in the cluster will take leadership of the partitions it was leading. If several Brokers go down, or just a few but the topic had a low replication factor, there will be Offline partitions. These partitions don’t have an active leader and are hence not writable or readable, which is most likely dangerous for the business.
Let’s check if there are offline partitions:
sum(kafka_controller_offline_partitions) > 0
If other Brokers had replicas of those partitions, one of them will take leadership and the service won’t be down. In this situation there will be Under Replicated partitions. If there are enough Brokers where these partitions can be replicated, the situation will be fixed by itself over time. If there aren’t enough Brokers, the situation will only be fixed once the Brokers which went down come up again.
The following expression is used to get the under replication partitions:
sum(kafka_server_under_replicated_partitions) > 0
But there is a situation where having no Offline partitions but having Under Replicated partitions might pose a real problem. That’s the case of topics with ‘Minimum In-Sync Replicas’, and Kafka Producers with the configuration ‘acks=all’.
If one of these topics has any partition with fewer replicas than its ‘Minimum In-Sync Replicas’ configuration, and there is a Producer with ‘acks=all’, that Producer won’t be able to produce messages into that partition, since ‘acks=all’ means that it waits for the produced messages to be replicated in all the minimum replicas in the Kafka cluster.
If the Producers have any configuration different than ‘acks=all’, then there won’t be any problem.
This is how Under In-Sync Replicated partitions can be checked:
sum(kafka_server_under_isr_partitions) > 0
Network
Broker Bytes In
Let’s get the amount of bytes produced into each Broker:
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_in)
Broker Bytes Out
Now the same, but for bytes consumed from each Broker:
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_bytes_out)
Broker Messages In
And similar, but for number of messages produced into each Broker:
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_server_messages_in)
Topics
Topic Size
This query returns the size of a topic in the whole Kafka cluster. It also includes the size of all replicas, so increasing the replication factor of a topic will increase the overall size across the Kafka cluster.
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size)
In case of needing the size of a topic in each Broker, use the following query:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name)(kafka_log_size)
In a situation where the Broker disk space is running low, the retention of the topics can be decreased to free up some space. Let’s get the top 10 biggest topics:
topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_log_size))
If this “low disk space” situation happened out of the blue, there might be a problem in a topic with a Producer filling it with unwanted messages. The following query helps find which topics increased their size the most in the past few hours, which makes it easier to find the cause of the sudden increase of messages. It wouldn’t be the first time an exhausted developer wanted to perform a stress test on a topic in a Staging environment, but accidentally did it in Production.
topk(10,sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(delta(kafka_log_size[$__interval])))
Topic Messages
Calculating the number of messages inside a topic is as easy as subtracting the offset of the oldest message from the offset of the newest message:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_current_offset) - sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(kafka_topic_partition_oldest_offset)
But it’s very important to acknowledge that this is only true for topics with ‘compaction’ disabled, since compacted topics might have deleted messages in the middle. To get the number of messages in a compacted topic, a new Consumer must consume all the messages in that topic to count them.
It’s also quite easy to calculate the rate per second of messages being produced into a topic:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, topic)(rate(kafka_topic_partition_current_offset[$__interval]))
ConsumerGroup
ConsumerGroup Lag
Let’s check the ConsumerGroup lag of a Consumer in each partition of a topic:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(kafka_consumergroup_lag)
If the lag of a ConsumerGroup is constantly increasing and never decreases, it might have different causes. The Consumers of the ConsumerGroups might be down, one of them might be failing to process the messages and continuously retrying, or their consumption rate might be lower than the production rate of messages.
A non-stop increasing lag can be detected using the following expression:
(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(kafka_consumergroup_lag) > 0)
and
(sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic)(delta(kafka_consumergroup_lag[2m])) >= 0)
ConsumerGroup Consumption Rate
It might be useful to get the consumption speed of the Consumers of a ConsumerGroup, to detect any issues while processing messages, like internal issues related to the messages, or external issues related to the business. For example, the Consumers might want to send the processed messages to another microservice or another database, but there might be networking issues, or the database performance might be degraded so it slows down the Consumer.
Here you can check the consumption rate:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup, topic, partition)(rate(kafka_consumergroup_current_offset[$__interval]))
ConsumerGroup Members
It might also help to know the number of Consumers in a ConsumerGroup, in case there are fewer than expected:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, consumergroup)(kafka_consumergroup_members)
Quotas
Kafka has the option to enforce quotas on requests to control the Broker resources used by clients (Producers and Consumers).
Quotas can be applied to user, client-id or both groups at the same time.
Each client can utilize this quota per Broker before it gets throttled. Throttling means that the client will need to wait some time before being able to produce or consume messages again.
Production/Consumption Rate
Depending on whether the client is a Consumer or a Producer, and whether the quota is applied at the client-id level, the user level, or both at the same time, a different metric is used:
- kafka_server_producer_client_byterate
- kafka_server_producer_user_byterate
- kafka_server_producer_user_client_byterate
- kafka_server_consumer_client_byterate
- kafka_server_consumer_user_byterate
- kafka_server_consumer_user_client_byterate
Let’s check for example the production rate of a Producer using both user and client-id:
sum by(kube_cluster_name, kube_namespace_name, kube_workload_name, kube_pod_name, user, client_id)(kafka_server_producer_user_client_byterate)
Production/Consumption Throttle Time
Similar to the rate, there are throttle time for the same combinations of clients and quota groups:
- kafka_server_producer_client_throttle_time
- kafka_server_producer_user_throttle_time
- kafka_server_producer_user_client_throttle_time
- kafka_server_consumer_client_throttle_time
- kafka_server_consumer_user_throttle_time
- kafka_server_consumer_user_client_throttle_time
Let’s see in this case if the throttle time of a Consumer using user and client-id is higher than one second, at least in one Broker:
max by(kube_cluster_name, kube_namespace_name, kube_workload_name, user, client_id)(kafka_server_consumer_user_client_throttle_time) > 1000
Agent Configuration
These are the default agent jobs for this integration:
- job_name: 'kafka-exporter-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (kafka-exporter);(.{0}$)
replacement: kafka
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (kafka-exporter);(kafka)
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (kafka_brokers|kafka_consumergroup_current_offset|kafka_consumergroup_lag|kafka_consumergroup_members|kafka_topic_partition_current_offset|kafka_topic_partition_oldest_offset|kube_workload_status_desired)
action: keep
- job_name: 'kafka-jmx-default'
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (kafka-jmx-exporter);(kafka)
replacement: kafka
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (kafka-jmx-exporter);(kafka)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (kafka_controller_active_controller|kafka_controller_offline_partitions|kafka_log_size|kafka_network_consumer_request_time_milliseconds|kafka_network_fetch_follower_time_milliseconds|kafka_network_producer_request_time_milliseconds|kafka_server_bytes_in|kafka_server_bytes_out|kafka_server_consumer_client_byterate|kafka_server_consumer_client_throttle_time|kafka_server_consumer_user_byterate|kafka_server_consumer_user_client_byterate|kafka_server_consumer_user_client_throttle_time|kafka_server_consumer_user_throttle_time|kafka_server_messages_in|kafka_server_partition_leader_count|kafka_server_producer_client_byterate|kafka_server_producer_client_throttle_time|kafka_server_producer_user_byterate|kafka_server_producer_user_client_byterate|kafka_server_producer_user_client_throttle_time|kafka_server_producer_user_throttle_time|kafka_server_under_isr_partitions|kafka_server_under_replicated_partitions|kafka_server_zookeeper_auth_failures|kafka_server_zookeeper_disconnections|kafka_server_zookeeper_expired_sessions|kafka_server_zookeeper_read_only_connections|kafka_server_zookeeper_sasl_authentications|kafka_server_zookeeper_sync_connections)
action: keep
15 - KEDA
This integration is enabled by default.
Versions supported: > v2.0
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 6 metrics.
Timeseries generated: 3 metrics per Keda deployment + 1 metric per API metric timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Keda] Errors in Scaled Object | Errors detected in scaled object | Prometheus |
List of Dashboards
Keda
The dashboard provides information on the errors, values of the metrics generated and replicas of the scaled object.
List of Metrics
Metric name |
---|
keda_metrics_adapter_scaled_object_errors |
keda_metrics_adapter_scaler_metrics_value |
kubernetes.hpa.replicas.current |
kubernetes.hpa.replicas.desired |
kubernetes.hpa.replicas.max |
kubernetes.hpa.replicas.min |
Preparing the Integration
Enable Prometheus Metrics
KEDA exposes Prometheus metrics and annotates the metrics API server pod with the standard Prometheus annotations.
Make sure that Prometheus metrics are enabled. If you install KEDA with Helm, use the following flag:
--set prometheus.metricServer.enabled=true
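For example, a complete install with metrics enabled might look like the following sketch (the kedacore chart repository, the release name keda, and the namespace are illustrative assumptions; adjust them to your environment):
# Add the KEDA chart repository (assumed) and install with the Prometheus metrics server enabled
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace \
  --set prometheus.metricServer.enabled=true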
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting KEDA
This document describes important metrics and queries that you can use to monitor and troubleshoot KEDA.
Tracking metrics status
You can track the status of KEDA metrics with the following alert: Exporter process is not serving metrics
# [KEDA] Exporter Process Down
absent(keda_metrics_adapter_scaled_object_errors{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: keda-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (keda-operator-metrics-apiserver);(.{0}$)
replacement: keda
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "keda"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
16 - Kube State Metrics OSS
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration has 19 metrics.
This integration refers to the official OSS KSM exporter for Kubernetes.
List of Dashboards
KSM Pod Status & Performance
The dashboard provides information on the Pod Status and Performance.
KSM Workload Status & Performance
The dashboard provides information on the Workload Status and Performance.
KSM Container Resource Usage & Troubleshooting
The dashboard provides information on the Container Resource Usage and Troubleshooting.
KSM Cluster / Namespace Available Resources
The dashboard provides information on the Cluster and Namespace Available Resources.
List of Metrics
Metric name |
---|
ksm_container_cpu_cores_used |
ksm_container_cpu_quota_used_percent |
ksm_container_info |
ksm_container_memory_limit_used_percent |
ksm_container_memory_used_bytes |
ksm_kube_node_status_allocatable |
ksm_kube_node_status_capacity |
ksm_kube_pod_container_status_restarts_total |
ksm_kube_pod_container_status_terminated_reason |
ksm_kube_pod_container_status_waiting_reason |
ksm_kube_pod_status_ready |
ksm_kube_pod_status_reason |
ksm_kube_resourcequota |
ksm_workload_status_desired |
ksm_workload_status_ready |
kube_pod_container_cpu_request |
kube_pod_container_memory_request |
kube_pod_container_resource_limits_cpu_cores |
kube_pod_container_resource_limits_memory_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This integration has no default agent job.
17 - Kubernetes
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration has 70 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Kubernetes] Container Waiting | Container in waiting status for long time (CrashLoopBackOff, ImagePullErr…) | Prometheus |
[Kubernetes] Container Restarting | Container restarting | Prometheus |
[Kubernetes] Pod Not Ready | Pod in not ready status | Prometheus |
[Kubernetes] Init Container Waiting For a Long Time | Init container in waiting state (CrashLoopBackOff, ImagePullErr…) | Prometheus |
[Kubernetes] Pod Container Creating For a Long Time | Pod is stuck in ContainerCreating state | Prometheus |
[Kubernetes] Pod Container Terminated With Error | Pod Container Terminated With Error (OOMKilled, Error…) | Prometheus |
[Kubernetes] Init Container Terminated With Error | Init Container Terminated With Error (OOMKilled, Error…) | Prometheus |
[Kubernetes] Workload with Pods not Ready | Workload with Pods not Ready (Evicted, NodeLost, UnexpectedAdmissionError) | Prometheus |
[Kubernetes] Workload Replicas Mismatch | There are pods in the workload that could not start | Prometheus |
[Kubernetes] Pod Not Scheduled For DaemonSet | Pods cannot be scheduled for DaemonSet | Prometheus |
[Kubernetes] Pods In DaemonSet Incorrectly Scheduled | There are pods from a DaemonSet that should not be running | Prometheus |
[Kubernetes] CPU Overcommit | CPU OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus |
[Kubernetes] Memory Overcommit | Memory OverCommit in cluster. If one node fails, the cluster will not be able to schedule all the current pods. | Prometheus |
[Kubernetes] CPU OverUsage | CPU OverUsage in cluster. If one node fails, the cluster will not have enough CPU to run all the current pods. | Prometheus |
[Kubernetes] Memory OverUsage | Memory OverUsage in cluster. If one node fails, the cluster will not have enough memory to run all the current pods. | Prometheus |
[Kubernetes] Container CPU Throttling | Container CPU usage next to limit. Possible CPU Throttling. | Prometheus |
[Kubernetes] Container Memory Next To Limit | Container memory usage next to limit. Risk of Out Of Memory Kill. | Prometheus |
[Kubernetes] Container CPU Unused | Container unused CPU higher than 85% of request for 8 hours. | Prometheus |
[Kubernetes] Container Memory Unused | Container unused Memory higher than 85% of request for 8 hours. | Prometheus |
[Kubernetes] Node Not Ready | Node in Not-Ready condition | Prometheus |
[Kubernetes] Not All Nodes Are Ready | Not all nodes are in Ready condition. | Prometheus |
[Kubernetes] Too Many Pods In Node | Node close to its pod limit. | Prometheus |
[Kubernetes] Node Readiness Flapping | Node availability is unstable. | Prometheus |
[Kubernetes] Nodes Disappeared | Fewer nodes in the cluster than 30 minutes ago. | Prometheus |
[Kubernetes] All Nodes Gone In Cluster | All Nodes Gone In Cluster. | Prometheus |
[Kubernetes] Node CPU High Usage | High usage of CPU in node. | Prometheus |
[Kubernetes] Node Memory High Usage | High usage of memory in node. Risk of pod eviction. | Prometheus |
[Kubernetes] Node Root File System Almost Full | Root file system in node almost full. To include other file systems, change the value of the device label from '.*root.*' to your device name | Prometheus |
[Kubernetes] Max Schedulable Pod Less Than 1 CPU Core | The maximum schedulable CPU request in a pod is less than 1 core. | Prometheus |
[Kubernetes] Max Schedulable Pod Less Than 512Mb Memory | The maximum schedulable memory request in a pod is less than 512Mb. | Prometheus |
[Kubernetes] HPA Desired Scale Up Replicas Unreached | HPA could not reach the desired scaled up replicas for long time. | Prometheus |
[Kubernetes] HPA Desired Scale Down Replicas Unreached | HPA could not reach the desired scaled down replicas for long time. | Prometheus |
[Kubernetes] Job failed to complete | Job failed to complete | Prometheus |
[Kubernetes] Cluster is reaching maximum pod capacity (95%) | Review cluster pod capacity to ensure pods can be scheduled. | Prometheus |
List of Dashboards
Workload Status & Performance
The dashboard provides information on the Workload Status and Performance.
Pod Status & Performance
The dashboard provides information on the Pod Status and Performance.
Cluster / Namespace Available Resources
The dashboard provides information on the Cluster and Namespace Available Resources.
Cluster Capacity Planning
Dashboard used for Cluster Capacity Planning.
Container Resource Usage & Troubleshooting
The dashboard provides information on the Container Resource Usage and Troubleshooting.
Node Status & Performance
The dashboard provides information on the Node Status and Performance.
Pod Rightsizing & Workload Capacity Optimization
Dashboard used for Pod Rightsizing and Workload Capacity Optimization.
Pod Scheduling Troubleshooting
Dashboard used for Pod Scheduling Troubleshooting.
Horizontal Pod Autoscaler
The dashboard provides information on the Horizontal Pod Autoscalers.
Kubernetes Jobs
The dashboard provides information on the Kubernetes Jobs.
List of Metrics
Metric name |
---|
container.image |
container.image.tag |
kube_cronjob_next_schedule_time |
kube_cronjob_status_active |
kube_cronjob_status_last_schedule_time |
kube_daemonset_status_current_number_scheduled |
kube_daemonset_status_desired_number_scheduled |
kube_daemonset_status_number_misscheduled |
kube_daemonset_status_number_ready |
kube_hpa_status_current_replicas |
kube_hpa_status_desired_replicas |
kube_job_complete |
kube_job_failed |
kube_job_spec_completions |
kube_job_status_active |
kube_namespace_labels |
kube_node_info |
kube_node_status_allocatable |
kube_node_status_allocatable_cpu_cores |
kube_node_status_allocatable_memory_bytes |
kube_node_status_capacity |
kube_node_status_capacity_cpu_cores |
kube_node_status_capacity_memory_bytes |
kube_node_status_capacity_pods |
kube_node_status_condition |
kube_node_sysdig_host |
kube_pod_container_info |
kube_pod_container_resource_limits |
kube_pod_container_resource_requests |
kube_pod_container_status_restarts_total |
kube_pod_container_status_terminated_reason |
kube_pod_container_status_waiting_reason |
kube_pod_info |
kube_pod_init_container_status_terminated_reason |
kube_pod_init_container_status_waiting_reason |
kube_pod_status_ready |
kube_resourcequota |
kube_workload_pods_status_reason |
kube_workload_status_desired |
kube_workload_status_ready |
kubernetes.hpa.replicas.current |
kubernetes.hpa.replicas.desired |
kubernetes.hpa.replicas.max |
kubernetes.hpa.replicas.min |
sysdig_container_cpu_cores_used |
sysdig_container_cpu_quota_used_percent |
sysdig_container_info |
sysdig_container_memory_limit_used_percent |
sysdig_container_memory_used_bytes |
sysdig_container_net_connection_in_count |
sysdig_container_net_connection_out_count |
sysdig_container_net_connection_total_count |
sysdig_container_net_error_count |
sysdig_container_net_http_error_count |
sysdig_container_net_http_request_time |
sysdig_container_net_http_statuscode_request_count |
sysdig_container_net_in_bytes |
sysdig_container_net_out_bytes |
sysdig_container_net_request_count |
sysdig_container_net_request_time |
sysdig_fs_free_bytes |
sysdig_fs_inodes_used_percent |
sysdig_fs_total_bytes |
sysdig_fs_used_bytes |
sysdig_fs_used_percent |
sysdig_program_cpu_cores_used |
sysdig_program_cpu_used_percent |
sysdig_program_memory_used_bytes |
sysdig_program_net_connection_total_count |
sysdig_program_net_total_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This integration has no default agent job.
18 - Kubernetes API server
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 41 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Kubernetes API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus |
[Kubernetes API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus |
[Kubernetes API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus |
[Kubernetes API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus |
[Kubernetes API Server] High 4xx RequestError Rate | API-Server High 4xx Request Error Rate | Prometheus |
[Kubernetes API Server] High 5xx RequestError Rate | API-Server High 5xx Request Error Rate | Prometheus |
[Kubernetes API Server] High Request Latency | API-Server High Request Latency | Prometheus |
List of Dashboards
Kubernetes API Server
The dashboard provides information on the Kubernetes API Server.
List of Metrics
Metric name |
---|
apiserver_admission_controller_admission_duration_seconds_count |
apiserver_admission_controller_admission_duration_seconds_sum |
apiserver_admission_webhook_admission_duration_seconds_count |
apiserver_admission_webhook_admission_duration_seconds_sum |
apiserver_client_certificate_expiration_seconds_bucket |
apiserver_client_certificate_expiration_seconds_count |
apiserver_request_duration_seconds_count |
apiserver_request_duration_seconds_sum |
apiserver_request_total |
apiserver_requested_deprecated_apis |
apiserver_response_sizes_count |
apiserver_response_sizes_sum |
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
process_resident_memory_bytes |
workqueue_adds_total |
workqueue_depth |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: kubernetes-apiservers-default
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
regex: kube-system;kube-apiserver
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_container_name
- source_labels:
- __address__
action: replace
target_label: __address__
regex: (.+)(:\d.+)
replacement: $1:443
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- action: replace
source_labels:
- __name__
- resource
target_label: k8sresource
regex: (apiserver_requested_deprecated_apis);(.+)
replacement: $2
- action: labeldrop
regex: "^(resource|resourcescope|subresource)$"
- source_labels: [__name__]
regex: (apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request_duration_seconds_count|apiserver_request_duration_seconds_sum|apiserver_request_total|apiserver_requested_deprecated_apis|apiserver_response_sizes_count|apiserver_response_sizes_sum|go_build_info|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|process_cpu_seconds_total|process_max_fds|process_open_fds|process_resident_memory_bytes|workqueue_adds_total|workqueue_depth)
action: keep
19 - Kubernetes controller manager
This integration is enabled by default.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 42 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Kubernetes controller manager] High 4xx RequestError Rate | Kubernetes Controller Manager High 4xx Request Error Rate | Prometheus |
[Kubernetes controller manager] High 5xx RequestError Rate | Kubernetes Controller Manager High 5xx Request Error Rate | Prometheus |
List of Dashboards
Kubernetes Controller Manager
The dashboard provides information on the Kubernetes Controller Manager.
List of Metrics
Metric name |
---|
cloudprovider_aws_api_request_duration_seconds_count |
cloudprovider_aws_api_request_duration_seconds_sum |
cloudprovider_aws_api_request_errors |
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
rest_client_request_duration_seconds_count |
rest_client_request_duration_seconds_sum |
rest_client_requests_total |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
workqueue_adds_total |
workqueue_depth |
workqueue_queue_duration_seconds_count |
workqueue_queue_duration_seconds_sum |
workqueue_retries_total |
workqueue_unfinished_work_seconds |
workqueue_work_duration_seconds_count |
workqueue_work_duration_seconds_sum |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: kube-controller-manager-default
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'kube-system/kube-controller-manager.+'
- source_labels:
- __address__
action: replace
target_label: __address__
regex: (.+?)(\\:\\d)?
replacement: $1:10257
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (cloudprovider_aws_api_request_duration_seconds_count|cloudprovider_aws_api_request_duration_seconds_sum|cloudprovider_aws_api_request_errors|go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
action: keep
20 - Kubernetes CoreDNS
This integration is enabled by default.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 37 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[CoreDNS] Error High | High error rate | Prometheus |
[CoreDNS] Latency High | Latency High | Prometheus |
List of Dashboards
Kubernetes CoreDNS
The dashboard provides information on the Kubernetes CoreDNS.
List of Metrics
Metric name |
---|
coredns_cache_hits_total |
coredns_cache_misses_total |
coredns_dns_request_duration_seconds_bucket |
coredns_dns_request_size_bytes_bucket |
coredns_dns_requests_total |
coredns_dns_response_size_bytes_bucket |
coredns_dns_responses_total |
coredns_forward_request_duration_seconds_bucket |
coredns_panics_total |
coredns_plugin_enabled |
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
process_resident_memory_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: kube-dns-default
honor_labels: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'kube-system/coredns.+'
- source_labels:
- __address__
action: keep
regex: (.*:9153)
- source_labels:
- __meta_kubernetes_pod_name
action: replace
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
21 - Kubernetes etcd
This integration is enabled by default.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 54 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Etcd] Etcd Members Down | There are members down. | Prometheus |
[Etcd] Etcd Insufficient Members | Etcd cluster has insufficient members | Prometheus |
[Etcd] Etcd No Leader | Member has no leader. | Prometheus |
[Etcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus |
[Etcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests | Prometheus |
[Etcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time | Prometheus |
[Etcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance | Prometheus |
[Etcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high | Prometheus |
[Etcd] Etcd High Commit Durations | 99th percentile commit durations are too high | Prometheus |
[Etcd] Etcd High Number Of Failed HTTP Requests | High number of failed HTTP requests | Prometheus |
[Etcd] Etcd HTTP Requests Slow | HTTP requests are slow | Prometheus |
List of Dashboards
Kubernetes Etcd
The dashboard provides information on the Kubernetes Etcd.
List of Metrics
Metric name |
---|
etcd_debugging_mvcc_db_total_size_in_bytes |
etcd_disk_backend_commit_duration_seconds_bucket |
etcd_disk_wal_fsync_duration_seconds_bucket |
etcd_grpc_proxy_cache_hits_total |
etcd_grpc_proxy_cache_misses_total |
etcd_http_failed_total |
etcd_http_received_total |
etcd_http_successful_duration_seconds_bucket |
etcd_mvcc_db_total_size_in_bytes |
etcd_network_client_grpc_received_bytes_total |
etcd_network_client_grpc_sent_bytes_total |
etcd_network_peer_received_bytes_total |
etcd_network_peer_received_failures_total |
etcd_network_peer_round_trip_time_seconds_bucket |
etcd_network_peer_sent_bytes_total |
etcd_network_peer_sent_failures_total |
etcd_server_has_leader |
etcd_server_id |
etcd_server_leader_changes_seen_total |
etcd_server_proposals_applied_total |
etcd_server_proposals_committed_total |
etcd_server_proposals_failed_total |
etcd_server_proposals_pending |
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
grpc_server_handled_total |
grpc_server_handling_seconds_bucket |
grpc_server_started_total |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
Preparing the Integration
Add Certificate for Sysdig Agent
Disclaimer: This patch only works in vanilla Kubernetes
kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"volumes":[{"hostPath":{"path":"/etc/kubernetes/pki/etcd-manager-main","type":"DirectoryOrCreate"},"name":"etcd-certificates"}]}}}}'
kubectl -n sysdig-agent patch ds sysdig-agent -p '{"spec":{"template":{"spec":{"containers":[{"name":"sysdig","volumeMounts": [{"mountPath": "/etc/kubernetes/pki/etcd-manager","name": "etcd-certificates"}]}]}}}}'
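After applying the patches, you can optionally verify that the certificates are visible from the agent container. The daemonset, the container name sysdig, and the mount path below come from the patches above; kubectl exec on a daemonset resource targets one of its pods (on older kubectl versions, exec into a specific sysdig-agent pod instead):
kubectl -n sysdig-agent exec ds/sysdig-agent -c sysdig -- ls /etc/kubernetes/pki/etcd-manager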
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: etcd-default
scheme: https
tls_config:
insecure_skip_verify: true
cert_file: /etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt
key_file: /etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'kube-system/etcd-manager-main.+'
- source_labels:
- __address__
action: replace
target_label: __address__
regex: (.+?)(\\:\\d)?
replacement: $1:4001
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_failed_total|go_build_info|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|process_cpu_seconds_total|grpc_server_started_total|grpc_server_started_total|grpc_server_started_total|grpc_server_handled_total|etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_disk_backend_commit_duration_seconds_bucket|sysdig_container_memory_used_bytes|etcd_server_proposals_committed_total|etcd_server_proposals_applied_total|sysdig_container_cpu_cores_used|go_goroutines|grpc_server_handled_total|grpc_server_handled_total|etcd_server_id|etcd_disk_backend_commit_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|process_max_fds|process_open_fds|etcd_server_proposals_pending|etcd_network_peer_sent_failures_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_client_grpc_sent_bytes_total|etcd_network_client_grpc_received_bytes_total|etcd_network_peer_sent_bytes_total|etcd_network_peer_received_bytes_total|grpc_server_handling_seconds_bucket|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count|etcd_mvcc_db_total_size_in_bytes)
action: keep
22 - Kubernetes kube-proxy
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 10 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[KubeProxy] Kube Proxy Down | KubeProxy detected down | Prometheus |
[KubeProxy] High Rest Client Latency | High Rest Client Latency detected | Prometheus |
[KubeProxy] High Rule Sync Latency | High Rule Sync Latency detected | Prometheus |
[KubeProxy] Too Many 500 Code | Too Many 500 Code detected | Prometheus |
List of Dashboards
Kubernetes Proxy
The dashboard provides information on the Kubernetes Proxy.
List of Metrics
Metric name |
---|
go_goroutines |
kube_node_info |
kubeproxy_network_programming_duration_seconds_bucket |
kubeproxy_network_programming_duration_seconds_count |
kubeproxy_sync_proxy_rules_duration_seconds_bucket |
kubeproxy_sync_proxy_rules_duration_seconds_count |
process_cpu_seconds_total |
process_resident_memory_bytes |
rest_client_request_duration_seconds_bucket |
rest_client_requests_total |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: kubernetes-kube-proxy-default
honor_labels: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'kube-system/kube-proxy.+'
- source_labels:
- __address__
action: replace
target_label: __address__
regex: (.+?)(\\:\\d+)?
replacement: $1:10249
- source_labels:
- __meta_kubernetes_pod_name
action: replace
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (up|kubeproxy_sync_proxy_rules_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_network_programming_duration_seconds_bucket|rest_client_requests_total|rest_client_request_duration_seconds_bucket|rest_client_request_duration_seconds_bucket|process_resident_memory_bytes|process_cpu_seconds_total|go_goroutines|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
action: keep
23 - Kubernetes kubelet
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 25 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[k8s-kubelet] Kubelet Too Many Pods | Kubelet Too Many Pods | Prometheus |
[k8s-kubelet] Kubelet Pod Lifecycle Event Generator Duration High | Kubelet Pod Lifecycle Event Generator Duration High | Prometheus |
[k8s-kubelet] Kubelet Pod StartUp Latency High | Kubelet Pod StartUp Latency High | Prometheus |
[k8s-kubelet] Kubelet Down | Kubelet Down | Prometheus |
List of Dashboards
Kubernetes Kubelet
The dashboard provides information on the Kubernetes Kubelet.
List of Metrics
Metric name |
---|
go_goroutines |
kube_node_status_capacity_pods |
kube_node_status_condition |
kubelet_cgroup_manager_duration_seconds_bucket |
kubelet_cgroup_manager_duration_seconds_count |
kubelet_node_config_error |
kubelet_pleg_relist_duration_seconds_bucket |
kubelet_pleg_relist_interval_seconds_bucket |
kubelet_pod_start_duration_seconds_bucket |
kubelet_pod_start_duration_seconds_count |
kubelet_pod_worker_duration_seconds_bucket |
kubelet_pod_worker_duration_seconds_count |
kubelet_running_containers |
kubelet_running_pod_count |
kubelet_running_pods |
kubelet_runtime_operations_duration_seconds_bucket |
kubelet_runtime_operations_errors_total |
kubelet_runtime_operations_total |
process_cpu_seconds_total |
process_resident_memory_bytes |
rest_client_request_duration_seconds_bucket |
rest_client_requests_total |
storage_operation_duration_seconds_bucket |
storage_operation_duration_seconds_count |
volume_manager_total_volumes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: k8s-kubelet-default
scrape_interval: 60s
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_node_address_InternalIP]
regex: __HOSTIPS__
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
replacement: kube_node_label_$1
- replacement: localhost:10250
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_node_name]
target_label: kube_node_name
- action: replace
source_labels: [__meta_kubernetes_namespace]
target_label: kube_namespace_name
metric_relabel_configs:
# - source_labels: [__name__]
# regex: "kubelet_volume(.+)|storage(.+)"
# action: drop
- source_labels: [__name__]
regex: (go_goroutines|kube_node_status_capacity_pods|kube_node_status_condition|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_containers|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubernetes_build_info|process_cpu_seconds_total|process_resident_memory_bytes|rest_client_request_duration_seconds_bucket|rest_client_requests_total|volume_manager_total_volumes)
action: keep
24 - Kubernetes PVC
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 9 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[k8s-pvc] PV Not Available | Persistent Volume not available | Prometheus |
[k8s-pvc] PVC Pending For a Long Time | Persistent Volume Claim not available | Prometheus |
[k8s-pvc] PVC Lost | Persistent Volume Claim lost | Prometheus |
[k8s-pvc] PVC Storage Usage Is Reaching The Limit | Persistent Volume Claim storage at 95% | Prometheus |
[k8s-pvc] PVC Inodes Usage Is Reaching The Limit | PVC inodes Usage Is Reaching The Limit | Prometheus |
[k8s-pvc] PV Full In Four Days | Persistent Volume Full In Four Days | Prometheus |
List of Dashboards
PVC and Storage
The dashboard provides information on the Kubernetes PVC and Storage.
List of Metrics
Metric name |
---|
kube_persistentvolume_status_phase |
kube_persistentvolumeclaim_status_phase |
kubelet_volume_stats_available_bytes |
kubelet_volume_stats_capacity_bytes |
kubelet_volume_stats_inodes |
kubelet_volume_stats_inodes_used |
kubelet_volume_stats_used_bytes |
storage_operation_duration_seconds_bucket |
storage_operation_duration_seconds_count |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: k8s-pvc-default
scrape_interval: 60s
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_node_address_InternalIP]
regex: __HOSTIPS__
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
replacement: kube_node_label_$1
- replacement: localhost:10250
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_node_name]
target_label: kube_node_name
metric_relabel_configs:
# - source_labels: [__name__]
# regex: "kubelet_volume(.+)"
# action: keep
- source_labels: [__name__]
regex: (kube_persistentvolume_status_phase|kube_persistentvolumeclaim_status_phase|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
25 - Kubernetes Scheduler
This integration is enabled by default.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 45 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Kubernetes Scheduler] Failed Attempts to Schedule Pods | The error rate of attempts to schedule pods is high. | Prometheus |
List of Dashboards
Kubernetes Scheduler
The dashboard provides information on the Kubernetes Scheduler.
List of Metrics
Metric name |
---|
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
rest_client_request_duration_seconds_count |
rest_client_request_duration_seconds_sum |
rest_client_requests_total |
scheduler_e2e_scheduling_duration_seconds_count |
scheduler_e2e_scheduling_duration_seconds_sum |
scheduler_pending_pods |
scheduler_pod_scheduling_attempts_count |
scheduler_pod_scheduling_attempts_sum |
scheduler_schedule_attempts_total |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
workqueue_adds_total |
workqueue_depth |
workqueue_queue_duration_seconds_count |
workqueue_queue_duration_seconds_sum |
workqueue_retries_total |
workqueue_unfinished_work_seconds |
workqueue_work_duration_seconds_count |
workqueue_work_duration_seconds_sum |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: kube-scheduler-default
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'kube-system/kube-scheduler.+'
- source_labels:
- __address__
action: replace
target_label: __address__
regex: (.+?)(\\:\\d)?
replacement: $1:10259
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_sum|apiserver_client_certificate_expiration_seconds_count)
action: keep
26 - Kubernetes storage
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 8 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[k8s-storage] High Storage Error Rate | High Storage Error Rate | Prometheus |
[k8s-storage] High Storage Latency | High Storage Latency | Prometheus |
List of Metrics
Metric name |
---|
kube_persistentvolume_status_phase |
kube_persistentvolumeclaim_status_phase |
kubelet_volume_stats_capacity_bytes |
kubelet_volume_stats_inodes |
kubelet_volume_stats_inodes_used |
kubelet_volume_stats_used_bytes |
storage_operation_duration_seconds_bucket |
storage_operation_duration_seconds_count |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This is the default agent job for this integration:
- job_name: k8s-storage-default
scrape_interval: 60s
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_node_address_InternalIP]
regex: __HOSTIPS__
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
replacement: kube_node_label_$1
- replacement: localhost:10250
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_node_name]
target_label: kube_node_name
metric_relabel_configs:
# - source_labels: [__name__]
# regex: "storage(.+)"
# action: keep
- source_labels: [__name__]
regex: (storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
27 - Linux
This integration is enabled by default.
This integration has 19 metrics.
List of Alerts
Alert | Description | Format |
---|---|---|
[Linux] High CPU Usage | CPU usage of the Linux instance reached 95% | Prometheus |
[Linux] High Disk Usage | Disk full over 95% in host | Prometheus |
[Linux] Disk Will Fill In 12 Hours | Disk full in 12h in host | Prometheus |
[Linux] High Physical Memory Usage | High physical memory usage in instance | Prometheus |
List of Dashboards
Linux Host Overview
The dashboard provides a general overview for a regular Linux host.
List of Metrics
Metric name |
---|
sysdig_fs_free_percent |
sysdig_fs_used_percent |
sysdig_host_cpu_cores_used_percent |
sysdig_host_cpu_system_percent |
sysdig_host_file_open_count |
sysdig_host_file_total_bytes |
sysdig_host_fs_free_bytes |
sysdig_host_fs_used_percent |
sysdig_host_memory_available_bytes |
sysdig_host_memory_used_percent |
sysdig_host_net_connection_in_count |
sysdig_host_net_connection_out_count |
sysdig_host_net_total_bytes |
sysdig_program_cpu_used_percent |
sysdig_program_file_open_count |
sysdig_program_memory_used_percent |
sysdig_program_net_connection_total_count |
sysdig_program_net_request_in_count |
sysdig_program_net_total_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting Linux
The Linux integration uses the out-of-the-box sysdig_host_* and sysdig_program_* metrics in the dashboards and alerts.
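As an illustrative PromQL sketch, a query equivalent to the High CPU Usage alert listed above could be written as follows (the 95 threshold mirrors the alert description and is an assumption to adjust for your environment):
# Hosts whose CPU usage is above 95% (threshold is an assumption)
sysdig_host_cpu_cores_used_percent > 95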
Agent Configuration
This integration has no default agent job.
28 - Memcached
This integration is enabled by default.
Versions supported: > v1.5
This integration uses a sidecar exporter that is available in UBI or scratch base image.
This integration has 13 metrics.
Timeseries generated: 20 series per instance
List of Alerts
Alert | Description | Format |
---|---|---|
[Memcached] Instance Down | Instance is not reachable | Prometheus |
[Memcached] Low UpTime | Uptime of less than 1 hour in a Memcached instance | Prometheus |
[Memcached] Connection Throttled | Connection throttled because the maximum number of requests per event process was reached | Prometheus |
[Memcached] Connections Close To The Limit 85% | The number of connections is close to the limit | Prometheus |
[Memcached] Connections Limit Reached | Reached the maximum number of connections, causing a connection error | Prometheus |
List of Dashboards
Memcached
The dashboard provides information on the status and performance of the Memcached instance.
List of Metrics
Metric name |
---|
memcached_commands_total |
memcached_connections_listener_disabled_total |
memcached_connections_yielded_total |
memcached_current_bytes |
memcached_current_connections |
memcached_current_items |
memcached_items_evicted_total |
memcached_items_reclaimed_total |
memcached_items_total |
memcached_limit_bytes |
memcached_max_connections |
memcached_up |
memcached_uptime_seconds |
Preparing the Integration
No preparations are required for this integration.
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporter with this Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/memcached-exporter
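For reference, a Helm-based installation might look like the following sketch (the chart repository URL, release name, and namespace are assumptions; review the chart's values for the options your deployment needs):
# Install the memcached-exporter chart from the sysdiglabs integrations-charts repository (assumed)
helm install -n Your-Application-Namespace memcached-exporter \
  --repo https://sysdiglabs.github.io/integrations-charts memcached-exporter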
Agent Configuration
This is the default agent job for this integration:
- job_name: memcached-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "memcached"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (memcached_commands_total|memcached_connections_listener_disabled_total|memcached_connections_yielded_total|memcached_current_bytes|memcached_current_connections|memcached_current_items|memcached_items_evicted_total|memcached_items_reclaimed_total|memcached_items_total|memcached_limit_bytes|memcached_max_connections|memcached_up|memcached_uptime_seconds)
action: keep
29 - MongoDB
This integration is enabled by default.
Versions supported: > v4.2
This integration uses a standalone exporter that is available in UBI or scratch base image.
This integration has 28 metrics.
Timeseries generated: 500 series per instance
List of Alerts
Alert | Description | Format |
---|---|---|
[MongoDB] Instance Down | Mongo server detected down by instance | Prometheus |
[MongoDB] Uptime less than one hour | MongoDB instance up for less than one hour | Prometheus |
[MongoDB] Asserts detected | Asserts detected in instance | Prometheus |
[MongoDB] High Latency | High latency in instance | Prometheus |
[MongoDB] High Ticket Utilization | Ticket usage over 75% in instance | Prometheus |
[MongoDB] Recurrent Cursor Timeout | Recurrent cursors timeout in instance | Prometheus |
[MongoDB] Recurrent Memory Page Faults | Recurrent memory page faults in instance | Prometheus |
List of Dashboards
MongoDB Instance Health
The dashboard provides information on the connections, cache hit rate, error rate, latency and traffic of one of the databases of the MongoDB instance.
MongoDB Database Details
The dashboard provides information on the status, error rate and resource usage of a MongoDB instance.
List of Metrics
Metric name |
---|
mongodb_asserts_total |
mongodb_connections |
mongodb_extra_info_page_faults_total |
mongodb_instance_uptime_seconds |
mongodb_memory |
mongodb_mongod_db_collections_total |
mongodb_mongod_db_data_size_bytes |
mongodb_mongod_db_index_size_bytes |
mongodb_mongod_db_indexes_total |
mongodb_mongod_db_objects_total |
mongodb_mongod_global_lock_client |
mongodb_mongod_global_lock_current_queue |
mongodb_mongod_global_lock_ratio |
mongodb_mongod_metrics_cursor_open |
mongodb_mongod_metrics_cursor_timed_out_total |
mongodb_mongod_op_latencies_latency_total |
mongodb_mongod_op_latencies_ops_total |
mongodb_mongod_wiredtiger_cache_bytes |
mongodb_mongod_wiredtiger_cache_bytes_total |
mongodb_mongod_wiredtiger_cache_evicted_total |
mongodb_mongod_wiredtiger_cache_pages |
mongodb_mongod_wiredtiger_concurrent_transactions_out_tickets |
mongodb_mongod_wiredtiger_concurrent_transactions_total_tickets |
mongodb_network_bytes_total |
mongodb_network_metrics_num_requests_total |
mongodb_op_counters_total |
mongodb_up |
net.error.count |
Preparing the Integration
Create Credentials for MongoDB Exporter
If you want to use a non-admin user for the exporter, create a user and grant it the roles required to scrape statistics.
In the mongo shell:
use admin
db.auth("<YOUR-ADMIN-USER>", "<YOUR-ADMIN-PASSWORD>")
db.createUser(
{
user: "<YOUR-EXPORTER-USER>",
pwd: "<YOUR-EXPORTER-PASSWORD>",
roles: [
{ role: "clusterMonitor", db: "admin" },
{ role: "read", db: "admin" },
{ role: "read", db: "local" }
]
}
)
Create Kubernetes Secret for Connection and Authentication
To configure authentication, do the following:
- Create a text file with the connection string (mongodb-uri) for your MongoDB by using these examples:
# Basic authentication
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>
# TLS
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>/admin?tls=true&tlsCertificateKeyFile=/etc/mongodb/mongodb-exporter-key.pem&tlsAllowInvalidCertificates=true&tlsCAFile=/etc/mongodb/mongodb-exporter-ca.pem
# SSL
mongodb://<YOUR-EXPORTER-USER>:<YOUR-EXPORTER-PASSWORD>@<YOUR-MONGODB-HOST>:<PORT>/admin?ssl=true&sslclientcertificatekeyfile=/etc/mongodb/mongodb-exporter-key.pem&sslinsecure=true&sslcertificateauthorityfile=/etc/mongodb/mongodb-exporter-ca.pem
- Create the secret for the connection string:
kubectl create secret -n Your-Exporter-Namespace generic Your-Mongodb-Uri-Secret-Name \
--from-file=mongodb-uri=<route-to-file-with-mongodb-uri.txt>
- In case of TLS or SSL authentication, create the secret with the private key and the certificate authority (CA). If you do not have a CA file, you can use an empty file instead:
kubectl create secret -n Your-Exporter-Namespace generic mongodb-exporter-auth \
--from-file=mongodb-key=<route-to-your-private-key.pem> \
--from-file=mongodb-ca=<route-to-your-ca.pem>
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install the exporter with this Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/mongodb-exporter
Monitoring and Troubleshooting MongoDB
This document describes important metrics and queries that you can use to monitor and troubleshoot MongoDB.
Tracking metrics status
You can track the status of MongoDB metrics with the following alert: Exporter process is not serving metrics
# [MongoDB] Exporter Process Down
absent(mongodb_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: mongodb-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "mongodb"
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
30 - MySQL
This integration is enabled by default.
Versions supported: > v5.7
This integration uses a standalone exporter that is available in UBI or scratch base image.
This integration has 47 metrics.
Timeseries generated: 1005 series per instance
List of Alerts
Alert | Description | Format |
---|---|---|
[MySQL] Mysql Down | MySQL instance is down | Prometheus |
[MySQL] Mysql Restarted | MySQL has just been restarted, less than one minute ago | Prometheus |
[MySQL] Mysql Too many Connections (>80%) | More than 80% of MySQL connections are in use | Prometheus |
[MySQL] Mysql High Threads Running | More than 60% of MySQL connections are in running state | Prometheus |
[MySQL] Mysql High Open Files | More than 80% of MySQL files open | Prometheus |
[MySQL] Mysql Slow Queries | MySQL server has new slow queries | Prometheus |
[MySQL] Mysql Innodb Log Waits | MySQL innodb log writes stalling | Prometheus |
[MySQL] Mysql Slave Io Thread Not Running | MySQL Slave IO thread not running | Prometheus |
[MySQL] Mysql Slave Sql Thread Not Running | MySQL Slave SQL thread not running | Prometheus |
[MySQL] Mysql Slave Replication Lag | MySQL Slave replication lag | Prometheus |
List of Dashboards
MySQL
The dashboard provides information on the status, error rate and resource usage of a MySQL instance.
List of Metrics
Metric name |
---|
mysql_global_status_aborted_clients |
mysql_global_status_aborted_connects |
mysql_global_status_buffer_pool_pages |
mysql_global_status_bytes_received |
mysql_global_status_bytes_sent |
mysql_global_status_commands_total |
mysql_global_status_connection_errors_total |
mysql_global_status_innodb_buffer_pool_read_requests |
mysql_global_status_innodb_buffer_pool_reads |
mysql_global_status_innodb_log_waits |
mysql_global_status_innodb_mem_adaptive_hash |
mysql_global_status_innodb_mem_dictionary |
mysql_global_status_innodb_page_size |
mysql_global_status_questions |
mysql_global_status_select_full_join |
mysql_global_status_select_full_range_join |
mysql_global_status_select_range_check |
mysql_global_status_select_scan |
mysql_global_status_slow_queries |
mysql_global_status_sort_merge_passes |
mysql_global_status_sort_range |
mysql_global_status_sort_rows |
mysql_global_status_sort_scan |
mysql_global_status_table_locks_immediate |
mysql_global_status_table_locks_waited |
mysql_global_status_table_open_cache_hits |
mysql_global_status_table_open_cache_misses |
mysql_global_status_threads_cached |
mysql_global_status_threads_connected |
mysql_global_status_threads_created |
mysql_global_status_threads_running |
mysql_global_status_uptime |
mysql_global_variables_innodb_additional_mem_pool_size |
mysql_global_variables_innodb_log_buffer_size |
mysql_global_variables_innodb_open_files |
mysql_global_variables_key_buffer_size |
mysql_global_variables_max_connections |
mysql_global_variables_open_files_limit |
mysql_global_variables_query_cache_size |
mysql_global_variables_thread_cache_size |
mysql_global_variables_tokudb_cache_size |
mysql_slave_status_master_server_id |
mysql_slave_status_seconds_behind_master |
mysql_slave_status_slave_io_running |
mysql_slave_status_slave_sql_running |
mysql_slave_status_sql_delay |
mysql_up |
Preparing the Integration
Create Credentials for MySQL Exporter
- Create the user and password for the exporter in the database:
CREATE USER 'exporter' IDENTIFIED BY 'YOUR-PASSWORD' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter';
Replace the user name and the password in the SQL statement with your own values.
- Create a mysql-exporter.cnf file with the credentials of the exporter:
[client]
user = exporter
password = "YOUR-PASSWORD"
host=YOUR-DB-IP
- In your cluster, create the secret with the mysql-exporter.cnf file. This file will be mounted in the exporter to authenticate with the database:
kubectl create secret -n Your-Application-Namespace generic mysql-exporter \
--from-file=.my.cnf=./mysql-exporter.cnf
Using SSL Authentication
If your database requires SSL authentication, you need to create secrets with the certificates. To do so, create the secret with SSL certificates for the exporter:
kubectl create secret -n Your-Application-Namespace generic mysql-exporter \
--from-file=.my.cnf=./mysql-exporter.cnf \
--from-file=ca.pem=./certs/ca.pem \
--from-file=client-key.pem=./certs/client-key.pem \
--from-file=client-cert.pem=./certs/client-cert.pem
In the mysql-exporter.cnf file, include the following lines to route to the certificates in the exporter:
[client]
user = exporter
password = "YOUR-PASSWORD"
host=YOUR-DB-IP
ssl-ca=/lib/cert/ca.pem
ssl-key=/lib/cert/client-key.pem
ssl-cert=/lib/cert/client-cert.pem
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/mysql-exporter
Monitoring and Troubleshooting MySQL
This document describes important metrics and queries that you can use to monitor and troubleshoot MySQL.
Tracking metrics status
You can track MySQL metrics status with the following alert: Exporter process is not serving metrics
# [MySQL] Exporter Process Down
absent(mysql_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: mysql-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "mysql"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (mysql_global_status_aborted_clients|mysql_global_status_aborted_connects|mysql_global_status_buffer_pool_pages|mysql_global_status_bytes_received|mysql_global_status_bytes_sent|mysql_global_status_commands_total|mysql_global_status_connection_errors_total|mysql_global_status_innodb_buffer_pool_read_requests|mysql_global_status_innodb_buffer_pool_reads|mysql_global_status_innodb_log_waits|mysql_global_status_innodb_mem_adaptive_hash|mysql_global_status_innodb_mem_dictionary|mysql_global_status_innodb_page_size|mysql_global_status_questions|mysql_global_status_select_full_join|mysql_global_status_select_full_range_join|mysql_global_status_select_range_check|mysql_global_status_select_scan|mysql_global_status_slow_queries|mysql_global_status_sort_merge_passes|mysql_global_status_sort_range|mysql_global_status_sort_rows|mysql_global_status_sort_scan|mysql_global_status_table_locks_immediate|mysql_global_status_table_locks_waited|mysql_global_status_table_open_cache_hits|mysql_global_status_table_open_cache_misses|mysql_global_status_threads_cached|mysql_global_status_threads_connected|mysql_global_status_threads_created|mysql_global_status_threads_running|mysql_global_status_uptime|mysql_global_variables_innodb_additional_mem_pool_size|mysql_global_variables_innodb_log_buffer_size|mysql_global_variables_innodb_open_files|mysql_global_variables_key_buffer_size|mysql_global_variables_max_connections|mysql_global_variables_open_files_limit|mysql_global_variables_query_cache_size|mysql_global_variables_thread_cache_size|mysql_global_variables_tokudb_cache_size|mysql_slave_status_master_server_id|mysql_slave_status_seconds_behind_master|mysql_slave_status_slave_io_running|mysql_slave_status_slave_sql_running|mysql_slave_status_sql_delay|mysql_up)
action: keep
31 - NGINX
This integration is enabled by default.
Versions supported: > v12
This integration uses a sidecar exporter that is available in UBI or scratch base image.
This integration has 12 metrics.
Timeseries generated: 8 series per nginx container
List of Alerts
Alert | Description | Format |
---|---|---|
[Nginx] No Instances Up | No NGINX instances up | Prometheus |
List of Dashboards
Nginx
The dashboard provides information on the status of the NGINX server and Golden Signals.
List of Metrics
Metric name |
---|
net.bytes.in |
net.bytes.out |
net.http.error.count |
net.http.request.count |
net.http.request.time |
nginx_connections_accepted |
nginx_connections_active |
nginx_connections_handled |
nginx_connections_reading |
nginx_connections_waiting |
nginx_connections_writing |
nginx_up |
Preparing the Integration
Enable Nginx stub_status Module
The exporter can be installed as a sidecar of the pod with the Nginx server. To make Nginx expose an endpoint for scraping metrics, enable the stub_status module. If your Nginx configuration is defined inside a Kubernetes ConfigMap, add the following snippet to enable the stub_status module:
server {
listen 80;
server_name localhost;
location /nginx_status {
stub_status on;
access_log on;
allow all; # REPLACE with your access policy
}
}
This is how the ConfigMap would look after adding this snippet:
apiVersion: v1
kind: ConfigMap
metadata:
name: nginx-config
data:
nginx.conf: |
server {
listen 80;
server_name localhost;
location /nginx_status {
stub_status on;
access_log on;
allow all; # REPLACE with your access policy
}
}
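With stub_status exposed at /nginx_status, the sidecar exporter only needs to be pointed at that URL. Below is a minimal sketch of a sidecar container added alongside the NGINX container if you wire it up manually, assuming the upstream nginx-prometheus-exporter image and its --nginx.scrape-uri flag; the image tag and metrics port are illustrative.
# Hypothetical sidecar container added to the NGINX pod template
- name: nginx-exporter
  image: nginx/nginx-prometheus-exporter:1.1.0   # illustrative tag
  args:
    - --nginx.scrape-uri=http://localhost:80/nginx_status
  ports:
    - name: metrics
      containerPort: 9113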
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/nginx-exporter
Monitoring and Troubleshooting NGINX
This document describes important metrics and queries that you can use to monitor and troubleshoot NGINX.
Tracking metrics status
You can track NGINX metrics status with the following alert: Exporter process is not serving metrics
# [NGINX] Exporter Process Down
absent(nginx_up{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: nginx-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "nginx"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
32 - NGINX Ingress
This integration is enabled by default.
Versions supported: > v1.9.0
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 42 metrics.
Timeseries generated: 1500
This integration specifically supports kubernetes/ingress-nginx, and not other NGINX-Ingress versions like nginxinc/kubernetes-ingress.
List of Alerts
Alert | Description | Format |
---|---|---|
[Nginx-Ingress] High Http 4xx Error Rate | Too many HTTP requests with status 4xx (> 5%) | Prometheus |
[Nginx-Ingress] High Http 5xx Error Rate | Too many HTTP requests with status 5xx (> 5%) | Prometheus |
[Nginx-Ingress] High Latency | Nginx p99 latency is higher than 10 seconds | Prometheus |
[Nginx-Ingress] Ingress Certificate Expiry | Nginx Ingress Certificate will expire in less than 14 days | Prometheus |
List of Dashboards
Nginx Ingress
The dashboard provides information on the error rate, resource usage, traffic and certificate expiration of the NGINX ingress.
List of Metrics
Metric name |
---|
go_build_info |
go_gc_duration_seconds |
go_gc_duration_seconds_count |
go_gc_duration_seconds_sum |
go_goroutines |
go_memstats_buck_hash_sys_bytes |
go_memstats_gc_sys_bytes |
go_memstats_heap_alloc_bytes |
go_memstats_heap_idle_bytes |
go_memstats_heap_inuse_bytes |
go_memstats_heap_released_bytes |
go_memstats_heap_sys_bytes |
go_memstats_lookups_total |
go_memstats_mallocs_total |
go_memstats_mcache_inuse_bytes |
go_memstats_mcache_sys_bytes |
go_memstats_mspan_inuse_bytes |
go_memstats_mspan_sys_bytes |
go_memstats_next_gc_bytes |
go_memstats_stack_inuse_bytes |
go_memstats_stack_sys_bytes |
go_memstats_sys_bytes |
go_threads |
nginx_ingress_controller_config_last_reload_successful |
nginx_ingress_controller_config_last_reload_successful_timestamp_seconds |
nginx_ingress_controller_ingress_upstream_latency_seconds_count |
nginx_ingress_controller_ingress_upstream_latency_seconds_sum |
nginx_ingress_controller_nginx_process_connections |
nginx_ingress_controller_nginx_process_cpu_seconds_total |
nginx_ingress_controller_nginx_process_resident_memory_bytes |
nginx_ingress_controller_request_duration_seconds_bucket |
nginx_ingress_controller_request_duration_seconds_count |
nginx_ingress_controller_request_duration_seconds_sum |
nginx_ingress_controller_request_size_sum |
nginx_ingress_controller_requests |
nginx_ingress_controller_response_duration_seconds_count |
nginx_ingress_controller_response_duration_seconds_sum |
nginx_ingress_controller_response_size_sum |
nginx_ingress_controller_ssl_expire_time_seconds |
process_cpu_seconds_total |
process_max_fds |
process_open_fds |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting NGINX Ingress
This document describes important metrics and queries that you can use to monitor and troubleshoot NGINX Ingress.
Tracking metrics status
You can track NGINX Ingress metrics status with the following alert: Exporter process is not serving metrics
# [NGINX Ingress] Exporter Process Down
absent(nginx_ingress_controller_nginx_process_cpu_seconds_total{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: nginx-ingress-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
regex: true
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (controller|nginx-ingress-controller);(.{0}$)
replacement: nginx-ingress
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "nginx-ingress"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (go_build_info|nginx_ingress_controller_config_last_reload_successful|nginx_ingress_controller_config_last_reload_successful_timestamp_seconds|nginx_ingress_controller_ingress_upstream_latency_seconds_count|nginx_ingress_controller_ingress_upstream_latency_seconds_sum|nginx_ingress_controller_nginx_process_connections|nginx_ingress_controller_nginx_process_cpu_seconds_total|process_max_fds|process_open_fds|nginx_ingress_controller_nginx_process_resident_memory_bytes|nginx_ingress_controller_request_duration_seconds_bucket|nginx_ingress_controller_request_duration_seconds_count|nginx_ingress_controller_request_duration_seconds_sum|nginx_ingress_controller_request_size_sum|nginx_ingress_controller_requests|nginx_ingress_controller_response_duration_seconds_count|nginx_ingress_controller_response_duration_seconds_sum|nginx_ingress_controller_response_size_sum|nginx_ingress_controller_ssl_expire_time_seconds|go_gc_duration_seconds|go_gc_duration_seconds_count|go_gc_duration_seconds_sum|go_goroutines|go_memstats_buck_hash_sys_bytes|go_memstats_gc_sys_bytes|go_memstats_heap_alloc_bytes|go_memstats_heap_idle_bytes|go_memstats_heap_inuse_bytes|go_memstats_heap_released_bytes|go_memstats_heap_sys_bytes|go_memstats_lookups_total|go_memstats_mallocs_total|go_memstats_mcache_inuse_bytes|go_memstats_mcache_sys_bytes|go_memstats_mspan_inuse_bytes|go_memstats_mspan_sys_bytes|go_memstats_next_gc_bytes|go_memstats_stack_inuse_bytes|go_memstats_stack_sys_bytes|go_memstats_sys_bytes|go_threads)
action: keep
33 - NTP
This integration is enabled by default.
Versions supported: > v2
This integration uses a standalone exporter that is available in UBI or scratch base image.
This integration has 1 metric.
Timeseries generated: 4 series per node
List of Alerts
Alert | Description | Format |
---|---|---|
[Ntp] Drift is too high | Drift is too high | Prometheus |
List of Dashboards
NTP
The dashboard provides information on the drift of each node.
List of Metrics
Metric name |
---|
ntp_drift_seconds |
Preparing the Integration
No preparations are required for this integration.
Installing
An automated wizard is available in Monitoring Integrations in Sysdig Monitor. Expert users can also install this integration with the following Helm chart: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/ntp-exporter
Monitoring and Troubleshooting NTP
This document describes important metrics and queries that you can use to monitor and troubleshoot NTP.
Tracking metrics status
You can track NTP metrics status with the following alert: Exporter process is not serving metrics
# [NTP] Exporter Process Down
absent(ntp_drift_seconds{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: ntp-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "ntp"
- action: replace
source_labels: [__address__, __meta_kubernetes_pod_annotation_promcat_sysdig_com_port]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_ns]
target_label: kube_namespace_name
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_type]
target_label: kube_workload_type
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_target_workload_name]
target_label: kube_workload_name
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__meta_kubernetes_pod_node_name]
target_label: kube_node_name
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
34 - OPA
This integration is enabled by default.
Versions supported: > v3.5.1
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 10 metrics.
Timeseries generated: 150 series for each Gatekeeper
List of Alerts
Alert | Description | Format |
---|---|---|
[Opa gatekeeper] Too much time since the last audit | More than 120 seconds have passed since the last audit | Prometheus |
[Opa gatekeeper] Spike of violations | There were more than 30 violations | Prometheus |
List of Dashboards
OPA Gatekeeper
The dashboard provides information on the request rate, latency, and violations rate per constraint.
List of Metrics
Metric name |
---|
gatekeeper_audit_duration_seconds_bucket |
gatekeeper_audit_last_run_time |
gatekeeper_constraint_template_ingestion_count |
gatekeeper_constraint_template_ingestion_duration_seconds_bucket |
gatekeeper_constraint_templates |
gatekeeper_constraints |
gatekeeper_request_count |
gatekeeper_request_duration_seconds_bucket |
gatekeeper_request_duration_seconds_count |
gatekeeper_violations |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OPA
This document describes important metrics and queries that you can use to monitor and troubleshoot OPA.
Tracking metrics status
You can track OPA metrics status with the following alert: Exporter process is not serving metrics
# [OPA] Exporter Process Down
absent(gatekeeper_request_count{kube_cluster_name=~$cluster,kube_namespace_name=~$namespace,kube_workload_name=~$workload}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: opa-default
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (manager);(.{0}$)
replacement: opa-gatekeeper
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "opa-gatekeeper"
- action: keep
source_labels:
- __meta_kubernetes_pod_container_port_name
regex: "metrics"
- action: replace
source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
target_label: __scheme__
regex: (https?)
- action: replace
source_labels: [__address__,__meta_kubernetes_pod_container_port_name]
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
35 - OpenShift API-Server
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
Versions supported: > v4.8
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 17 metrics.
Timeseries generated: API Server generates ~5k timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift API Server] Deprecated APIs | API-Server Deprecated APIs | Prometheus |
[OpenShift API Server] Certificate Expiry | API-Server Certificate Expiry | Prometheus |
[OpenShift API Server] Admission Controller High Latency | API-Server Admission Controller High Latency | Prometheus |
[OpenShift API Server] Webhook Admission Controller High Latency | API-Server Webhook Admission Controller High Latency | Prometheus |
[OpenShift API Server] High 4xx RequestError Rate | API-Server High 4xx Request Error Rate | Prometheus |
[OpenShift API Server] High 5xx RequestError Rate | API-Server High 5xx Request Error Rate | Prometheus |
[OpenShift API Server] High Request Latency | API-Server High Request Latency | Prometheus |
List of Dashboards
OpenShift v4 API Server
The dashboard provides information on the K8s API Server and OpenShift API Server.
List of Metrics
Metric name |
---|
apiserver_admission_controller_admission_duration_seconds_count |
apiserver_admission_controller_admission_duration_seconds_sum |
apiserver_admission_webhook_admission_duration_seconds_count |
apiserver_admission_webhook_admission_duration_seconds_sum |
apiserver_client_certificate_expiration_seconds_bucket |
apiserver_client_certificate_expiration_seconds_count |
apiserver_request_duration_seconds_count |
apiserver_request_duration_seconds_sum |
apiserver_request_total |
apiserver_requested_deprecated_apis |
apiserver_response_sizes_count |
apiserver_response_sizes_sum |
go_goroutines |
process_cpu_seconds_total |
process_resident_memory_bytes |
workqueue_adds_total |
workqueue_depth |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift API Server
Because OpenShift 4.X comes with both Prometheus and API servers ready to use, no additional installation is required. The OpenShift API server metrics are exposed using the /federate endpoint.
Learning how to monitor the Kubernetes API server is vital when running Kubernetes in production. Monitoring kube-apiserver will help you detect and troubleshoot latency and errors, and validate whether the service performs as expected.
Here are some interesting queries to run and metrics to monitor for troubleshooting the OpenShift API Server.
Deprecated APIs
To check if deprecated API versions are used, use the following query:
sum by (kube_cluster_name, resource, removed_release,version)(apiserver_requested_deprecated_apis)
Certificate Expiration
Certificates are used to authenticate to the API server, and you can check with the following query if a certificate is expiring next week:
apiserver_client_certificate_expiration_seconds_count > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 7*24*60*60
API Server Latency
A latency spike is typically a sign of overload in the API server: your cluster may be under high load and the API server may need to be scaled out. Use the following query to check for latency spikes in the last 10 minutes.
sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_sum{verb!="WATCH"}[10m]))/sum by (kube_cluster_name,verb,apiserver)(rate(apiserver_request_duration_seconds_count{verb!="WATCH"}[10m]))
Request Error Rate
A request error rate means that the API server is responding with 5xx errors. Check the CPU and memory of your api-server pods.
sum by(kube_cluster_name)(rate(apiserver_request_total{code=~"5..",kube_cluster_name=~$cluster}[5m])) / sum by(kube_cluster_name)(rate(apiserver_request_total{kube_cluster_name=~$cluster}[5m])) > 0.05
Agent Configuration
This is the default agent job for this integration:
- job_name: openshift-apiserver-default
honor_labels: true
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"apiserver_request_total|apiserver_request_duration_seconds_sum|apiserver_request_duration_seconds_count|workqueue_adds_total|workqueue_depth|apiserver_response_sizes_sum|apiserver_response_sizes_count|apiserver_requested_deprecated_apis|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_tls_handshake_errors_total|go_goroutines|process_resident_memory_bytes|process_cpu_seconds_total",code!="0"}'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'openshift-monitoring/prometheus-k8s-0'
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (apiserver_request_total|apiserver_request_duration_seconds_sum|apiserver_request_duration_seconds_count|workqueue_adds_total|workqueue_depth|apiserver_response_sizes_sum|apiserver_response_sizes_count|apiserver_requested_deprecated_apis|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_admission_controller_admission_duration_seconds_sum|apiserver_admission_controller_admission_duration_seconds_count|apiserver_admission_webhook_admission_duration_seconds_sum|apiserver_admission_webhook_admission_duration_seconds_count|apiserver_tls_handshake_errors_total|go_goroutines|process_resident_memory_bytes|process_cpu_seconds_total)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
- action: replace
source_labels: [pod]
target_label: kube_pod_name
36 - OpenShift Controller Manager
This integration is enabled by default.
Versions supported: > v4.8
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 12 metrics.
Timeseries generated: Controller Manager generates ~650 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift Controller Manager] Process Down | Controller Manager has disappeared from target discovery. | Prometheus |
[OpenShift Controller Manager] High 4xx RequestError Rate | OpenShift Controller Manager High 4xx Request Error Rate | Prometheus |
[OpenShift Controller Manager] High 5xx RequestError Rate | OpenShift Controller Manager High 5xx Request Error Rate | Prometheus |
List of Dashboards
OpenShift v4 Controller Manager
The dashboard provides information on the K8s and OpenShift Controller Manager.
List of Metrics
Metric name |
---|
go_goroutines |
rest_client_requests_total |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
workqueue_adds_total |
workqueue_depth |
workqueue_queue_duration_seconds_count |
workqueue_queue_duration_seconds_sum |
workqueue_retries_total |
workqueue_unfinished_work_seconds |
workqueue_work_duration_seconds_count |
workqueue_work_duration_seconds_sum |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift Controller Manager
Because OpenShift 4.X comes with both Prometheus and Controller Manager ready to use, no additional installation is required. The OpenShift Controller Manager metrics are exposed using a federated endpoint.
Here are some interesting queries to run and metrics to monitor for troubleshooting the OpenShift Controller Manager.
Work Queue
Work Queue Retries
The total number of retries that have been handled by the work queue. This value should be near 0.
topk(30,rate(workqueue_retries_total{job="openshift-controller-default"}[10m]))
Work Queue Latency
Queue latency is the time tasks spend in the queue before being processed.
topk(30,rate(workqueue_queue_duration_seconds_sum{job="openshift-controller-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job="openshift-controller-default"}[10m]))
Work Queue Depth
This query checks the depth of the queue. High values can indicate the saturation of the controller manager.
topk(30,rate(workqueue_depth{job="openshift-controller-default"}[10m]))
Controller Manager API Requests
Kube API Requests By Code
Check that there are no 5xx or 4xx error codes in the controller manager requests.
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{job="openshift-controller-default",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{job="openshift-controller-default",code=~"5.."}[10m]))
Agent Configuration
This is the default agent job for this integration:
- job_name: openshift-controller-manager-default
honor_labels: true
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"kube-controller-manager|controller-manager",__name__=~"workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_count|workqueue_work_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_work_duration_seconds_sum|workqueue_depth|workqueue_adds_total|rest_client_requests_total|go_goroutines"}'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'openshift-monitoring/prometheus-k8s-1'
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
# Remove extended labelset
- action: replace
replacement: true
target_label: sysdig_omit_source
metric_relabel_configs:
- source_labels: [__name__]
regex: (go_goroutines|rest_client_requests_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
- action: replace
source_labels: [pod]
target_label: kube_pod_name
- source_labels: [job]
target_label: controller
- source_labels: [job]
action: replace
regex: (.*)
target_label: job
replacement: 'openshift-controller-default'
- action: replace
source_labels: [controller]
regex: '(controller-manager)'
target_label: controller
replacement: 'openshift-$1'
37 - OpenShift CoreDNS
This integration is enabled by default.
Versions supported: > v4.8
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 13 metrics.
Timeseries generated: CoreDNS generates ~230 timeseries per dns-default pod
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift CoreDNS] Process Down | CoreDNS has disappeared from target discovery. | Prometheus |
[OpenShift CoreDNS] High Failed Responses | CoreDNS is returning failed responses. | Prometheus |
[OpenShift CoreDNS] High Latency | CoreDNS responses latency is higher than 60ms. | Prometheus |
[OpenShift CoreDNS] Panics Observed | CoreDNS Panics Observed. | Prometheus |
List of Dashboards
OpenShift v4 CoreDNS
The dashboard provides information on the OpenShift CoreDNS.
List of Metrics
Metric name |
---|
coredns_cache_hits_total |
coredns_cache_misses_total |
coredns_dns_request_duration_seconds_bucket |
coredns_dns_request_size_bytes_bucket |
coredns_dns_requests_total |
coredns_dns_response_size_bytes_bucket |
coredns_dns_responses_total |
coredns_forward_request_duration_seconds_bucket |
coredns_panics_total |
coredns_plugin_enabled |
go_goroutines |
process_cpu_seconds_total |
process_resident_memory_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift CoreDNS
Because OpenShift 4.X comes with both Prometheus and CoreDNS ready to use, no additional installation is required. OpenShift CoreDNS metrics are exposed on the SSL port 9154.
Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift 4.
CoreDNS Panics
Number of Panics
To check the CoreDNS number of panics, use the following query:
sum(coredns_panics_total)
See the CoreDNS pods logs when you see this number growing.
DNS Requests
By Type
To filter DNS request types, use the following query:
(sum(rate(coredns_dns_requests_total[$__interval])) by (type,kube_cluster_name,kube_pod_name))
By Protocol
To filter DNS request types by protocol, use the following query:
(sum(rate(coredns_dns_requests_total[$__interval]) ) by (proto,kube_cluster_name,kube_pod_name))
By Zone
To filter DNS request types by zone, use the following query:
(sum(rate(coredns_dns_requests_total[$__interval]) ) by (zone,kube_cluster_name,kube_pod_name))
By Latency
This metric detects any degradation in the service. With the following query, you can compare the 99th percentile against the average.
histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by(server, zone, le))
Error Rate
Watch this metric carefully, as you can filter by status code: 200, 404, 400, 500.
sum by (server, status)(coredns_dns_https_responses_total)
Cache
Cache Hit
To check the cache hit rate, use the following query:
sum(rate(coredns_cache_hits_total[$__interval])) by (type,kube_cluster_name,kube_pod_name)
Cache Miss
To check the cache miss rate, use the following query:
sum(rate(coredns_cache_misses_total[$__interval])) by(server,kube_cluster_name,kube_pod_name)
Agent Configuration
This is the default agent job for this integration:
- job_name: openshift-dns-default
honor_labels: true
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'openshift-dns/dns-default.+'
- source_labels:
- __address__
action: keep
regex: (.*:9154)
- source_labels:
- __meta_kubernetes_pod_name
action: replace
target_label: instance
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (coredns_cache_hits_total|coredns_cache_misses_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_requests_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|coredns_forward_request_duration_seconds_bucket|coredns_panics_total|coredns_plugin_enabled|go_goroutines|process_cpu_seconds_total|process_resident_memory_bytes)
action: keep
38 - OpenShift Etcd
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
Versions supported: > v4.8
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 32 metrics.
Timeseries generated: Etcd generates ~1200 timeseries per etcd-ip pod
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members. | Prometheus |
[OpenShiftEtcd] Etcd No Leader | Member has no leader. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests | Prometheus |
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance. | Prometheus |
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high. | Prometheus |
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Failed HTTP Requests | High number of failed HTTP requests | Prometheus |
[OpenShiftEtcd] Etcd HTTP Requests Slow | There are slow HTTP requests. | Prometheus |
[OpenShiftEtcd] Etcd Excessive Database Growth | Etcd cluster database is growing very fast. | Prometheus |
List of Dashboards
OpenShift v4 Etcd
The dashboard provides information on the OpenShift Etcd.
List of Metrics
Metric name |
---|
etcd_debugging_mvcc_db_total_size_in_bytes |
etcd_disk_backend_commit_duration_seconds_bucket |
etcd_disk_wal_fsync_duration_seconds_bucket |
etcd_grpc_proxy_cache_hits_total |
etcd_grpc_proxy_cache_misses_total |
etcd_http_failed_total |
etcd_http_received_total |
etcd_http_successful_duration_seconds_bucket |
etcd_mvcc_db_total_size_in_bytes |
etcd_network_client_grpc_received_bytes_total |
etcd_network_client_grpc_sent_bytes_total |
etcd_network_peer_received_bytes_total |
etcd_network_peer_received_failures_total |
etcd_network_peer_round_trip_time_seconds_bucket |
etcd_network_peer_sent_bytes_total |
etcd_network_peer_sent_failures_total |
etcd_server_has_leader |
etcd_server_id |
etcd_server_leader_changes_seen_total |
etcd_server_proposals_applied_total |
etcd_server_proposals_committed_total |
etcd_server_proposals_failed_total |
etcd_server_proposals_pending |
etcd_server_quota_backend_bytes |
go_goroutines |
grpc_server_handled_total |
grpc_server_handling_seconds_bucket |
grpc_server_started_total |
process_max_fds |
process_open_fds |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift Etcd
Because OpenShift 4.X comes with both Prometheus and Etcd ready to use, no additional installation is required. OpenShift Etcd metrics are exposed using the /federate endpoint.
Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift Etcd.
Etcd Consensus & Leader
Problems in the leader and consensus of the etcd cluster can cause outages in the cluster.
Etcd Leader
- If a member does not have a leader, it is totally unavailable.
- If all the members in a cluster do not have any leader, the entire cluster is totally unavailable.
Check the leader using this query:
count(etcd_server_id) % 2
If the query returns 1, etcd has a leader.
Leader Changes
Rapid leadership changes impact the performance of etcd significantly and it can also mean that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.
Check for leader changes in the last hour:
max(increase(etcd_server_leader_changes_seen_total[60m]))
Failed Proposals
Check if etcd has failed proposals. Failing proposals are caused by two issues:
- Temporary failures related to a leader election
- Longer downtime caused by a loss of quorum in the cluster
max(rate(etcd_server_proposals_failed_total[60m]))
Pending Proposals
Rising pending proposals suggests that client load is high or the member cannot commit proposals.
sum(etcd_server_proposals_pending)
Total Number of Consensus Proposals Committed
The etcd server applies every committed proposal asynchronously.
Check that the difference between committed and applied proposals stays small (within a few thousand), even under high load:
- If the difference between them continues to rise, the etcd server is overloaded.
- This might happen when applying expensive queries like heavy range queries or large txn operations.
Proposals committed:
sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)
Proposals applied:
sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)
gRPC
Error Rate
Check the gRPC error rate. These errors are most likely related to networking issues.
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name,kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name,kube_pod_name)
gRPC Traffic
Check for unusual spikes in the traffic. They could be related to networking issues.
rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])
Disk
Disk Sync
Check if the fsync and commit latencies are below limits:
- High disk operation latencies often indicate disk issues.
- They may cause high request latency or make the cluster unstable.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
DB Size
Check whether the DB size keeps increasing. If it does, defragment etcd to decrease the DB size.
etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}
Networking Between Peers
This is only applicable to multi-master clusters.
Errors from / to Peer
Check the total number of failures sent from peers:
rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])
Check the total number of failures received by peers:
rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])
Agent Configuration
This is the default agent job for this integration:
- job_name: openshift-etcd-default
honor_labels: true
scheme: https
bearer_token_file: /run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"etcd"}'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'openshift-monitoring/prometheus-k8s-1'
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
# Remove extended labelset
- action: replace
replacement: true
target_label: sysdig_omit_source
metric_relabel_configs:
- source_labels: [__name__]
regex: (etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_network_peer_sent_failures_total|etcd_server_has_leader|etcd_server_id|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|process_max_fds|process_open_fds|etcd_mvcc_db_total_size_in_bytes|etcd_server_quota_backend_bytes)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
- action: replace
source_labels: [pod]
target_label: kube_pod_name
- action: replace
source_labels: [endpoint]
target_label: container_name
39 - OpenShift Kubelet
This integration is disabled by default. Please contact Sysdig Support to enable it in your account.
Versions supported: > v4.7
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 25 metrics.
Timeseries generated: Kubelet generates ~1200 timeseries per node
List of Alerts
Alert | Description | Format |
---|---|---|
[openshift-kubelet] Kubelet Too Many Pods | Kubelet Too Many Pods | Prometheus |
[openshift-kubelet] Kubelet Pod Lifecycle Event Generator Duration High | Kubelet Pod Lifecycle Event Generator Duration High | Prometheus |
[openshift-kubelet] Kubelet Pod StartUp Latency High | Kubelet Pod StartUp Latency High | Prometheus |
[openshift-kubelet] Kubelet Down | Kubelet Down | Prometheus |
List of Dashboards
OpenShift v4 Kubelet
The dashboard provides information on the OpenShift Kubelet.
List of Metrics
Metric name |
---|
go_goroutines |
kube_node_status_capacity_pods |
kube_node_status_condition |
kubelet_cgroup_manager_duration_seconds_bucket |
kubelet_cgroup_manager_duration_seconds_count |
kubelet_node_config_error |
kubelet_pleg_relist_duration_seconds_bucket |
kubelet_pleg_relist_interval_seconds_bucket |
kubelet_pod_start_duration_seconds_bucket |
kubelet_pod_start_duration_seconds_count |
kubelet_pod_worker_duration_seconds_bucket |
kubelet_pod_worker_duration_seconds_count |
kubelet_running_containers |
kubelet_running_pod_count |
kubelet_running_pods |
kubelet_runtime_operations_duration_seconds_bucket |
kubelet_runtime_operations_errors_total |
kubelet_runtime_operations_total |
process_cpu_seconds_total |
process_resident_memory_bytes |
rest_client_request_duration_seconds_bucket |
rest_client_requests_total |
storage_operation_duration_seconds_bucket |
storage_operation_duration_seconds_count |
volume_manager_total_volumes |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Agent Configuration
This integration has no default agent job.
40 - OpenShift Scheduler
This integration is enabled by default.
Versions supported: > v4.7
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 20 metrics.
Timeseries generated: Scheduler generates ~300 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift Scheduler] Process Down | Scheduler has disappeared from target discovery. | Prometheus |
[OpenShift Scheduler] Failed Attempts to Schedule Pods | Scheduler Failed Attempts to Schedule Pods. | Prometheus |
[OpenShift Scheduler] High 4xx RequestError Rate | Scheduler High 4xx Request Error Rate. | Prometheus |
[OpenShift Scheduler] High 5xx RequestError Rate | Scheduler High 5xx Request Error Rate. | Prometheus |
List of Dashboards
OpenShift v4 Scheduler
The dashboard provides information on the OpenShift Scheduler.
List of Metrics
Metric name |
---|
go_goroutines |
rest_client_request_duration_seconds_count |
rest_client_request_duration_seconds_sum |
rest_client_requests_total |
scheduler_e2e_scheduling_duration_seconds_count |
scheduler_e2e_scheduling_duration_seconds_sum |
scheduler_pending_pods |
scheduler_pod_scheduling_attempts_count |
scheduler_pod_scheduling_attempts_sum |
scheduler_schedule_attempts_total |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
workqueue_adds_total |
workqueue_depth |
workqueue_queue_duration_seconds_count |
workqueue_queue_duration_seconds_sum |
workqueue_retries_total |
workqueue_unfinished_work_seconds |
workqueue_work_duration_seconds_count |
workqueue_work_duration_seconds_sum |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
How to monitor OpenShift Scheduler with Sysdig agent
No further installation is needed, since OpenShift 4.X comes with both Prometheus and Scheduler ready to use. OpenShift Scheduler metrics are exposed using the /federate endpoint.
Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Scheduler.
Scheduling
Failed attempts to Schedule pods
Unschedulable pods mean that a pod could not be scheduled. Use this query to check for failed attempts:
sum by (kube_cluster_name,kube_pod_name,result) (rate(scheduler_schedule_attempts_total{result!~"scheduled"}[10m])) / ignoring(result) group_left sum by (kube_cluster_name,kube_pod_name)(rate(scheduler_schedule_attempts_total[10m]))
Pending pods
Check that there are no pods in pending queues with this query:
topk(30,rate(scheduler_pending_pods[10m]))
Work Queue
Work Queue Retries
The total number of retries that have been handled by the work queue. This value should be near 0.
topk(30,rate(workqueue_retries_total{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))
Work Queue Latency
Queue latency is the time tasks spend in the queue before being processed.
topk(30,rate(workqueue_queue_duration_seconds_sum{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))
Work Queue Depth
Check the depth of the queue. High values can indicate the saturation of the controller manager.
topk(30,rate(workqueue_depth{container_name=~".*kube-scheduler.*"}[10m]))
Scheduler API Requests
Kube API Requests by code
Check that there are no 5xx or 4xx error codes in the scheduler requests.
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"5.."}[10m]))
Agent Configuration
This is the default agent job for this integration:
- job_name: openshift-scheduler-default
honor_labels: true
scheme: https
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
insecure_skip_verify: true
metrics_path: '/federate'
params:
'match[]':
- '{job=~"scheduler",__name__=~"scheduler_schedule_attempts_total|scheduler_pod_scheduling_attempts_sum|scheduler_pod_scheduling_attempts_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_e2e_scheduling_duration_seconds_count|scheduler_pending_pods|workqueue_retries_total|workqueue_work_duration_seconds_sum|workqueue_work_duration_seconds_count|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_sum|workqueue_queue_duration_seconds_count|workqueue_depth|workqueue_adds_total|rest_client_requests_total|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count|go_goroutines"}'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: keep
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_pod_name
separator: '/'
regex: 'openshift-monitoring/prometheus-k8s-0'
# Holding on to pod-id and container name so we can associate the metrics
# with the container (and cluster hierarchy)
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
# Remove extended labelset
- action: replace
replacement: true
target_label: sysdig_omit_source
metric_relabel_configs:
- source_labels: [__name__]
regex: (go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
- action: replace
source_labels: [pod]
target_label: kube_pod_name
- action: replace
source_labels: [container]
target_label: container_name
- action: replace
source_labels: [job]
regex: '(.*)'
target_label: job
replacement: 'openshift-$1-default'
41 - OpenShift State Metrics
This integration is enabled by default.
Versions supported: > v4.7
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 4 metrics.
Timeseries generated: 30 timeseries + 4 series per route
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift-state-metrics] CPU Resource Request Quota Usage | Resource request CPU usage is over 90% resource quota. | Prometheus |
[OpenShift-state-metrics] CPU Resource Limit Quota Usage | Resource limit CPU usage is over 90% resource limit quota. | Prometheus |
[OpenShift-state-metrics] Memory Resource Request Quota Usage | Resource request memory usage is over 90% resource quota. | Prometheus |
[OpenShift-state-metrics] Memory Resource Limit Quota Usage | Resource limit memory usage is over 90% resource limit quota. | Prometheus |
[OpenShift-state-metrics] Routes with issues | A route status is in error and is having issues. | Prometheus |
[OpenShift-state-metrics] Build Processes with issues | A build process is in error or failed status. | Prometheus |
List of Dashboards
OpenShift v4 State Metrics
The dashboard provides information on the special OpenShift-state-metrics.
List of Metrics
Metric name |
---|
openshift_build_created_timestamp_seconds |
openshift_build_status_phase_total |
openshift_clusterresourcequota_usage |
openshift_route_status |
Preparing the Integration
No preparations are required for this integration.
Installing
The installation of an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift State Metrics
No further installation is needed, since OKD4 comes with both Prometheus and OSM ready to use.
Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.
Resource Quotas
Resource Quotas Requests
% CPU Used vs Request Quota
Let's get the percentage of CPU used versus the request quota.
sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0
% Memory Used vs Request Quota
Now, the same but for memory.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="hard"}) > 0
These queries return one time series for each resource quota deployed in the cluster.
Please note that if your requests are near 100%, you can use the Pod Rightsizing & Workload Capacity Optimization dashboard to fix it, or talk to your cluster administrator to review the resource quota. Also, if request usage is very low, the resource quota could be rightsized.
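As a reference, here is a hedged sketch of a threshold expression in the spirit of the quota usage alerts above; the 0.9 (90%) threshold is only an assumption you may want to tune:
sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0.9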
Resource Quotas Limits
% CPU Used vs Limit Quota
Let's get the percentage of CPU used versus the limit quota.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="hard"}) > 0
% Memory Used vs Limit Quota
Now, the same but for memory.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0
These queries return one time series for each resource quota deployed in the cluster.
Please note that quota limits are normally higher than quota requests. If your limit usage is too close to 100%, you might face scheduling issues; the Pod Scheduling Troubleshooting dashboard can help you troubleshoot this scenario. Also, if limit usage is very low, the resource quota could be rightsized.
Routes
List the Routes
Let's get a list of all the routes present in the cluster, aggregated by route, host, and namespace:
sum by (route, host, namespace) (openshift_route_info)
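If you only need a count of routes per namespace rather than the full list, a simple sketch (assuming the namespace label shown in the query above) is:
count by (namespace) (openshift_route_info)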
Duplicated Routes
Now, let’s find our duplicated routes:
sum by (host) (openshift_route_info) > 1
This query will return the duplicated hosts. If you want the full information for the duplicated routes, try this one:
openshift_route_info * on (host) group_left(host_name) label_replace((sum by (host) (openshift_route_info) > 1), "host_name", "$0", "host", ".+")
Why the label_replace? To get the full information, we need to join the openshift_route_info metric with itself, but since both sides of the join carry exactly the same labels, there is no extra label to join by. The workaround is to use label_replace to create a new host_name label with the content of the host label, so the group_left join has something to match on.
Routes with Issues
Let's get the routes with issues (that is, routes with a False status):
openshift_route_status{status="False"} > 0
Builds
New Builds, by Processing Time
Let's list the new builds by how long they have been processing. This query can be useful to detect slow build processes.
time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)
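To focus only on builds that have been stuck in the new phase for a long time, you can wrap the same expression in a comparison; the 600-second threshold below is just a hypothetical value:
(time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)) > 600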
Builds with Errors
Use this query to get builds that are in failed or error state.
sum by (build, buildconfig, kube_namespace_name, kube_cluster_name) (openshift_build_status_phase_total{build_phase=~"failed|error"}) > 0
Agent Configuration
This is the default agent job for this integration:
- job_name: 'openshift-state-metrics'
tls_config:
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
scheme: https
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
source_labels: [__meta_kubernetes_pod_host_ip]
regex: __HOSTIPS__
- action: drop
source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
regex: true
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: (openshift-state-metrics);(.{0}$)
replacement: openshift-state-metrics
target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
- action: keep
source_labels:
- __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
regex: "openshift-state-metrics"
- action: replace
source_labels: [__address__]
regex: ([^:]+)(?::\d+)?
replacement: $1:8443
target_label: __address__
- action: replace
replacement: true
target_label: sysdig_omit_source
- action: replace
source_labels: [__meta_kubernetes_pod_uid]
target_label: sysdig_k8s_pod_uid
- action: replace
source_labels: [__meta_kubernetes_pod_container_name]
target_label: sysdig_k8s_pod_container_name
metric_relabel_configs:
- source_labels: [__name__]
regex: (openshift_build_created_timestamp_seconds|openshift_build_status_phase_total|openshift_clusterresourcequota_usage|openshift_route_status)
action: keep
- action: replace
source_labels: [namespace]
target_label: kube_namespace_name
42 - PHP-FPM
This integration is enabled by default.
Versions supported: > 7.2
This integration uses a sidecar exporter that is available in UBI or scratch base image.
This integration has 12 metrics.
Timeseries generated: 167 timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Php-Fpm] Percentage of instances low | Less than 75% of instances are up | Prometheus |
[Php-Fpm] Recently rebooted | Instances have been recently rebooted | Prometheus |
[Php-Fpm] Limit of child processes exceeded | The number of child processes has been exceeded | Prometheus |
[Php-Fpm] Reaching limit of queue processes | The buffer of queued requests is reaching its limit | Prometheus |
[Php-Fpm] Too slow requ |