OpenShift Etcd
This integration is disabled by default. See Enable and Disable Integrations to enable it in your account.
Versions supported: > v4.8
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 29 metrics.
Timeseries generated: Etcd generates ~1200 timeseries per etcd-ip pod
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members. | Prometheus |
[OpenShiftEtcd] Etcd No Leader | Member has no leader. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests. | High number of failed gRPC requests. | Prometheus |
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time. | Prometheus |
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance. | Prometheus |
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high. | Prometheus |
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high. | Prometheus |
[OpenShiftEtcd] Etcd Excesive Database Growth | Etcd cluster database is growing very fast. | Prometheus |
List of Dashboards
OpenShift v4 Etcd
If you are using Prometheus Remote Write, add the following metric relabel config so that the _sysdig_integration_openshift_etcd label is set on the metrics.
- action: replace
  source_labels: [ __address__ ]
  target_label: _sysdig_integration_openshift_etcd
  replacement: true
The dashboard provides information on the OpenShift Etcd.
List of Metrics
Metric name |
---|
etcd_debugging_mvcc_db_total_size_in_bytes |
etcd_disk_backend_commit_duration_seconds_bucket |
etcd_disk_wal_fsync_duration_seconds_bucket |
etcd_grpc_proxy_cache_hits_total |
etcd_grpc_proxy_cache_misses_total |
etcd_mvcc_db_total_size_in_bytes |
etcd_network_client_grpc_received_bytes_total |
etcd_network_client_grpc_sent_bytes_total |
etcd_network_peer_received_bytes_total |
etcd_network_peer_received_failures_total |
etcd_network_peer_round_trip_time_seconds_bucket |
etcd_network_peer_sent_bytes_total |
etcd_network_peer_sent_failures_total |
etcd_server_has_leader |
etcd_server_id |
etcd_server_leader_changes_seen_total |
etcd_server_proposals_applied_total |
etcd_server_proposals_committed_total |
etcd_server_proposals_failed_total |
etcd_server_proposals_pending |
etcd_server_quota_backend_bytes |
go_goroutines |
grpc_server_handled_total |
grpc_server_handling_seconds_bucket |
grpc_server_started_total |
process_max_fds |
process_open_fds |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
Prerequisites
None.
Installation
Installing an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift Etcd
Because OpenShift 4.X comes with both Prometheus and API servers ready to use, no additional installation is required. OpenShift Etcd metrics are exposed using the /federate endpoint.
Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift Etcd.
Etcd Consensus & Leader
Problems in the leader and consensus of the etcd cluster can cause outages in the cluster.
Etcd Leader
- If a member does not have a leader, it is totally unavailable.
- If all the members in a cluster do not have any leader, the entire cluster is totally unavailable.
Check the leader using this query:
count(etcd_server_id) % 2
If the query returns 1, etcd has a leader.
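The integration also collects etcd_server_has_leader, which each member sets to 1 when it sees a leader and to 0 otherwise. As a minimal sketch, assuming the kube_cluster_name label used elsewhere on this page is present on the series, the following queries list the members without a leader and count, per cluster, the members that see one:

```promql
# Members that currently report no leader (an empty result means every member sees one)
etcd_server_has_leader == 0

# Number of members per cluster that currently see a leader
sum(etcd_server_has_leader) by (kube_cluster_name)
```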
Leader Changes
Rapid leadership changes significantly degrade etcd performance. They can also indicate that the leader is unstable, perhaps because of network connectivity issues or excessive load on the etcd cluster.
Check for leader changes in the last hour:
max(increase(etcd_server_leader_changes_seen_total[60m]))
Failed Proposals
Check whether etcd has failed proposals. Failed proposals typically have one of two causes:
- Temporary failures related to a leader election
- Longer downtime caused by a loss of quorum in the cluster
max(rate(etcd_server_proposals_failed_total[60m]))
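To see which member is failing proposals rather than only the cluster-wide maximum, the same counter can be broken down per pod. This is a sketch: the 30-minute window mirrors the description of the failed proposals alert in the list of alerts, and the kube_cluster_name and kube_pod_name labels are assumed to be present, as in the gRPC queries on this page.

```promql
sum(increase(etcd_server_proposals_failed_total[30m])) by (kube_cluster_name, kube_pod_name)
```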
Pending Proposals
A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.
sum(etcd_server_proposals_pending)
Total Number of Consensus Proposals Committed
The etcd server applies every committed proposal asynchronously.
Check that the difference between proposals committed and proposals applied stays small, within a few thousand, even under high load:
- If the difference between them continues to rise, the etcd server is overloaded.
- This might happen when applying expensive queries like heavy range queries or large txn operations.
Proposals committed:
sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)
Proposals applied:
sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)
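The gap between the two counters can also be graphed directly by subtracting one from the other. This is a sketch that assumes both series carry identical label sets (they are exposed by the same etcd member) and that the kube_pod_name label is available:

```promql
sum(etcd_server_proposals_committed_total - etcd_server_proposals_applied_total) by (kube_cluster_name, kube_pod_name)
```

A value that stays small and roughly flat is the healthy case; a value that keeps climbing corresponds to the overloaded server described above.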
gRPC
Error Rate
Check the gRPC error rate. These errors are most likely related to networking issues.
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name,kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name,kube_pod_name)
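When the error ratio is non-zero, grouping the failed calls by method and status code can help narrow down the cause. The sketch below reuses the selector from the query above. The grpc_service, grpc_method, and grpc_code labels are the standard labels on grpc_server_handled_total; whether all of them survive relabeling in your environment is an assumption to verify.

```promql
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name, grpc_service, grpc_method, grpc_code)
```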
gRPC Traffic
Check for unusual spikes in the traffic. They could be related to networking issues.
rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])
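The metric list also includes grpc_server_handling_seconds_bucket, which relates to the slow gRPC requests alert. A possible latency query, sketched by analogy with the disk latency queries on this page, is shown below. The 0.99 quantile, the 10-minute window, and the grouping labels are illustrative choices, not the alert's exact expression.

```promql
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[10m])) by (le, kube_cluster_name, kube_pod_name, grpc_service, grpc_method))
```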
Disk
Disk Sync
Check that the fsync and commit latencies are below limits:
- High disk operation latencies often indicate disk issues.
- They can cause high request latency or make the cluster unstable.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
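This page does not state the limits. Upstream etcd guidance commonly cites a 99th percentile below 10 ms for WAL fsync and below 25 ms for backend commit; treat these values as assumed starting points and tune them for your storage. Under that assumption, the following sketch returns only the members that exceed them:

```promql
# WAL fsync p99 above 10 ms (assumed threshold)
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (le, kube_cluster_name, kube_pod_name)) > 0.01

# Backend commit p99 above 25 ms (assumed threshold)
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (le, kube_cluster_name, kube_pod_name)) > 0.025
```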
DB Size
Check whether the DB size keeps increasing. If it does, defragment etcd to reduce the DB size.
etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}
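Because etcd_server_quota_backend_bytes is also collected, the DB size can be read relative to the configured quota. The sketch below assumes the size and quota series share the same label set, and the 4-hour lookback and projection windows are illustrative choices. If only etcd_debugging_mvcc_db_total_size_in_bytes is present in your cluster, substitute that metric name.

```promql
# Fraction of the backend quota currently in use (1 means the quota is reached)
etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes

# Members whose DB size, extrapolated linearly from the last 4 hours, would exceed the quota within 4 hours
predict_linear(etcd_mvcc_db_total_size_in_bytes[4h], 4 * 3600) > etcd_server_quota_backend_bytes
```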
Networking Between Peers
This applies only to multi-master clusters.
Errors from / to Peer
Check the total number of failures sent from peers:
rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])
Check the total number of failures received by peers:
rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])
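Peer latency can be checked alongside the failure counters, since etcd_network_peer_round_trip_time_seconds_bucket is part of the metric list. The following is a sketch only: the 0.99 quantile and the grouping labels are illustrative, and the container_name selector is copied from the queries above.

```promql
histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket{container_name=~".*etcd.*|http"}[10m])) by (le, kube_cluster_name, kube_pod_name))
```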
Agent Configuration
The default agent job for this integration is as follows:
- job_name: openshift-etcd-default
  honor_labels: true
  scheme: https
  bearer_token_file: /run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"etcd"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-1'
  # Holding on to pod-id and container name so we can associate the metrics
  # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [ __address__ ]
    target_label: _sysdig_integration_openshift_etcd
    replacement: true
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_network_peer_sent_failures_total|etcd_server_has_leader|etcd_server_id|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|process_max_fds|process_open_fds|etcd_mvcc_db_total_size_in_bytes|etcd_server_quota_backend_bytes)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - action: replace
    source_labels: [endpoint]
    target_label: container_name
  - action: replace
    target_label: job
    replacement: openshift-etcd-default