OpenShift Etcd

Metrics, Dashboards, Alerts and more for OpenShift Etcd Integration in Sysdig Monitor.

This integration is disabled by default. See Enable and Disable Integrations to enable it in your account.

Versions supported: > v4.8

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 29 metrics.

Timeseries generated: Etcd generates ~1200 time series per etcd pod

List of Alerts

Alert | Description | Format
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members. | Prometheus
[OpenShiftEtcd] Etcd No Leader | Member has no leader. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests. | Prometheus
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time. | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on the etcd instance. | Prometheus
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high. | Prometheus
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high. | Prometheus
[OpenShiftEtcd] Etcd Excessive Database Growth | Etcd cluster database is growing very fast. | Prometheus
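
As a reference, the condition behind an alert like Etcd No Leader can be sketched in PromQL with the etcd_server_has_leader metric from the list below; the exact expressions shipped with the integration may differ:

# Fires when any etcd member reports that it has no leader
max(etcd_server_has_leader) by (kube_cluster_name, kube_pod_name) == 0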

List of Dashboards

OpenShift v4 Etcd

If you are using Prometheus Remote Write, you need to add the following relabeling configuration so that this label is attached to the metrics.


    - action: replace
      source_labels: [ __address__ ]
      target_label: _sysdig_integration_openshift_etcd
      replacement: true
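
In a standard Prometheus setup, this entry goes under write_relabel_configs inside the remote_write block; a minimal sketch, with a placeholder endpoint URL:

    remote_write:
      - url: https://<your-sysdig-endpoint>/prometheus/remote/write
        write_relabel_configs:
          - action: replace
            source_labels: [ __address__ ]
            target_label: _sysdig_integration_openshift_etcd
            replacement: true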

The OpenShift v4 Etcd dashboard provides an overview of the status and performance of the OpenShift etcd cluster.

List of Metrics

Metric name
etcd_debugging_mvcc_db_total_size_in_bytes
etcd_disk_backend_commit_duration_seconds_bucket
etcd_disk_wal_fsync_duration_seconds_bucket
etcd_grpc_proxy_cache_hits_total
etcd_grpc_proxy_cache_misses_total
etcd_mvcc_db_total_size_in_bytes
etcd_network_client_grpc_received_bytes_total
etcd_network_client_grpc_sent_bytes_total
etcd_network_peer_received_bytes_total
etcd_network_peer_received_failures_total
etcd_network_peer_round_trip_time_seconds_bucket
etcd_network_peer_sent_bytes_total
etcd_network_peer_sent_failures_total
etcd_server_has_leader
etcd_server_id
etcd_server_leader_changes_seen_total
etcd_server_proposals_applied_total
etcd_server_proposals_committed_total
etcd_server_proposals_failed_total
etcd_server_proposals_pending
etcd_server_quota_backend_bytes
go_goroutines
grpc_server_handled_total
grpc_server_handling_seconds_bucket
grpc_server_started_total
process_max_fds
process_open_fds
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes

Prerequisites

None.

Installation

Installing an exporter is not required for this integration.

Monitoring and Troubleshooting OpenShift Etcd

Because OpenShift 4.x comes with both Prometheus and API servers ready to use, no additional installation is required. OpenShift Etcd metrics are exposed through the /federate endpoint of the in-cluster Prometheus.

Here are some interesting queries to run and metrics to monitor for troubleshooting OpenShift Etcd.

Etcd Consensus & Leader

Problems with leader election and consensus in the etcd cluster can cause cluster outages.

Etcd Leader

  • If a member does not have a leader, it is totally unavailable.
  • If no member in the cluster has a leader, the entire cluster is totally unavailable.

Check the leader using this query:

count(etcd_server_id) % 2

If the query returns 1, etcd has a leader.
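
You can also check each member individually: etcd_server_has_leader reports 1 when the member has a leader and 0 when it does not.

etcd_server_has_leader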

Leader Changes

Rapid leadership changes significantly impact the performance of etcd. They can also mean that the leader is unstable, perhaps because of network connectivity issues or excessive load on the etcd cluster.

Check for leader changes in the last hour:

max(increase(etcd_server_leader_changes_seen_total[60m]))
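
For reference, the upstream etcd alerting rules treat several leader changes within 15 minutes as excessive. An illustrative condition (the threshold here is an assumption; tune it for your environment):

increase(etcd_server_leader_changes_seen_total[15m]) > 3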

Failed Proposals

Check whether etcd has failed proposals. Failed proposals are caused by two kinds of issues:

  • Temporary failures related to a leader election
  • Longer downtime caused by a loss of quorum in the cluster

Check the failed proposal rate over the last hour:

max(rate(etcd_server_proposals_failed_total[60m]))

Pending Proposals

A rising number of pending proposals suggests that client load is high or that the member cannot commit proposals.

sum(etcd_server_proposals_pending)

Total Number of Consensus Proposals Committed

The etcd server applies every committed proposal asynchronously.

Check that the difference between proposals committed and proposals applied remains small (within a few thousand), even under high load:

  • If the difference between them continues to rise, the etcd server is overloaded.
  • This might happen when applying expensive queries like heavy range queries or large txn operations.

Proposals committed:

sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)

Proposals applied:

sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)
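
To watch the gap directly, subtract applied proposals from committed ones; a sustained rise indicates an overloaded etcd server:

sum(etcd_server_proposals_committed_total) by (kube_cluster_name)
-
sum(etcd_server_proposals_applied_total) by (kube_cluster_name)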

gRPC

Error Rate

Check the gRPC error rate. These errors are most likely related to networking issues.

sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name,kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name,kube_pod_name)
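
You can also track the latency of unary gRPC requests, which is what the Etcd GRPC Requests Slow alert watches; a sketch using the same label filters as the error-rate query:

histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (le,kube_cluster_name,kube_pod_name))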

gRPC Traffic

Check for unusual spikes in the traffic. They could be related to networking issues.

rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])

Disk

Disk Sync

Check if the fsync and commit latencies are below limits:

  • High disk operation latencies often indicate disk issues.
  • They may cause high request latency or make the cluster unstable.
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name))
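
As a rough reference, the upstream etcd documentation recommends keeping the 99th percentile of WAL fsync under about 10 ms and backend commit under about 25 ms on typical hardware. For example, to flag members above the fsync guidance (the 0.01 threshold is that guideline, not a Sysdig default):

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le,kube_cluster_name,kube_pod_name)) > 0.01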

DB Size

Check whether the DB size keeps increasing. If it does, defragment etcd to reduce the DB size.

etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}
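
You can also compare the database size against the backend quota; a ratio approaching 1 means etcd is close to raising a NOSPACE alarm:

etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}
/
etcd_server_quota_backend_bytes{container_name=~".*etcd.*|http"}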

Networking Between Peers

This section is only applicable to multi-master clusters.

Errors from / to Peer

Check the rate of send failures to peers:

rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])

Check the rate of receive failures from peers:

rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])
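
Peer round-trip time is another useful signal when diagnosing network trouble between members:

histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket{container_name=~".*etcd.*|http"}[10m])) by (le,kube_cluster_name,kube_pod_name))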

Agent Configuration

The default agent job for this integration is as follows:

- job_name: openshift-etcd-default
  honor_labels: true
  scheme: https
  bearer_token_file: /run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"etcd"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-1'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
    # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [ __address__ ]
    target_label: _sysdig_integration_openshift_etcd 
    replacement: true    
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (etcd_debugging_mvcc_db_total_size_in_bytes|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_grpc_proxy_cache_hits_total|etcd_grpc_proxy_cache_misses_total|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_received_failures_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_network_peer_sent_failures_total|etcd_server_has_leader|etcd_server_id|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|process_max_fds|process_open_fds|etcd_mvcc_db_total_size_in_bytes|etcd_server_quota_backend_bytes)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - action: replace
    source_labels: [endpoint]
    target_label: container_name
  - action: replace
    target_label: job
    replacement: openshift-etcd-default
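
Note that __HOSTIPS__ is a placeholder that the Sysdig agent substitutes at runtime with the IP addresses of the node it runs on, so each agent only scrapes the prometheus-k8s pod when it is scheduled on the same node.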