OpenShift Etcd

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[OpenShiftEtcd] Etcd Insufficient Members | Etcd cluster has insufficient members | Prometheus
[OpenShiftEtcd] Etcd No Leader | Member has no leader | Prometheus
[OpenShiftEtcd] Etcd High Number Of Leader Changes | Leader changes within the last 15 minutes | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed GRPC Requests | High number of failed gRPC requests | Prometheus
[OpenShiftEtcd] Etcd GRPC Requests Slow | gRPC requests are taking too much time | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed Proposals | High number of proposal failures within the last 30 minutes on etcd instance | Prometheus
[OpenShiftEtcd] Etcd High Fsync Durations | 99th percentile fsync durations are too high | Prometheus
[OpenShiftEtcd] Etcd High Commit Durations | 99th percentile commit durations are too high | Prometheus
[OpenShiftEtcd] Etcd High Number Of Failed HTTP Requests | High number of failed HTTP requests | Prometheus
[OpenShiftEtcd] Etcd HTTP Requests Slow | HTTP requests are slow | Prometheus

List of Metrics:

  • etcd_debugging_mvcc_db_total_size_in_bytes
  • etcd_disk_backend_commit_duration_seconds_bucket
  • etcd_disk_wal_fsync_duration_seconds_bucket
  • etcd_grpc_proxy_cache_hits_total
  • etcd_grpc_proxy_cache_misses_total
  • etcd_http_failed_total
  • etcd_http_received_total
  • etcd_http_successful_duration_seconds_bucket
  • etcd_mvcc_db_total_size_in_bytes
  • etcd_network_client_grpc_received_bytes_total
  • etcd_network_client_grpc_sent_bytes_total
  • etcd_network_peer_received_bytes_total
  • etcd_network_peer_received_failures_total
  • etcd_network_peer_round_trip_time_seconds_bucket
  • etcd_network_peer_sent_bytes_total
  • etcd_network_peer_sent_failures_total
  • etcd_server_has_leader
  • etcd_server_id
  • etcd_server_leader_changes_seen_total
  • etcd_server_proposals_applied_total
  • etcd_server_proposals_committed_total
  • etcd_server_proposals_failed_total
  • etcd_server_proposals_pending
  • go_goroutines
  • grpc_server_handled_total
  • grpc_server_handling_seconds_bucket
  • grpc_server_started_total
  • process_max_fds
  • process_open_fds
  • sysdig_container_cpu_cores_used
  • sysdig_container_memory_used_bytes

How to monitor OpenShift Etcd with the Sysdig agent

No further installation is needed, since OpenShift 4.x comes with both Prometheus and Etcd ready to use. OpenShift Etcd metrics are exposed through the Prometheus /federate endpoint.
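
To verify that the metrics are being collected, one quick check is to count the etcd members reporting per cluster; the result should match the number of etcd members in the control plane (typically 3). kube_cluster_name is the Sysdig label used throughout the queries below:

count(etcd_server_has_leader) by (kube_cluster_name)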

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Etcd.

Etcd Consensus & Leader

Problems with leader election and consensus in the etcd cluster can cause cluster outages.

Etcd leader

If a member does not have a leader, it is totally unavailable. If none of the members in the cluster has a leader, the entire cluster is totally unavailable.

Check the leader with this query; if the result is 1, etcd has a leader:

count(etcd_server_id) % 2
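
For a more direct per-member check, the etcd_server_has_leader metric (included in the metric list above) reports 1 while the member has a leader and 0 when it does not:

etcd_server_has_leader{container_name=~".*etcd.*|http"}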

Leader changes

Rapid leadership changes significantly impact etcd performance. They can also indicate that the leader is unstable, perhaps because of network connectivity issues or excessive load on the etcd cluster.

Check for leader changes in the last hour:

max(increase(etcd_server_leader_changes_seen_total[60m]))
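
As an alert-style expression matching the 15-minute window used by the [OpenShiftEtcd] Etcd High Number Of Leader Changes alert, something along these lines can be used; the threshold of 4 changes is an illustrative value borrowed from the upstream etcd mixin, not a Sysdig default:

increase(etcd_server_leader_changes_seen_total[15m]) >= 4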

Failed proposals

Check for failed proposals. They are normally related to one of two issues:

  • Temporary failures related to a leader election
  • Longer downtime caused by a loss of quorum in the cluster

max(rate(etcd_server_proposals_failed_total[60m]))

Pending proposals

Rising pending proposals suggest a high client load or that the member cannot commit proposals:

sum(etcd_server_proposals_pending)
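
Breaking the same metric down per member shows whether the backlog is cluster-wide or confined to a single member:

sum(etcd_server_proposals_pending) by (kube_cluster_name, kube_pod_name)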

Total number of consensus proposals committed

The etcd server applies every committed proposal asynchronously.

Check that the difference between proposals committed and proposals applied is small (within a few thousand, even under high load):

  • If the difference between them continues to rise, it indicates that the etcd server is overloaded.
  • This might happen when applying expensive queries like heavy range queries or large txn operations.

Proposals committed

sum(rate(etcd_server_proposals_committed_total[60m])) by (kube_cluster_name)

Proposals applied

sum(rate(etcd_server_proposals_applied_total[60m])) by (kube_cluster_name)
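
To watch the committed/applied gap described above directly, subtract the two counters; a steadily rising difference indicates an overloaded etcd server:

sum(etcd_server_proposals_committed_total) by (kube_cluster_name) - sum(etcd_server_proposals_applied_total) by (kube_cluster_name)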

gRPC

Error rate

Check the gRPC error rate; these errors are most likely related to networking issues.

sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary",grpc_code!="OK"}[10m])) by (kube_cluster_name, kube_pod_name)
/
sum(rate(grpc_server_handled_total{container_name=~".*etcd.*|http",grpc_type="unary"}[10m])) by (kube_cluster_name, kube_pod_name)

gRPC Traffic

Check for unusual spikes in traffic; they could be related to networking issues.

rate(etcd_network_client_grpc_received_bytes_total[10m])
rate(etcd_network_client_grpc_sent_bytes_total[10m])

Disk

Disk sync

Check that the fsync and commit latencies are below limits:

  • High disk operation latencies often indicate disk issues.
  • They may cause high request latency or make the cluster unstable.

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le, kube_cluster_name, kube_pod_name))
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le, kube_cluster_name, kube_pod_name))
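
For reference, the upstream etcd mixin alerts when the 99th percentile WAL fsync latency exceeds 0.5s or the backend commit latency exceeds 0.25s; treat these thresholds as a starting point to tune, not as Sysdig defaults:

histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m])) by (instance, le, kube_cluster_name, kube_pod_name)) > 0.5
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[10m])) by (instance, le, kube_cluster_name, kube_pod_name)) > 0.25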

DB Size

Check the DB size in case it keeps increasing. You should defragment etcd to decrease the DB size:

etcd_debugging_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"} or etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}
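
To see how quickly the database is growing, and whether a defragmentation had the intended effect, check the change over the last hour; delta (rather than rate) is appropriate here because the metric is a gauge:

delta(etcd_mvcc_db_total_size_in_bytes{container_name=~".*etcd.*|http"}[1h])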

Networking between peers (only if multi-master)

Errors from / to peers

Check the total number of sent failures from peers:

rate(etcd_network_peer_sent_failures_total{container_name=~".*etcd.*|http"}[10m])

Check the total number of received failures from peers:

rate(etcd_network_peer_received_failures_total{container_name=~".*etcd.*|http"}[10m])
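
Round trip time between peers

High peer round-trip time can explain slow consensus even when no send or receive failures are reported. The etcd_network_peer_round_trip_time_seconds_bucket metric from the list above covers this:

histogram_quantile(0.99, sum(rate(etcd_network_peer_round_trip_time_seconds_bucket{container_name=~".*etcd.*|http"}[10m])) by (le, kube_cluster_name, kube_pod_name))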