OpenShift CoreDNS

This integration is enabled by default.

List of Alerts:

Alert                              Description              Format
[OpenShiftCoreDNS] Error High      High Request Duration    Prometheus
[OpenShiftCoreDNS] Latency High    Latency High             Prometheus

List of Metrics:

  • coredns_cache_hits_total
  • coredns_cache_misses_total
  • coredns_dns_request_duration_seconds_bucket
  • coredns_dns_request_size_bytes_bucket
  • coredns_dns_requests_total
  • coredns_dns_response_size_bytes_bucket
  • coredns_dns_responses_total
  • coredns_forward_request_duration_seconds_bucket
  • coredns_panics_total
  • coredns_plugin_enabled
  • go_goroutines
  • process_cpu_seconds_total
  • process_resident_memory_bytes

How to monitor OpenShift CoreDNS with Sysdig agent

No further installation is needed, since OpenShift 4.x ships with both Prometheus and CoreDNS ready to use. OpenShift CoreDNS metrics are exposed over TLS on port 9154.

Here are some useful metrics and queries for monitoring and troubleshooting CoreDNS on OpenShift 4.

CoreDNS panics

Number of panics

Check the number of CoreDNS panics. If this number is growing, inspect the CoreDNS pod logs.

sum(coredns_panics_total)
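Because coredns_panics_total is a counter, "growing" means that its increase over a time window is above zero. The following is a minimal Python sketch of that check, assuming you already have an ordered list of counter samples; the sample values are invented for illustration. It is a simplification of what the Prometheus increase() function does, since it does not extrapolate to the window edges.

```python
def counter_increase(samples):
    """Total increase of a counter over ordered samples, tolerating resets."""
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the pod restarted and the counter reset to zero,
            # so the whole current value counts as new increase.
            increase += curr
    return increase


# Invented samples: one panic, then a restart (reset to 0), then two more panics.
panic_samples = [0, 0, 1, 1, 0, 2]
print(counter_increase(panic_samples))
```

A result above zero for the chosen window is the signal to look at the CoreDNS pod logs.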

DNS Requests

by type

To break down the DNS request rate by request type, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval])) by (type,kube_cluster_name,kube_pod_name))

by protocol

To break down the DNS request rate by protocol, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval])) by (proto,kube_cluster_name,kube_pod_name))

by zone

To break down the DNS request rate by zone, use the following query:

(sum(rate(coredns_dns_requests_total[$__interval])) by (zone,kube_cluster_name,kube_pod_name))
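The three queries above differ only in the label used for grouping. As a rough illustration of what sum(rate(...)) by (label) computes, here is a Python sketch that turns the first and last counter value of each series into a per-second rate and then sums the rates per label value. The series data and the 60-second window are invented, and the sketch assumes no counter resets inside the window.

```python
from collections import defaultdict


def sum_rate_by(series, label, window_seconds):
    """Approximate sum(rate(...)) by (label) for counters without resets."""
    totals = defaultdict(float)
    for s in series:
        # Per-second rate of this single series over the window.
        per_second = (s["last"] - s["first"]) / window_seconds
        # Series sharing the same label value are summed together.
        totals[s["labels"].get(label, "")] += per_second
    return dict(totals)


# Invented counter values for three coredns_dns_requests_total series.
series = [
    {"labels": {"type": "A", "proto": "udp"}, "first": 100, "last": 400},
    {"labels": {"type": "A", "proto": "tcp"}, "first": 10, "last": 70},
    {"labels": {"type": "AAAA", "proto": "udp"}, "first": 50, "last": 110},
]

print(sum_rate_by(series, "type", 60))   # {'A': 6.0, 'AAAA': 1.0}
print(sum_rate_by(series, "proto", 60))  # {'udp': 6.0, 'tcp': 1.0}
```

The same input produces a different breakdown depending on the grouping label, which is why the type, protocol, and zone views are worth comparing when request volume changes.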

by Latency

This metric is important for detecting any degradation in the service. The following query returns the 99th percentile of request duration, which you can compare against the average.

histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by(server, zone, le))
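To make the comparison between the 99th percentile and the average concrete, the Python sketch below estimates a quantile from cumulative bucket counts by linear interpolation, which is the general approach histogram_quantile takes, and computes the average as total duration divided by request count. The bucket counts are invented, and the average assumes the histogram's _sum and _count series are also collected (they are not in the metric list above).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket): use the lower edge.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative request counts per bucket (upper bounds in seconds).
buckets = [(0.001, 50), (0.01, 90), (0.1, 99), (1.0, 100), (math.inf, 100)]
duration_sum, request_count = 0.8, 100  # hypothetical _sum and _count values

p99 = histogram_quantile(0.99, buckets)
average = duration_sum / request_count
print(f"p99={p99:.3f}s average={average:.3f}s")
```

With this invented data the 99th percentile is about 0.1 s while the average is 0.008 s. A gap of that size indicates that a small share of requests is much slower than the rest, which the average alone would hide.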

Error Rate

Watch this metric carefully. You can filter it by status code (200, 404, 400, or 500).

sum by (server, status) (coredns_dns_https_responses_total)

Cache

Cache hit

To check the cache hit rate use the following query:

sum(rate(coredns_cache_hits_total[$__interval])) by (type,kube_cluster_name,kube_pod_name)

Cache miss

To check the cache miss rate use the following query:

sum(rate(coredns_cache_misses_total[$__interval])) by(server,kube_cluster_name,kube_pod_name)
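The hit and miss rates are easier to interpret when combined into a single hit ratio. Here is a minimal Python sketch of that calculation, assuming you have already read both per-second rates from the queries above; the numbers are invented.

```python
def cache_hit_ratio(hit_rate, miss_rate):
    """Share of cache lookups answered from the cache, or None with no traffic."""
    total = hit_rate + miss_rate
    if total == 0:
        # No lookups in the window, so the ratio is undefined.
        return None
    return hit_rate / total


# Invented rates: 90 hits/s and 10 misses/s.
ratio = cache_hit_ratio(90.0, 10.0)
print(f"cache hit ratio: {ratio:.0%}")
```

A falling ratio at a stable request rate means more lookups are being forwarded upstream, which usually shows up as higher latency as well.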