OpenShift State Metrics

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

Alert | Description | Format
[OpenShift-state-metrics] CPU Resource Request Quota Usage | Resource request CPU usage is over 90% resource quota. | Prometheus
[OpenShift-state-metrics] CPU Resource Limit Quota Usage | Resource limit CPU usage is over 90% resource limit quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Request Quota Usage | Resource request memory usage is over 90% resource quota. | Prometheus
[OpenShift-state-metrics] Memory Resource Limit Quota Usage | Resource limit memory usage is over 90% resource limit quota. | Prometheus
[OpenShift-state-metrics] Routes with issues | A route status is in error and is having issues. | Prometheus
[OpenShift-state-metrics] Buid Processes with issues | A build process is in error or failed status. | Prometheus

List of Dashboards:

  • OpenShift v4 State Metrics

List of Metrics:

  • openshift_build_created_timestamp_seconds
  • openshift_build_status_phase_total
  • openshift_clusterresourcequota_usage
  • openshift_route_status

How to monitor OpenShift State Metrics with Sysdig agent

No further installation is needed: OKD4 ships with both Prometheus and OpenShift State Metrics (OSM) ready to use.

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.

Resource Quotas

Resource Quotas Requests:

% CPU used vs request quota

The following query returns the CPU requests in use as a fraction of the request quota (multiply by 100 for a percentage).

sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0

% Memory used vs request quota

The same query for memory:

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Note that if your requests are near 100% of the quota, the Pod Rightsizing & Workload Capacity Optimization dashboard can help you reduce them. You can also ask your cluster administrator to review your resource quota. Conversely, if request usage is very low, the resource quota itself could be rightsized.
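To make the ratio concrete, here is a minimal Python sketch of what the request quota queries compute: used divided by hard for each (name, kube_cluster_name) pair. The sample values, the quota names, and the helper quota_usage_ratio are hypothetical; the 0.9 threshold mirrors the 90% mentioned in the alert descriptions.

```python
from collections import defaultdict

# Hypothetical samples shaped like openshift_clusterresourcequota_usage series:
# (labels, value). Quota names and numbers are made up for illustration.
SAMPLES = [
    ({"name": "team-a", "kube_cluster_name": "prod", "resource": "requests.cpu", "type": "used"}, 9.5),
    ({"name": "team-a", "kube_cluster_name": "prod", "resource": "requests.cpu", "type": "hard"}, 10.0),
    ({"name": "team-b", "kube_cluster_name": "prod", "resource": "requests.cpu", "type": "used"}, 2.0),
    ({"name": "team-b", "kube_cluster_name": "prod", "resource": "requests.cpu", "type": "hard"}, 8.0),
]


def quota_usage_ratio(samples, resource):
    """Return {(name, cluster): used / hard} for one resource."""
    totals = defaultdict(lambda: {"used": 0.0, "hard": 0.0})
    for labels, value in samples:
        if labels["resource"] != resource or labels["type"] not in ("used", "hard"):
            continue
        key = (labels["name"], labels["kube_cluster_name"])
        totals[key][labels["type"]] += value  # sum by (name, kube_cluster_name)
    # Simplification: quotas with a zero hard limit are skipped here.
    # The trailing "> 0" of the query is reproduced by the ratio check.
    return {
        key: t["used"] / t["hard"]
        for key, t in totals.items()
        if t["hard"] > 0 and t["used"] / t["hard"] > 0
    }


ratios = quota_usage_ratio(SAMPLES, "requests.cpu")
# Quotas above the 90% mark used by the alerts.
over_90 = sorted(key for key, ratio in ratios.items() if ratio > 0.9)
print(over_90)
```

Passing "requests.memory" instead of "requests.cpu" would mirror the memory query in the same way.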

Resource Quotas Limits:

% CPU used vs limit quota

The following query returns the CPU limits in use as a fraction of the limit quota (multiply by 100 for a percentage).

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="hard"}) > 0

% Memory used vs limit quota

The same query for memory:

sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0

These queries return one time series for each resource quota deployed in the cluster.

Note that quota limits are normally higher than quota requests. If your limit usage is too close to 100%, you might face scheduling issues; the Pod Scheduling Troubleshooting dashboard can help you troubleshoot this scenario. Conversely, if limit usage is very low, the resource quota could be rightsized.

Routes

List the routes

The following query lists all the routes present in the cluster, aggregated by route, host, and namespace:

sum by (route, host, namespace) (openshift_route_info)

Duplicated routes

Now, let’s find our duplicated routes:

sum by (host) (openshift_route_info) > 1

This query will return the duplicated hosts. If you want the full information for the duplicated routes, try this one:

openshift_route_info * on (host) group_left(host_name) label_replace((sum by (host) (openshift_route_info) > 1), "host_name", "$0", "host", ".+")

Why the label_replace? To get the full information, the openshift_route_info metric has to be joined with itself. However, because both sides of the join carry the same labels, there is no extra label for group_left to bring over.

The workaround is to use label_replace to create a new label, host_name, that holds the content of the host label. With that extra label on the right-hand side, the join works.
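The self-join can be easier to follow step by step. The following Python sketch emulates, in a simplified way, what the query does: find hosts used by more than one route, copy host into host_name, then join back on host. The route data and the label_replace helper are hypothetical stand-ins written for this illustration; they are not part of any Prometheus client library, and only $N references are handled.

```python
import re
from collections import Counter

# Hypothetical openshift_route_info label sets; routes and hosts are made up.
ROUTES = [
    {"route": "web", "namespace": "shop", "host": "shop.example.com"},
    {"route": "web-old", "namespace": "legacy", "host": "shop.example.com"},
    {"route": "api", "namespace": "shop", "host": "api.example.com"},
]


def label_replace(labels, dst, replacement, src, regex):
    """Rough emulation of PromQL label_replace for a single label set."""
    match = re.fullmatch(regex, labels.get(src, ""))  # the regex is fully anchored
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Expand $0, $1, ... with the corresponding capture groups.
    value = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))) or "",
        replacement,
    )
    return {**labels, dst: value}


# Inner part of the query: hosts used by more than one route.
host_counts = Counter(r["host"] for r in ROUTES)
duplicated = [{"host": host} for host, n in host_counts.items() if n > 1]

# label_replace(..., "host_name", "$0", "host", ".+") copies host into host_name.
right_side = [label_replace(s, "host_name", "$0", "host", ".+") for s in duplicated]

# Join on host; group_left(host_name) copies the extra label onto each route.
host_name_by_host = {s["host"]: s["host_name"] for s in right_side}
duplicated_routes = [
    {**r, "host_name": host_name_by_host[r["host"]]}
    for r in ROUTES
    if r["host"] in host_name_by_host
]
for r in duplicated_routes:
    print(r["namespace"], r["route"], r["host_name"])
```

Each output row keeps the full label set of the original route, which is the extra detail the simpler sum by (host) query drops.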

Routes with issues

The following query returns the routes with issues, that is, routes with a False status:

openshift_route_status{status="False"} > 0

Builds

New builds, by processing time

The following query lists the new builds by how long they have been processing. It can be useful for detecting slow build processes.

time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)
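As a rough illustration of the arithmetic in this query, the Python sketch below subtracts each build's creation timestamp from the current time, keeping only builds whose phase is new. The build names, timestamps, and the new_build_ages helper are hypothetical.

```python
import time

# Hypothetical builds: creation timestamp (seconds since epoch) and current phase.
BUILDS = [
    {"build": "app-1", "created": 1_700_000_000, "phase": "new"},
    {"build": "app-2", "created": 1_700_000_600, "phase": "complete"},
    {"build": "lib-7", "created": 1_700_000_900, "phase": "new"},
]


def new_build_ages(builds, now=None):
    """Return {build: seconds since creation} for builds still in the 'new' phase."""
    now = time.time() if now is None else now
    return {b["build"]: now - b["created"] for b in builds if b["phase"] == "new"}


# A fixed "now" keeps the example deterministic; omit it to use the current time.
ages = new_build_ages(BUILDS, now=1_700_001_000)
# Oldest first, so the builds that have waited longest appear at the top.
for build, age in sorted(ages.items(), key=lambda item: item[1], reverse=True):
    print(build, age)
```

Builds that stay near the top of this list for a long time are the ones worth investigating first.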

Builds with errors

Use this query to get builds that are in failed or error state.

sum by (build, buildconfig, kube_namespace_name, kube_cluster_name) (openshift_build_status_phase_total{build_phase=~"failed|error"}) > 0