OpenShift State Metrics
This integration is enabled by default.
Versions supported: > v4.7
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 4 metrics.
Timeseries generated: 30 timeseries + 4 series per route
List of Alerts
| Alert | Description | Format | 
|---|---|---|
| [OpenShift-state-metrics] CPU Resource Request Quota Usage | Resource request CPU usage is over 90% resource quota. | Prometheus | 
| [OpenShift-state-metrics] CPU Resource Limit Quota Usage | Resource limit CPU usage is over 90% resource limit quota. | Prometheus | 
| [OpenShift-state-metrics] Memory Resource Request Quota Usage | Resource request memory usage is over 90% resource quota. | Prometheus | 
| [OpenShift-state-metrics] Memory Resource Limit Quota Usage | Resource limit memory usage is over 90% resource limit quota. | Prometheus | 
| [OpenShift-state-metrics] Routes with issues | A route status is in error and is having issues. | Prometheus | 
| [OpenShift-state-metrics] Buid Processes with issues | A build process is in error or failed status. | Prometheus | 
List of Dashboards
OpenShift v4 State Metrics
If you are using Prometheus Remote Write you will need to add the following metric relabel config for this label.
    - action: replace
    source_labels: [ __address__ ]
    target_label: _sysdig_integration_openshift_state_metrics 
    replacement: true
The dashboard provides information on the special OpenShift-state-metrics.

List of Metrics
| Metric name | 
|---|
| openshift_build_created_timestamp_seconds | 
| openshift_build_status_phase_total | 
| openshift_clusterresourcequota_usage | 
| openshift_route_status | 
Prerequisites
None.
Installation
Installing an exporter is not required for this integration.
Monitoring and Troubleshooting OpenShift State Metrics
No further installation is needed, since OKD4 comes with both Prometheus and OSM ready to use.
Here are some interesting metrics and queries to monitor and troubleshoot OpenShift 4.
Resource Quotas
Resource Quotas Requests
% CPU Used vs Request Quota
Let’s get what’s the % of CPU used vs the request quota.
sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="used"}) / sum by (name, kube_cluster_name) (openshift_clusterresourcequota_usage{resource="requests.cpu", type="hard"}) > 0
% Memory Used vs Request Quota
Now, the same but for the memory.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="requests.memory", type="hard"}) > 0
These queries return one time series for each resource quota deployed in the cluster.
Please, not that if your requests are near 100%, you can use the Pod Rightsizing & Workload Capacity Optimization dashboard to fix it. You can also talk to your cluster administrator to check your resource quota. Also, if your requests are too low, the resource quota could be rightsized.
Resource Quotas Limits
% CPU Used vs Limit Quota
Let’s get what’s the % of CPU used vs the limit quota.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.cpu", type="hard"}) > 0
% Memory Used vs Limit Quota
Now, the same but for the memory.
sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="used"}) / sum by (name, kube_cluster_name)(openshift_clusterresourcequota_usage{resource="limits.memory", type="hard"}) > 0
These queries return one time series for each resource quota deployed in the cluster.
Please, note that quota limits are normally higher than the quota requests. If your limits are too close to 100%, you might face scheduling issues. The Pod Scheduling Troubleshooting dashboard might help you to troubleshoot this scenario. Also, if limit usage is too low, the resource quota could be rightsized.
Routes
List the Routes
Let’s get a list of all the routes present in the cluster, aggregated by host and namespace
sum by (route, host, namespace) (openshift_route_info)
Duplicated Routes
Now, let’s find our duplicated routes:
sum by (host) (openshift_route_info) > 1
This query will return the duplicated hosts. If you want the full information for the duplicated routes, try this one:
openshift_route_info * on (host) group_left(host_name) label_replace((sum by (host) (openshift_route_info) > 1), "host_name", "$0", "host", ".+")
Why the label_replace? because to get the full info, joining the openshift_route_info metric with itself was necessary, but, as both sides of the join have the same labels, there wasn’t any extra label to join by.
What you can do is to perform a label_replace to create a new label host_name with the content of the host label and the join will work.
Routes with Issues
Let’s get what are the routes with issues (a.k.a. routes with a False status)
openshift_route_status{status == 'False'} > 0
Builds
New Builds, by Processing Time
Let’s list the new builds, by how many time they have been processing. This query can be useful to detect slow processes.
time() - (openshift_build_created_timestamp_seconds) * on (build) group_left(build_phase) (openshift_build_status_phase_total{build_phase="new"} == 1)
Builds with Errors
Use this query to get builds that are in failed or error state.
sum by (build, buildconfig, kube_namespace_name, kube_cluster_name) (openshift_build_status_phase_total{build_phase=~"failed|error"}) > 0
Agent Configuration
The default agent job for this integration is as follows:
- job_name: 'openshift-state-metrics'
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  scheme: https
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    regex: (openshift-state-metrics)
    replacement: openshift-state-metrics
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "openshift-state-metrics"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:8443
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  - action: replace
    source_labels: [ __address__ ]
    target_label: _sysdig_integration_openshift_state_metrics 
    replacement: true
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (openshift_build_created_timestamp_seconds|openshift_build_status_phase_total|openshift_clusterresourcequota_usage|openshift_route_status)
    action: keep