OpenShift Scheduler
This integration is enabled by default.
Versions supported: > v4.7
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 20 metrics.
Timeseries generated: ~300
List of Alerts
Alert | Description | Format |
---|---|---|
[OpenShift Scheduler] Process Down | Scheduler has disappeared from target discovery. | Prometheus |
[OpenShift Scheduler] Failed Attempts to Schedule Pods | Scheduler Failed Attempts to Schedule Pods. | Prometheus |
[OpenShift Scheduler] High 4xx RequestError Rate | Scheduler High 4xx Request Error Rate. | Prometheus |
[OpenShift Scheduler] High 5xx RequestError Rate | Scheduler High 5xx Request Error Rate. | Prometheus |
List of Dashboards
OpenShift v4 Scheduler
If you are using Prometheus Remote Write, you need to add the following metric relabel config, which sets the label used by this dashboard.
- action: replace
  source_labels: [ __address__ ]
  target_label: _sysdig_integration_openshift_scheduler
  replacement: true
The dashboard provides information on the OpenShift Scheduler.
List of Metrics
Metric name |
---|
go_goroutines |
rest_client_request_duration_seconds_count |
rest_client_request_duration_seconds_sum |
rest_client_requests_total |
scheduler_e2e_scheduling_duration_seconds_count |
scheduler_e2e_scheduling_duration_seconds_sum |
scheduler_pending_pods |
scheduler_pod_scheduling_attempts_count |
scheduler_pod_scheduling_attempts_sum |
scheduler_schedule_attempts_total |
sysdig_container_cpu_cores_used |
sysdig_container_memory_used_bytes |
workqueue_adds_total |
workqueue_depth |
workqueue_queue_duration_seconds_count |
workqueue_queue_duration_seconds_sum |
workqueue_retries_total |
workqueue_unfinished_work_seconds |
workqueue_work_duration_seconds_count |
workqueue_work_duration_seconds_sum |
Prerequisites
None.
Installation
Installing an exporter is not required for this integration.
How to monitor OpenShift Scheduler with Sysdig agent
No further installation is needed, since OpenShift 4.X comes with both Prometheus and the Scheduler ready to use. OpenShift Scheduler metrics are exposed through the Prometheus /federate endpoint.
Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Scheduler.
Scheduling
Failed attempts to Schedule pods
An unschedulable pod is one that the scheduler could not place on any node. Use this query to check the ratio of failed scheduling attempts:
sum by (kube_cluster_name,kube_pod_name,result) (rate(scheduler_schedule_attempts_total{result!~"scheduled"}[10m])) / ignoring(result) group_left sum by (kube_cluster_name,kube_pod_name)(rate(scheduler_schedule_attempts_total[10m]))
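The query above can be turned into an alerting rule. The following is a minimal sketch in Prometheus rule-file syntax, not the alert shipped with this integration: the group name, alert name, 5% threshold, `for` duration, and severity label are all assumptions to adapt to your environment.

```yaml
groups:
  - name: openshift-scheduler-example          # hypothetical group name
    rules:
      - alert: SchedulerFailedSchedulingAttempts   # hypothetical alert name
        # Fraction of scheduling attempts per non-"scheduled" result.
        # The 0.05 threshold (5%) is an assumed value.
        expr: |
          sum by (kube_cluster_name, kube_pod_name, result) (
            rate(scheduler_schedule_attempts_total{result!~"scheduled"}[10m])
          )
          / ignoring(result) group_left
          sum by (kube_cluster_name, kube_pod_name) (
            rate(scheduler_schedule_attempts_total[10m])
          )
          > 0.05
        for: 10m                                # assumed hold time
        labels:
          severity: warning                     # assumed severity
        annotations:
          summary: "Scheduler failed attempts to schedule pods"
```

Because `/` binds more tightly than `>` in PromQL, the comparison applies to the ratio as a whole.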
Pending pods
Check that there are no pods in pending queues with this query:
topk(30,avg_over_time(scheduler_pending_pods[10m]))
Work Queue
Work Queue Retries
The total number of retries that have been handled by the work queue. This value should be near 0.
topk(30,rate(workqueue_retries_total{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))
Work Queue Latency
Queue latency is the time that tasks spend in the queue before being processed. This query returns the average latency over a 10-minute window:
topk(30,rate(workqueue_queue_duration_seconds_sum{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))
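If this average is queried often, it may be convenient to precompute it. The sketch below shows the same sum-over-count ratio as a Prometheus recording rule; the group name and the recorded series name are hypothetical and follow the common `level:metric:operations` naming pattern.

```yaml
groups:
  - name: openshift-scheduler-workqueue-example   # hypothetical group name
    rules:
      # Average time in queue = rate of total seconds / rate of item count.
      - record: job:workqueue_queue_duration_seconds:avg_rate10m   # hypothetical name
        expr: |
          rate(workqueue_queue_duration_seconds_sum{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m])
          /
          rate(workqueue_queue_duration_seconds_count{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m])
```

Both sides carry identical label sets, so the division matches series one-to-one without any `on` or `ignoring` modifier.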
Work Queue Depth
Check the depth of the queue. High values can indicate saturation of the scheduler.
topk(30,avg_over_time(workqueue_depth{container_name=~".*kube-scheduler.*"}[10m]))
Scheduler API Requests
Kube API Requests by code
Check that there are no 5xx or 4xx error codes in the scheduler requests.
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"5.."}[10m]))
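An absolute error rate is hard to judge without knowing the total request volume, so an alert is often expressed as a ratio instead. The following sketch, in Prometheus rule-file syntax, divides the 5xx rate by the overall request rate. It is not the alert shipped with this integration: the group name, alert name, 5% threshold, `for` duration, and severity are assumptions.

```yaml
groups:
  - name: openshift-scheduler-api-example       # hypothetical group name
    rules:
      - alert: SchedulerHigh5xxRequestErrorRate   # hypothetical alert name
        # Share of scheduler API requests answered with a 5xx code.
        # The 0.05 threshold (5%) is an assumed value.
        expr: |
          sum by (kube_cluster_name, kube_pod_name) (
            rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"5.."}[10m])
          )
          /
          sum by (kube_cluster_name, kube_pod_name) (
            rate(rest_client_requests_total{container_name=~".*kube-scheduler.*"}[10m])
          )
          > 0.05
        for: 10m                                 # assumed hold time
        labels:
          severity: warning                      # assumed severity
        annotations:
          summary: "Scheduler high 5xx request error rate"
```

Replacing `5..` with `4..` in the numerator gives the equivalent check for 4xx responses.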
Agent Configuration
The default agent job for this integration is as follows:
- job_name: openshift-scheduler-default
  honor_labels: true
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"scheduler",__name__=~"scheduler_schedule_attempts_total|scheduler_pod_scheduling_attempts_sum|scheduler_pod_scheduling_attempts_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_e2e_scheduling_duration_seconds_count|scheduler_pending_pods|workqueue_retries_total|workqueue_work_duration_seconds_sum|workqueue_work_duration_seconds_count|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_sum|workqueue_queue_duration_seconds_count|workqueue_depth|workqueue_adds_total|rest_client_requests_total|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count|go_goroutines"}'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - action: keep
      source_labels: [__meta_kubernetes_pod_host_ip]
      regex: __HOSTIPS__
    - source_labels: [__meta_kubernetes_pod_phase]
      action: keep
      regex: Running
    - action: keep
      source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_pod_name
      separator: '/'
      regex: 'openshift-monitoring/prometheus-k8s-0'
    # Holding on to pod-id and container name so we can associate the metrics
    # with the container (and cluster hierarchy)
    - action: replace
      source_labels: [__meta_kubernetes_pod_uid]
      target_label: sysdig_k8s_pod_uid
    - action: replace
      source_labels: [__meta_kubernetes_pod_container_name]
      target_label: sysdig_k8s_pod_container_name
    # Remove extended labelset
    - action: replace
      replacement: true
      target_label: sysdig_omit_source
    - action: replace
      source_labels: [ __address__ ]
      target_label: _sysdig_integration_openshift_scheduler
      replacement: true
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: (go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
      action: keep
    - action: replace
      source_labels: [namespace]
      target_label: kube_namespace_name
    - action: replace
      source_labels: [pod]
      target_label: kube_pod_name
    - action: replace
      source_labels: [container]
      target_label: container_name
    - action: replace
      source_labels: [job]
      regex: '(.*)'
      target_label: job
      replacement: 'openshift-$1-default'
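To make the effect of the metric relabeling rules in this job concrete, the sketch below shows one series before and after they run. It is an illustration rather than a configuration file: the `before` and `after` keys only present the two label sets, the namespace, pod, and container values are assumed examples, and only the labels touched by `metric_relabel_configs` are shown.

```yaml
# Illustrative only: all label values are assumed examples.
before:                                   # labels as returned by /federate
  __name__: workqueue_depth
  job: scheduler
  namespace: openshift-kube-scheduler
  pod: openshift-kube-scheduler-master-0
  container: kube-scheduler
after:                                    # labels after metric_relabel_configs
  __name__: workqueue_depth               # kept, because the name matches the keep regex
  job: openshift-scheduler-default        # 'openshift-$1-default' with $1 = scheduler
  namespace: openshift-kube-scheduler     # replace copies; the source label remains
  kube_namespace_name: openshift-kube-scheduler
  pod: openshift-kube-scheduler-master-0
  kube_pod_name: openshift-kube-scheduler-master-0
  container: kube-scheduler
  container_name: kube-scheduler
```

This rewriting is why the queries on this page can filter on `job=~"kube-scheduler-default|openshift-scheduler-default"` and `container_name=~".*kube-scheduler.*"`, and aggregate by `kube_pod_name`.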