OpenShift Scheduler

Metrics, Dashboards, Alerts and more for OpenShift Scheduler Integration in Sysdig Monitor.
This integration is enabled by default.

Versions supported: > v4.7

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 20 metrics.

Timeseries generated: Scheduler generates ~300 timeseries

List of Alerts

Alert | Description | Format
[OpenShift Scheduler] Process Down | Scheduler has disappeared from target discovery. | Prometheus
[OpenShift Scheduler] Failed Attempts to Schedule Pods | Scheduler Failed Attempts to Schedule Pods. | Prometheus
[OpenShift Scheduler] High 4xx Request Error Rate | Scheduler High 4xx Request Error Rate. | Prometheus
[OpenShift Scheduler] High 5xx Request Error Rate | Scheduler High 5xx Request Error Rate. | Prometheus

List of Dashboards

OpenShift v4 Scheduler

If you are using Prometheus Remote Write, add the following metric relabel config to set the _sysdig_integration_openshift_scheduler label:

    - action: replace
      source_labels: [ __address__ ]
      target_label: _sysdig_integration_openshift_scheduler
      replacement: true

The dashboard provides information on the OpenShift Scheduler.

List of Metrics

Metric name
go_goroutines
rest_client_request_duration_seconds_count
rest_client_request_duration_seconds_sum
rest_client_requests_total
scheduler_e2e_scheduling_duration_seconds_count
scheduler_e2e_scheduling_duration_seconds_sum
scheduler_pending_pods
scheduler_pod_scheduling_attempts_count
scheduler_pod_scheduling_attempts_sum
scheduler_schedule_attempts_total
sysdig_container_cpu_cores_used
sysdig_container_memory_used_bytes
workqueue_adds_total
workqueue_depth
workqueue_queue_duration_seconds_count
workqueue_queue_duration_seconds_sum
workqueue_retries_total
workqueue_unfinished_work_seconds
workqueue_work_duration_seconds_count
workqueue_work_duration_seconds_sum

Prerequisites

None.

Installation

Installing an exporter is not required for this integration.

How to monitor OpenShift Scheduler with Sysdig agent

No further installation is needed, because OpenShift 4.x comes with both Prometheus and the Scheduler ready to use. OpenShift Scheduler metrics are exposed through the /federate endpoint.
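
As a rough illustration of what a federation request looks like, the sketch below builds a /federate URL with a match[] selector in the same style as the agent job shown under Agent Configuration. The base URL is a hypothetical placeholder and the selector is shortened to two metric names; neither is the exact value the agent uses.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL; replace with the address of your cluster's Prometheus.
PROMETHEUS_BASE = "https://prometheus.example.internal"

# Selector in the style of the agent job's match[] parameter,
# shortened to two metric names for readability.
SELECTOR = '{job=~"scheduler",__name__=~"scheduler_pending_pods|workqueue_depth"}'


def federate_url(base, selectors):
    """Return a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base}/federate?{query}"


url = federate_url(PROMETHEUS_BASE, [SELECTOR])
print(url)

# Decoding the query string recovers the original selector.
print(parse_qs(urlsplit(url).query)["match[]"][0] == SELECTOR)
```

The selector has to be URL-encoded because it contains braces, quotes, and pipe characters; urlencode takes care of that here.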

Here are some interesting metrics and queries to monitor and troubleshoot OpenShift Scheduler.

Scheduling

Failed attempts to Schedule pods

Unschedulable pods are pods that could not be scheduled. Use this query to check for failed attempts, shown as a fraction of all scheduling attempts:

sum by (kube_cluster_name,kube_pod_name,result) (rate(scheduler_schedule_attempts_total{result!~"scheduled"}[10m])) / ignoring(result) group_left sum by (kube_cluster_name,kube_pod_name)(rate(scheduler_schedule_attempts_total[10m]))
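
To make the arithmetic of this query concrete, here is a minimal sketch that computes the same ratio from counter increases. Because the numerator and denominator use the same 10-minute window, dividing increases gives the same result as dividing rates. The function name and the sample numbers are made up for illustration.

```python
def failure_ratios(increase_by_result):
    """Share of scheduling attempts per non-"scheduled" result.

    increase_by_result maps each `result` label value to the increase of
    scheduler_schedule_attempts_total over the same time window.
    """
    total = sum(increase_by_result.values())
    if total == 0:
        return {}  # no attempts in the window, so no ratio to report
    return {
        result: increase / total
        for result, increase in increase_by_result.items()
        if result != "scheduled"
    }


# Made-up numbers for one scheduler pod over a 10-minute window.
sample = {"scheduled": 90, "unschedulable": 8, "error": 2}
print(failure_ratios(sample))  # {'unschedulable': 0.08, 'error': 0.02}
```

A value of 0.08 for unschedulable means that 8% of all scheduling attempts in the window ended with that result.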

Pending pods

Check that there are no pods in pending queues with this query:

topk(30, scheduler_pending_pods)

Work Queue

Work Queue Retries

This query shows the rate of retries handled by the work queue. The value should be near 0.

topk(30,rate(workqueue_retries_total{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))

Work Queue Latency

Queue latency is the time that tasks spend in the queue before being processed:

topk(30,rate(workqueue_queue_duration_seconds_sum{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]) / rate(workqueue_queue_duration_seconds_count{job=~"kube-scheduler-default|openshift-scheduler-default"}[10m]))
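
The query divides the rate of the _sum series by the rate of the _count series, which yields the mean time an item waited in the queue during the window. A minimal sketch of that calculation, using hypothetical start and end samples of the two counters:

```python
def mean_queue_latency(sum_start, sum_end, count_start, count_end):
    """Mean seconds an item spent queued between two counter samples.

    Returns None when no items were processed in the window
    (PromQL would yield NaN for the equivalent 0/0 division).
    """
    processed = count_end - count_start
    if processed <= 0:
        return None
    return (sum_end - sum_start) / processed


# Made-up samples: 60 items processed, 3 seconds of total queue time.
print(mean_queue_latency(12.0, 15.0, 400, 460))  # 0.05
```

Here 3 seconds of accumulated wait spread over 60 items gives a mean latency of 0.05 seconds per item.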

Work Queue Depth

Check the depth of the queue. High values can indicate saturation of the scheduler.

topk(30, workqueue_depth{container_name=~".*kube-scheduler.*"})

Scheduler API Requests

Kube API Requests by code

Check that there are no 5xx or 4xx error codes in the scheduler requests.

sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"4.."}[10m]))
sum by (kube_cluster_name,kube_pod_name)(rate(rest_client_requests_total{container_name=~".*kube-scheduler.*",code=~"5.."}[10m]))
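
The matchers code=~"4.." and code=~"5.." are fully anchored, so each one selects only three-character codes that start with 4 or 5. The sketch below applies the same classification to a made-up set of per-code request counts; the function name and the numbers are illustrative only.

```python
import re

# Same patterns as the code matchers in the queries above.
ERROR_CLASSES = {"4xx": re.compile(r"4.."), "5xx": re.compile(r"5..")}


def error_counts(requests_by_code):
    """Sum request counts per error class, ignoring non-error codes."""
    totals = {name: 0 for name in ERROR_CLASSES}
    for code, count in requests_by_code.items():
        for name, pattern in ERROR_CLASSES.items():
            if pattern.fullmatch(code):  # anchored, like a PromQL matcher
                totals[name] += count
    return totals


# Made-up request counts keyed by the `code` label.
sample = {"200": 950, "201": 20, "404": 12, "409": 3, "500": 1, "503": 4}
print(error_counts(sample))  # {'4xx': 15, '5xx': 5}
```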

Agent Configuration

The default agent job for this integration is as follows:

- job_name: openshift-scheduler-default
  honor_labels: true
  scheme: https
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  tls_config:
    insecure_skip_verify: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job=~"scheduler",__name__=~"scheduler_schedule_attempts_total|scheduler_pod_scheduling_attempts_sum|scheduler_pod_scheduling_attempts_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_e2e_scheduling_duration_seconds_count|scheduler_pending_pods|workqueue_retries_total|workqueue_work_duration_seconds_sum|workqueue_work_duration_seconds_count|workqueue_unfinished_work_seconds|workqueue_queue_duration_seconds_sum|workqueue_queue_duration_seconds_count|workqueue_depth|workqueue_adds_total|rest_client_requests_total|rest_client_request_duration_seconds_sum|rest_client_request_duration_seconds_count|go_goroutines"}'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: keep
    source_labels:
    - __meta_kubernetes_namespace
    - __meta_kubernetes_pod_name
    separator: '/'
    regex: 'openshift-monitoring/prometheus-k8s-0'
  # Holding on to pod-id and container name so we can associate the metrics
  # with the container (and cluster hierarchy)
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  # Remove extended labelset
  - action: replace
    replacement: true
    target_label: sysdig_omit_source
  - action: replace
    source_labels: [ __address__ ]
    target_label: _sysdig_integration_openshift_scheduler 
    replacement: true
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (go_goroutines|rest_client_request_duration_seconds_count|rest_client_request_duration_seconds_sum|rest_client_requests_total|scheduler_e2e_scheduling_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_sum|scheduler_pending_pods|scheduler_pod_scheduling_attempts_count|scheduler_pod_scheduling_attempts_sum|scheduler_schedule_attempts_total|sysdig_container_cpu_cores_used|sysdig_container_memory_used_bytes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_count|workqueue_queue_duration_seconds_sum|workqueue_retries_total|workqueue_unfinished_work_seconds|workqueue_work_duration_seconds_count|workqueue_work_duration_seconds_sum)
    action: keep
  - action: replace
    source_labels: [namespace]
    target_label: kube_namespace_name
  - action: replace
    source_labels: [pod]
    target_label: kube_pod_name
  - action: replace
    source_labels: [container]
    target_label: container_name
  - action: replace
    source_labels: [job]
    regex: '(.*)'
    target_label: job
    replacement: 'openshift-$1-default'
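
To illustrate what the keep rule and the final replace rule in metric_relabel_configs do, here is a simplified Python sketch. It only approximates Prometheus relabeling (Prometheus uses RE2 regexes and its own implementation); the helper names are hypothetical and the keep regex is shortened to three metric names.

```python
import re


def relabel_replace(value, regex, replacement):
    """Mimic a relabel `replace` action on a single label value."""
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match is None:
        return value  # no match: the label is left unchanged
    # Translate Prometheus-style $1 references into Python group references.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


def is_kept(metric_name, keep_regex):
    """Mimic a `keep` action on the __name__ label."""
    return re.fullmatch(keep_regex, metric_name) is not None


# Shortened version of the keep regex, for illustration only.
KEEP = r"(go_goroutines|scheduler_pending_pods|workqueue_depth)"

print(relabel_replace("scheduler", r"(.*)", "openshift-$1-default"))
print(is_kept("workqueue_depth", KEEP))
print(is_kept("workqueue_depth_extra", KEEP))
```

With job="scheduler" as input, the replace rule produces openshift-scheduler-default, which is one of the job values matched by the Work Queue queries earlier on this page. The last line shows that a name which merely starts with a listed metric is not kept, because the regex must match the whole name.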