Clusters Data

This topic discusses the Clusters Overview page and helps you understand its gauge charts and the data displayed on them.

About Clusters Overview

In Kubernetes, a pool of nodes combines its resources to form a more powerful machine: a cluster. The Clusters Overview page provides key metrics indicating the health, risk, capacity, and compliance of each cluster. Your clusters can reside in any cloud or multi-cloud environment of your choice.

Each row in the Clusters page represents a cluster. Clusters are sorted by the severity of corresponding events in order to highlight the areas that need attention. For example, a cluster with high-severity events is bubbled up to the top of the page to highlight the issue. You can further drill down to the Nodes or Namespaces Overview page to investigate at each level.

In environments where Sysdig Secure is not enabled, Network I/O is shown instead of the Compliance score.

Interpret the Cluster Data

This topic gives insight into the metrics displayed on the Clusters Overview screen.

Node Ready Status

The chart shows the latest value returned by avg(min(kubernetes.node.ready)).

What Is It?

The number shows the readiness for nodes to accept pods across the entire cluster. The numeric availability indicates the percentage of time the nodes are reported as ready by Kubernetes. For example:

  • 100% is displayed when 10 out of 10 nodes are ready for the entire time window, say, for the last one hour.

  • 95% is displayed when 9 out of 10 nodes are ready for the entire time window and one node is ready only for 50% of the time.
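
The following minimal sketch reproduces the arithmetic of the second example in Python; the per-node readiness fractions are hypothetical values, not the result of an actual query:

    # Hypothetical per-node readiness fractions over the last hour:
    # 9 nodes ready the whole time, 1 node ready only 50% of the time.
    node_ready_fractions = [1.0] * 9 + [0.5]

    # avg(min(kubernetes.node.ready)): average the per-node readiness
    # across all nodes to get the cluster-wide availability.
    availability = sum(node_ready_fractions) / len(node_ready_fractions)
    print(f"{availability:.0%}")  # 95%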

The bar chart displays the trend across the selected time window, and each bar represents a time slice. For example, selecting the last 1-hour window displays 6 bars, each indicating a 10-minute time slice. Each bar represents the availability across the time slice (green) or the unavailability (red).

For instance, the following image shows an average availability of 80% across the last hour, where each 10-minute time slice shows the same constant availability:

What to Expect?

Expect a constant 100% at all times.

What to Do Otherwise?

If the value is less than 100%, determine whether a node is completely unavailable, or whether one or more nodes are only partially available.

  • Drill down either to the Nodes screen in Overview or to the “Kubernetes Cluster Overview” in Explore to see the list of nodes and their availability.

  • Check the Kubernetes Node Overview dashboard in Explore to identify the problem that Kubernetes reports.

Pods Available vs Desired

The chart shows the latest value returned by sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count)).

What Is It?

The chart displays the ratio between available and desired pods, averaged across the selected time window, for all the pods in a given Cluster. The upper bound shows the number of desired pods in the Cluster.

For instance, the following image shows that all 42 desired pods are available:
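
If you want to reproduce a similar number outside of Sysdig, the sketch below sums available and desired replicas across Deployments using the official Kubernetes Python client. This is only an approximation: the Sysdig metric covers all pod owners per Namespace, while this example looks at Deployments only.

    from kubernetes import client, config

    # Assumes a reachable cluster and a local kubeconfig.
    config.load_kube_config()
    apps = client.AppsV1Api()

    desired = available = 0
    for d in apps.list_deployment_for_all_namespaces().items:
        desired += d.spec.replicas or 0
        available += d.status.available_replicas or 0

    if desired:
        print(f"{available}/{desired} pods available ({available / desired:.0%})")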

What to Expect?

You should typically expect 100%.

If certain pods take a long time to become available, you might temporarily see a value lower than 100%. Image pulls, pod initialization, readiness probes, and so on cause such delays.

What to Do Otherwise?

Identify one or more Namespaces that have lower availability. To do so, drill down to the Namespaces screen, then drill down to the Workloads screen to identify the unavailable pods.

If the number of unavailable pods is considerably high (the ratio is significantly low), check the status of the Nodes. A Node failure causes several pods to become unavailable across most of the Namespaces.
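
As an alternative to drilling down through the UI, the following sketch (using the official Kubernetes Python client) lists Pending pods together with the scheduler's explanation, which usually points to one of the factors below:

    from kubernetes import client, config

    # Assumes a reachable cluster and a local kubeconfig.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    for pod in pending.items:
        # The PodScheduled condition carries the scheduler's explanation,
        # for example "0/10 nodes are available: insufficient cpu".
        for cond in pod.status.conditions or []:
            if cond.type == "PodScheduled" and cond.status == "False":
                print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.message}")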

Several factors could cause pods to get stuck in the Pending state:

  • Pods request resources that exceed what is available across the nodes (the remaining allocatable resources).

  • Pods make requests higher than what any single node can provide. For example, you have 8-core nodes and you create a pod with a 16-core request. Such pods might require reconfiguration or a specific setup related to node affinity and anti-affinity constraints.

  • The quota set at the Namespace level is reached (see the sketch after this list).

    If a quota is enforced at the Namespace level, you may hit the limit regardless of the resource availability across the Nodes.
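
For the quota case, a quick way to check whether a Namespace is at its limit is to compare the used and hard values of its ResourceQuota objects. The following is a minimal sketch using the official Kubernetes Python client; "my-namespace" is a placeholder for the Namespace you are investigating:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # "my-namespace" is a placeholder; use the Namespace you are investigating.
    for quota in v1.list_namespaced_resource_quota("my-namespace").items:
        for resource, hard in (quota.status.hard or {}).items():
            used = (quota.status.used or {}).get(resource, "0")
            print(f"{quota.metadata.name}: {resource} used {used} of {hard}")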

CPU Requests vs Allocatable

The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores)).

What Is It?

The chart displays the ratio between CPU requests configured for all the pods in a selected Cluster and allocatable CPUs across all the nodes.

The upper bound shows the number of allocatable CPU cores across all the nodes in the Cluster.

For instance, the image below shows that out of 620 available CPU cores across all the nodes (allocatable CPUs), 71% is requested by the pods:

What to Expect?

Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Staying at or below this percentage protects you if a node becomes unavailable, because the remaining nodes can still accommodate the requested resources.
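
The sketch below approximates the same ratio directly from the Kubernetes API, summing the CPU requests of scheduled pods and comparing them against the allocatable CPU of all nodes. It assumes the official Kubernetes Python client and a local kubeconfig; the quantity parser handles only plain core counts and millicores:

    from kubernetes import client, config

    def cpu_cores(quantity: str) -> float:
        # Handles plain core counts ("2") and millicores ("250m") only.
        return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

    config.load_kube_config()
    v1 = client.CoreV1Api()

    allocatable = sum(cpu_cores(n.status.allocatable["cpu"]) for n in v1.list_node().items)

    requested = 0.0
    for pod in v1.list_pod_for_all_namespaces().items:
        if not pod.spec.node_name:      # only scheduled pods, as in the chart
            continue
        for c in pod.spec.containers:
            requests = (c.resources.requests or {}) if c.resources else {}
            requested += cpu_cores(requests.get("cpu", "0"))

    print(f"CPU requests vs allocatable: {requested / allocatable:.0%} of {allocatable:g} cores")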

What to Do Otherwise?

A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

  • Applications cannot be scheduled to be run.

  • Pods might not start and remain in a Pending/Unscheduled state.

To triage, do the following:

  • Drill down to the Nodes screen to get insights into how resources are utilized across all nodes.

  • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

  • Drill down to Explore and refer to the following dashboards:

    • Kubernetes CPU Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

    • Kubernetes Workloads CPU Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

Can the Value Be Higher than 100%?

Currently, the ratio accounts only for scheduled pods; pending pods are excluded from the calculation. This means that only pods that fit within the allocatable resources of the Nodes are counted. Consequently, the ratio cannot be higher than 100%.

In the case of over-commitment (pods requesting more resources than what is available), you can expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. This indicates that most of the available resources are being used, and what is left is not enough to schedule additional pods. Therefore, the Pods Available vs Desired ratio will decrease.

When your environment has pods that are updated often, or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they actually are at any given time. Consequently, the ratio becomes higher across the selected time window, and you might see a value that is higher than 100%. This artifact is caused by how the data engine calculates the aggregated ratio.

Drill down to Kubernetes Cluster Overview to see the CPU Cores Usage vs Requests vs Allocatable time series to correctly evaluate the trend of the request commitments.

Listed below are some of the factors that could cause pods to get stuck in a Pending state:

  • Pods make requests that exceed what is available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

  • Pods make requests higher than what any single Node can provide. For example, you have 8-core Nodes and you create a pod with a 16-core request. Such pods might require reconfiguration or a specific setup related to Node affinity and anti-affinity constraints (see the sketch after this list).

  • The quota set at the Namespace level is reached. The Requests vs Allocatable ratio may not reveal the problem, but the Pods Available vs Desired ratio would decrease, especially for the affected Namespaces. See the Namespaces screen in Overview.
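
To check the second case, you can compare the pod's CPU request against the allocatable CPU of each Node. The following is a minimal sketch using the official Kubernetes Python client; the 16-core request is a hypothetical value taken from the example above:

    from kubernetes import client, config

    def cpu_cores(quantity: str) -> float:
        # Handles plain core counts ("8") and millicores ("7910m") only.
        return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

    config.load_kube_config()
    v1 = client.CoreV1Api()

    requested = 16.0  # hypothetical per-pod CPU request that stays Pending

    fitting_nodes = [
        n.metadata.name
        for n in v1.list_node().items
        if cpu_cores(n.status.allocatable["cpu"]) >= requested
    ]
    if not fitting_nodes:
        print(f"No node has {requested:g} allocatable cores; the pod can never be scheduled.")
    else:
        print("Nodes large enough for the request:", fitting_nodes)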

Memory Requests vs Allocatable

The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes)).

What Is It?

The chart displays the ratio between memory requests configured for all the pods in the Cluster and allocatable memory available across all the Nodes.

The upper bound shows the allocatable memory available across all Nodes. The value is measured in bytes and displayed in an appropriate unit.

For instance, the image below shows that out of 29.7 GiB available across all Nodes (allocatable memory), 35% is requested by the pods:

What to Expect?

Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is (node_count - 1) / node_count x 100. For example, the upper bound is 90% if you have 10 nodes. Staying at or below this percentage protects your system if a node becomes unavailable.
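
As with CPU, you can approximate the ratio straight from the Kubernetes API by summing the memory requests of scheduled pods against the allocatable memory of all Nodes. The sketch assumes the official Kubernetes Python client and a local kubeconfig; the quantity parser covers only plain byte values and the common Ki/Mi/Gi suffixes:

    from kubernetes import client, config

    UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

    def mem_bytes(quantity: str) -> float:
        # Handles plain bytes ("134217728") and Ki/Mi/Gi suffixes only.
        for suffix, factor in UNITS.items():
            if quantity.endswith(suffix):
                return float(quantity[:-2]) * factor
        return float(quantity)

    config.load_kube_config()
    v1 = client.CoreV1Api()

    allocatable = sum(mem_bytes(n.status.allocatable["memory"]) for n in v1.list_node().items)

    requested = 0.0
    for pod in v1.list_pod_for_all_namespaces().items:
        if not pod.spec.node_name:      # only scheduled pods, as in the chart
            continue
        for c in pod.spec.containers:
            requests = (c.resources.requests or {}) if c.resources else {}
            requested += mem_bytes(requests.get("memory", "0"))

    print(f"Memory requests vs allocatable: {requested / allocatable:.0%} "
          f"of {allocatable / 1024**3:.1f} GiB")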

What to Do Otherwise?

A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

  • Applications cannot be scheduled to be run.

  • Pods might not start and remain in a Pending/Unscheduled state.

To troubleshoot, do the following:

  • Drill down to the Nodes screen to get insights into how resources are utilized across all the Nodes.

  • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

  • Drill down to Explore and refer to the following dashboards:

    • Kubernetes Memory Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

    • Kubernetes Workloads Memory Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

Can the Value Be Higher than 100%?

The ratio currently accounts only for scheduled pods; pending pods are excluded from the calculation. This implies that only pods that fit within the allocatable resources of the Nodes are counted. Consequently, the ratio cannot be higher than 100%.

In the case of over-commitment (pods requesting more resources than what is available), expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. This indicates that most of the available resources have been used, and what is left is not enough to schedule additional pods. Therefore, the Pods Available vs Desired ratio will decrease.

When your environment has pods that are updated often, or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they actually are at any given time. Consequently, the ratio becomes higher across the selected time window, and you might see a value that is higher than 100%. This artifact is caused by how the data engine calculates the aggregated ratio.

Drill down to Kubernetes Cluster Overview to see the Memory Requests vs Allocatable time series to correctly evaluate the trend for the request commitments.

Listed below are some of the factors that could cause your pods to get stuck in a Pending state:

  • Pods make requests that exceed what is available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

  • Pods make requests higher than what any single node can provide. For example, you have 8-core nodes and you create a pod with a 16-core request. Such pods might require configuration changes and a specific setup related to node affinity and anti-affinity constraints.

  • The quota set at the Namespace level is reached. The Requests vs Allocatable ratio might not reveal the problem, but the Pods Available vs Desired ratio would decrease, especially for the affected Namespaces. See the Namespaces screen in Overview.

Compliance Score

Kubernetes: The latest value returned by avg(avg(compliance.k8s-bench.pass_pct)).

Docker: The latest value returned by avg(avg(compliance.docker-bench.pass_pct)).

What Is It?

The numbers show the percentage of benchmarks that succeeded in the selected time window for Kubernetes and Docker entities, respectively.

What to Expect?

If you do not have Sysdig Secure enabled, or you do not have benchmarks scheduled, then you should expect no data available.

Otherwise, the higher the score, the more compliant your infrastructure is.

What to Do Otherwise?

If the score is lower than expected, drill down to Docker Compliance Report or Kubernetes Compliance Report to see further details about benchmark checks and their results.

You may also want to use the Benchmarks / Results page in Sysdig Secure to see the history of checks.