Sysdig Documentation

How to Troubleshoot Using Overview

The Overview feature provides several options to narrow down and troubleshoot infrastructure issues.

General Guidelines

Enable Overview

  • When Overview is enabled for the first time, Sysdig Monitor fetches data and generates the associated pages. Because the poll interval is 10 minutes, you might have to wait up to 10 minutes for the Overview data to fully load in the UI.

  • If the environment has just been created and the Overview feature is enabled immediately, wait up to 1 hour for the Overview pages to show the necessary data.

  • Overview uses time windows in segments of 1H, 6H, and 1D; wait for the corresponding interval (1H, 6H, or 1D) to elapse before expecting data on those Overview pages.

  • If enough data is not available during the first hour, the "No Data Available" page is displayed until that first hour has passed.

Tuning/Caching Overview Data

Sysdig Monitor leverages a caching mechanism to fetch pre-computed data for the Overview screens.

If pre-computed data is unavailable, the fetched data must be computed before it can be displayed, and this additional computation adds delay. Caching is enabled for Overview, but for optimum performance you must wait for the 1H, 6H, and 1D windows to elapse the first time you enable Overview. After that time has passed, the data is automatically cached with every passing minute.

Basics

  • Drill Down: Use the arrows to the right of each row to toggle over to an Explore window or a Sysdig Secure compliance report, as applicable.

  • Check context-sensitive events listed in the feed on the right.

  • Green-Yellow-Red at a glance: Standard color cues and visual design help target trouble areas quickly.

cluster_overview.png

When No Data is Displayed

Note

No Overview data will display unless Kube State Metrics are enabled.

If necessary, Enable Kube State Metrics and log back in.
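
If you are unsure whether kube-state-metrics is running in your cluster, you can check for its Deployment directly against the Kubernetes API. The following is a minimal sketch using the official Kubernetes Python client; it assumes local kubeconfig access and that the component follows the common kube-state-metrics naming convention, so adjust the name or namespace for your installation.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
    config.load_kube_config()

    apps = client.AppsV1Api()

    # Look for a Deployment whose name contains "kube-state-metrics" in any namespace.
    matches = [
        d for d in apps.list_deployment_for_all_namespaces().items
        if "kube-state-metrics" in d.metadata.name
    ]

    if not matches:
        print("kube-state-metrics Deployment not found: Overview data will not display.")
    for d in matches:
        ready = d.status.ready_replicas or 0
        print(f"{d.metadata.namespace}/{d.metadata.name}: {ready}/{d.spec.replicas} replicas ready")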

Otherwise, a No Data state will show up if no data is available for a specified set of metrics.

It may be displayed in Overview widgets as a message, empty gray circles, or no data in the Events feed on the right side of the page.

Specific No Data displays may include:

  • Event count: No numbers will display if no events occurred in the time window selected in the time navigation bar.

    event_count.png
  • Gauge chart shows a No Data message if:

    • Either of the two compared metrics lacks the data needed to show the chart

    • Both compared metrics have a value of 0. It wouldn't make sense to get a percentage of two metrics that are 0.

    gauge.png
  • Multi-number chart (such as Compliance Score) shows gray circles if:

    • No data is available for compliance score metrics

    • The environment is not using Sysdig Secure

    compliance.png
  • Sparkline Chart shows a "No Data" message if the metric in the widget doesn't have at least one data point for the time window selected in the time navigation bar.

    sparkline.png
  • Pie Chart shows a No Data message if:

    • Kubernetes jobs failed/succeeded metrics don't have any data

    • The environment is not using Kubernetes jobs in the infrastructure (a quick way to verify this is sketched below)
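
If the pie chart shows No Data, you can verify whether the cluster is actually running any Kubernetes Jobs. The sketch below is a minimal example using the Kubernetes Python client and assumes kubeconfig access; it simply lists Jobs and their succeeded/failed counts.

    from kubernetes import client, config

    config.load_kube_config()
    batch = client.BatchV1Api()

    jobs = batch.list_job_for_all_namespaces().items
    if not jobs:
        print("No Kubernetes Jobs found; the pie chart has nothing to display.")
    for job in jobs:
        succeeded = job.status.succeeded or 0
        failed = job.status.failed or 0
        print(f"{job.metadata.namespace}/{job.metadata.name}: succeeded={succeeded} failed={failed}")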

Clusters and Nodes Overview

This topic discusses some of the common Kubernetes issues related to Cluster and Nodes objects.

Node Ready Status

  • Green: indicates a healthy system. All nodes were ready during the selected time period.

  • Red: If you detect red (a node flapping or not ready for a certain amount of time), drill down to isolate the problem:

From the Cluster view, detect which node in the cluster is not ready or flapping. From the Nodes view, drill down deeper into the node itself to see the following:

  • Drill down to the Nodes Overview to see which nodes are not green. The node status should correlate with the color indicated.

  • Drill down to the Node Ready State Dashboard, which shows the status of each node over time. It helps isolate the nodes that are not ready and indicates what happened during the time they were not up. Below is a modified Kubernetes Node State dashboard; a programmatic sketch for checking node readiness follows the image.

kube_node_state.png
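
Outside the Sysdig UI, you can also confirm the current Ready condition of each node against the Kubernetes API. This is a minimal sketch using the official Python client, assuming kubeconfig access; it reports any node whose Ready condition is not True, along with the reported reason.

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        # Each node carries a list of conditions; "Ready" reflects kubelet health.
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is None or ready.status != "True":
            reason = ready.reason if ready else "ConditionMissing"
            since = ready.last_transition_time if ready else "unknown"
            print(f"{node.metadata.name}: NOT READY ({reason}) since {since}")
        else:
            print(f"{node.metadata.name}: Ready")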

Consider the following:

  • Resource availability questions.

    • Were there disk, memory, or network issues?

    • Which nodes were not ready?

    • Were there pod restarts on the nodes?

    • Any other events that may impact the node readiness status?

    • What were the CPU usage, memory used percent, network I/O, disk I/O, and top processes before the node became not ready?

    cpu_capacity.png

    This dashboard shows how many nodes are not ready.

  • Did the cluster have enough capacity to serve the nodes?

    • Which nodes were not ready?

    • When were they not ready?

    • What impacted node readiness?

    mem-capacity.png

Pods Available Vs Desired

Green: The number of pods available meets or exceeds the number desired.

Red: If fewer pods are available than desired, there is a problem and more capacity needs to be added.

Cluster level: Pods available vs desired, summed across all the namespaces in the cluster.

From the Cluster level:

  • Drill down to the Namespace Overview to see which namespaces show red for the Pods Available vs Desired status.

  • Drill down to the Cluster Pod Summary dashboard, which highlights the namespaces that have fewer pods than desired and also shows unavailable pods. A programmatic cross-check of available vs desired replicas is sketched after this list.

    • When a namespace has been identified, find out why there are not enough pods and whether it is a capacity issue. Look at the Namespace Capacity dashboard to see which workloads don’t have enough capacity.

    • Take a look at the control plane to see if pod scheduling latency is high or the queue is getting filled on the Kube API server dashboard.
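
To reproduce the Pods Available vs Desired comparison outside the UI, you can compare desired and available replica counts directly from the Kubernetes API. The sketch below uses the Python client (kubeconfig access assumed) and only covers Deployments; other controllers such as StatefulSets and DaemonSets expose similar status fields.

    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    for d in apps.list_deployment_for_all_namespaces().items:
        desired = d.spec.replicas or 0
        available = d.status.available_replicas or 0
        if available < desired:
            print(f"{d.metadata.namespace}/{d.metadata.name}: "
                  f"{available}/{desired} pods available")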

Memory Requests vs Allocatable

Green: Indicates that there is enough capacity in the cluster or node to serve all the resource requests by all the pods on the selected cluster or node.

Not green but < 100%: Indicates the cluster or node is approaching its maximum capacity. Drill down to figure out which pods are consuming more resources.

Greater than 100%: The requests might not be set correctly. Keep in mind that requests are guaranteed resources, not limits. For example, if a pod requests one CPU core and the machine has 8 idle CPU cores to spare, the pod is free to use all 8 of them. Drill down to figure out which pods do not have Kubernetes requests set.

cluster-node-capacity.png
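
As a rough cross-check of the Memory Requests vs Allocatable value, you can sum the memory requests of the pods scheduled on a node and compare the total with the node's allocatable memory. The sketch below uses the Kubernetes Python client with a simplified quantity parser that only understands plain bytes and the Ki/Mi/Gi suffixes; treat it as an approximation, not a substitute for the dashboard.

    from collections import defaultdict
    from kubernetes import client, config

    # Simplified parser for Kubernetes memory quantities (bytes, Ki, Mi, Gi only).
    UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

    def to_bytes(quantity):
        for suffix, factor in UNITS.items():
            if quantity.endswith(suffix):
                return float(quantity[:-len(suffix)]) * factor
        return float(quantity)  # plain bytes

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Sum memory requests of all containers, grouped by the node they run on.
    requested = defaultdict(float)
    for pod in v1.list_pod_for_all_namespaces().items:
        if not pod.spec.node_name:
            continue  # pending pods are not scheduled yet
        for c in pod.spec.containers:
            requests = c.resources.requests if c.resources else None
            mem = (requests or {}).get("memory")
            if mem:
                requested[pod.spec.node_name] += to_bytes(mem)

    for node in v1.list_node().items:
        allocatable = to_bytes(node.status.allocatable["memory"])
        pct = 100 * requested[node.metadata.name] / allocatable
        print(f"{node.metadata.name}: memory requests are {pct:.1f}% of allocatable")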

Once a pod has been identified as the top resource consumer, do the following:

  • Determine which namespace it belongs to and who owns it. To do so, select Explore, then select Namespaces > Deployments & Pods. Search for the pod name to see which namespace and Deployment/StatefulSet/DaemonSet it falls under.

  • Identify what’s going on in that pod. Go to the pod state dashboard to see what processes are consuming the resources.

Usage vs Requests

If not all pods have requests set, comparing usage vs requests leads to confusing numbers. In such cases:

  • Use the list of pods with requests configured to compute numbers.

  • State that N pods do not have requests configured.

mem-usage.png
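
A quick way to apply this guidance is to separate the pods that have memory requests configured from those that do not, so you know how many pods the usage vs requests comparison actually covers. This is a minimal sketch using the Kubernetes Python client (kubeconfig access assumed); a pod is counted as configured only if every one of its containers sets a memory request.

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    with_requests, without_requests = [], []
    for pod in v1.list_pod_for_all_namespaces().items:
        configured = all(
            c.resources and c.resources.requests and "memory" in c.resources.requests
            for c in pod.spec.containers
        )
        target = with_requests if configured else without_requests
        target.append(f"{pod.metadata.namespace}/{pod.metadata.name}")

    print(f"{len(with_requests)} pods can be included in usage vs requests")
    print(f"{len(without_requests)} pods do not have memory requests configured:")
    for name in without_requests:
        print(f"  {name}")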