
          Overview

          Overview leverages Sysdig’s unified data platform to monitor, secure, and troubleshoot your hosts and Kubernetes clusters and workloads.

          The module provides a unified view of the health, risk, and capacity of your Kubernetes infrastructure: a single pane of glass for host machines as well as Kubernetes Clusters, Nodes, Namespaces, and Workloads across multi-cloud and hybrid-cloud environments. You can easily filter by any of these entities and view associated events and health data.

          Overview shows metrics prioritized by event count and severity, allowing you to get to the root cause of the problem faster. Sysdig Monitor polls the infrastructure data every 10 minutes and refreshes the metrics and events on the Overview page with the system health.

          Key Benefits

          Overview provides the following benefits:

          • Show a unified view of the health, risk, resource use, and capacity of your infrastructure environment at scale

            • Render metrics, security events, compliance CIS benchmark results, and contextual events in a single location

            • Eliminate the need for stand-alone security, monitoring, and forensics tools

            • View data on-the-fly by workload or by infrastructure

          • Display contextual live event stream from alerts, Kubernetes, containers, policies, and image scanning results

          • Surface entities intelligently based on event count and severity

          • Drill down from Clusters to Nodes and Namespaces

          • Support infrastructure monitoring of multi-cloud and hybrid-cloud environments

          • Expose relevant information based on core operational users:

            • DevOps / Platform Ops

            • Security Analyst

            • Service Owner

          Accessing the Overview User Interface

          You can access and set the scope of Overview in the Sysdig Monitor UI or directly by URL:

          Click Overview in the left navigation, then select one of the Kubernetes entities:

          About the Overview User Interface

          The Overview interface opens to the Cluster Overview page. This section describes the major components of the interface and the navigation options.

          Overview Rows

          Each row represents a Kubernetes entity: a cluster, node, namespace, or workload. In the screenshot above, each row shows a Kubernetes cluster.

          • Navigating rows is easy

            Click on the Overview icon in the left navigation and choose an Overview page, or drill down into the next Overview page to explore the next granular level of data. Each Overview page shows 10 rows by default and a maximum of 100 rows. Click Load More to display additional rows if there are more than 10 rows per page.

          • Ability to select a specific row in an Overview page

            Each row carries the scope of the entity it shows data for. Clicking a specific row deselects the rest of the rows (for instance, selecting staging deselects all other rows in the screenshot above) to focus on the scope of the selected entity, including the events scoped to that row. The Live badge also changes to Paused, meaning the rows will no longer be updated as new data comes in. Pausing to focus on a single row gives you a snapshot of the selected entity up to the current moment.

          • Entities are listed according to the severity and number of events detected in them, not by how new the events are

            Rows are sorted by the count and severity level of the events associated with the entity and are displayed in descending order. The items with the highest number of high severity events are shown first, followed by medium, low, and info. This organization helps to highlight events demanding immediate attention and to streamline troubleshooting efforts, in environments that may include thousands of entities.
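
          The ordering can be sketched as a lexicographic sort on per-severity event counts. A minimal sketch (the entity names, counts, and ranking function below are illustrative, not Sysdig's actual implementation):

```python
# Hypothetical event counts per entity, keyed by severity.
entities = {
    "staging": {"high": 3, "medium": 1, "low": 0, "info": 2},
    "prod":    {"high": 5, "medium": 0, "low": 1, "info": 0},
    "dev":     {"high": 0, "medium": 4, "low": 2, "info": 1},
}
severity_order = ["high", "medium", "low", "info"]

def sort_key(name):
    # Compare high-severity counts first, then medium, low, and info.
    # Negating the counts makes an ascending sort produce descending order.
    return tuple(-entities[name][s] for s in severity_order)

print(sorted(entities, key=sort_key))  # ['prod', 'staging', 'dev']
```

          Entities with the most high-severity events sort first; ties fall through to the next severity level.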

          Scope Editor

          The Scope Editor lets you target a specific entity, such as a particular workload or namespace, in environments that may include thousands of entities. The levels of scope, determined by the Kubernetes hierarchy, progress from Workload up to Cluster, with Cluster at the top. In smaller environments, using the Scope Editor is equivalent to clicking a single row in an Overview page where no scope has been applied.

          Cluster: The highest level in the hierarchy. The only scope applied to the page is Cluster. It allows you to select a specific cluster from a list of available ones.

          Node: The second level in the hierarchy. The scope is determined by Cluster and Node. Selection is narrowed down to a specific node in a selected cluster.

          Namespace: The third level in the hierarchy. The scope is determined by Cluster and Namespace. Selection is narrowed down to a specific namespace in a selected cluster.

          Workloads: The last entity in the hierarchy. The scope is initially determined by Cluster and Namespace, then the selection is narrowed to a specific Deployment, Service, or StatefulSet. Choosing all three options at once is not allowed.

          Time Navigation

          The Overview feature is based around time. Sysdig Monitor polls the infrastructure data every 10 minutes and refreshes the metrics and events on the Overview page with the system health. You select how to view this gathered data by choosing a Preset interval and a time Range.

          Presets

          Presets are a way of visualizing data that Sysdig Monitor gathers every 10 minutes. A preset that is 10 minutes or less is refreshed every 30 seconds. A preset that is greater than 10 minutes is refreshed every 1 minute. Select a preset to determine the data sample to be displayed. Overview supports the following presets:

          • 1 Hour: Data polled for the last one hour. This is the default value.

          • 6 Hours: Data polled for the last six hours.

          • 1 Day: Data polled for the last day.

          Presets work in conjunction with Range selections. Selecting a particular preset interval refreshes Range selection and reloads the Overview rows and events subsequently. For example:

          10 Minutes: Resets the Range to Jul 9, 2:20 pm - Jul 9, 2:30 pm.

          6 Hours: Resets the Range to Jul 9, 8:30 am - Jul 9, 2:30 pm.

          1 Day: Resets the Range to Jul 8, 2:30 pm - Jul 9, 2:30 pm.

          Because metrics and events are refreshed every 10 minutes on the Overview page, if you stay on the page for more than 10 minutes, the data is updated to show the newly computed values.

          Presets are global throughout the Sysdig Monitor interface. For example, if you select 10 minutes in the Explore view, the Overview preset will also be 10 minutes, and vice versa. Choosing a Preset in Explore that Overview does not support falls back to 1 day in Overview.

          Range

          Range shows the date and time interval, with the selected Preset in parentheses. The Range indicated in the UI is determined by the Preset. By default, it covers the last 1 hour, ending at the current date and time. See Presets to understand how Range works with Presets.

          Time Format

          Overview supports UTC and PDT time formats. Use the toggle button next to Range to change the time format for the slot shown in Range. The default is PDT.

          Live

          The Live badge shows if the feed (Overview rows with data) is Live or Paused.

          • Live: the data is continuously updating based on the 10-minute polling of the Sysdig back end. The Overview feed is normally always Live.

          • Paused: When a specific row is selected, the data refresh pauses and the rows will not be updated with new data coming in.

          Unified Stream of Events

          The right panel of Overview provides a context-sensitive events feed.

          Click an overview row to see relevant Events on the right. Each event is intelligently populated with end-to-end metadata to give context and enable troubleshooting.

          Event Types

          Overview renders the following event types:

          • Alert: See Alerts.

          • Custom: Ensure that Custom labels are enabled to view this type of events.

          • Containers: Events associated with containers.

          • Kubernetes: Events associated with Kubernetes infrastructure.

          • Scanning: See Image Scanning.

          • Policy: See Policies.

          Event Statuses

          Overview renders the following alert-generated event statuses:

          • Triggered: The alert condition has been met and still persists.

          • Resolved: A previously existing alert condition no longer persists.

          • Acknowledged: The event has been acknowledged by the intended recipient.

          • Un-acknowledged: The event has not been acknowledged by an intended recipient. All events are marked as Un-acknowledged by default.

          • Silenced: The alert event has been silenced for a specified scope. No alert notification will be sent out to the channels during the silenced window.

          General Guidelines

          First-Time Usage

          • When an environment is created for the first time, Sysdig Monitor fetches data and generates the associated pages. The Overview feature is enabled immediately, but you might wait up to 1 hour before the Overview pages show the necessary data.

          • Overview uses time windows in segments of 1H, 6H, and 1D; wait 1H, 6H, and 1D respectively to see data on the corresponding Overview pages.

          • If enough data is not available during the first hour, the “No Data Available” page is displayed until the first hour passes.

          Tuning Overview Data

          Sysdig Monitor leverages a caching mechanism to fetch pre-computed data for the Overview screens.

          If pre-computed data is unavailable, the data fetched is non-computed and must be calculated before it can be displayed, which adds delay. Caching is enabled for Overview, but for optimum performance you must wait for the 1H, 6H, and 1D windows the first time you use Overview. After that time has passed, the data is automatically cached with every passing minute.

          Enabling Overview for On-Prem Deployments

          The Overview feature is not available by default on On-Prem deployments. Use the following API to enable it:

          1. Get the Beta settings as follows:

            curl -X GET 'https://<Sysdig URL>/api/on-prem/settings/overviews' \
            -H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
            -H 'X-Sysdig-Product: SDC' -k
            

            Replace <Sysdig URL> with the Sysdig URL associated with your deployment and <GLOBAL_SUPER_ADMIN_SDC_TOKEN> with the SDC token associated with your deployment.

          2. Copy the payload and change the desired values in the settings.

          3. Update the settings as follows:

            curl -X PUT 'https://<Sysdig URL>/api/on-prem/settings/overviews' \
            -H 'Authorization: Bearer <GLOBAL_SUPER_ADMIN_SDC_TOKEN>' \
            -H 'X-Sysdig-Product: SDC' -k \
            -d '{  "overviews": true,  "eventScopeExpansion": true}'
            

          Feature Flags

          • overviews: Set overviews to true to enable the backend components and the UI.

          • eventScopeExpansion: Set eventScopeExpansion to true to enable scope expansion for all the Event types.

          Clusters Data

          This topic discusses the Clusters Overview page and helps you understand its gauge charts and the data displayed on them.

          About Clusters Overview

          In Kubernetes, a pool of nodes combine their resources to form a more powerful machine: a cluster. The Clusters Overview page provides key metrics indicating the health, risk, capacity, and compliance of each cluster. Your clusters can reside in any cloud or multi-cloud environment of your choice.

          Each row in the Clusters page represents a cluster. Clusters are sorted by the severity of corresponding events in order to highlight the area that needs attention. For example, a cluster with high severity events is bubbled up to the top of the page to highlight the issue. You can further drill down to the Nodes or Namespaces Overview page for investigating at each level.

          In environments where Sysdig Secure is not enabled, Network I/O is shown instead of the Compliance score.

          Interpret the Cluster Data

          This topic gives insight into the metrics displayed on the Clusters Overview screen.

          Node Ready Status

          The chart shows the latest value returned by avg(min(kubernetes.node.ready)).

          What Is It?

          The number shows the readiness for nodes to accept pods across the entire cluster. The numeric availability indicates the percentage of time the nodes are reported as ready by Kubernetes. For example:

          • 100% is displayed when 10 out of 10 nodes are ready for the entire time window, say, for the last one hour.

          • 95% is displayed when 9 out of 10 nodes are ready for the entire time window and one node is ready only for 50% of the time.
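
          The percentage is a time-weighted average of per-node readiness, as in the second example above. A minimal sketch of the arithmetic (the node fractions are illustrative):

```python
# Fraction of the time window each node reported Ready:
# 9 nodes fully ready, 1 node ready only half the time.
node_ready_fractions = [1.0] * 9 + [0.5]

# Cluster readiness is the average across nodes, shown as a percentage.
readiness_pct = 100 * sum(node_ready_fractions) / len(node_ready_fractions)
print(readiness_pct)  # 95.0
```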

          The bar chart displays the trend across the selected time window, and each bar represents a time slice. For example, selecting the last 1-hour window displays 6 bars, each indicating a 10-minute time slice. Each bar represents the availability across the time slice (green) or the unavailability (red).

          For instance, the following image shows an average availability of 80% across the last 1-hour, and each 10-minute time slice shows a constant availability for the same time window:

          What to Expect?

          Expect a constant 100% at all times.

          What to Do Otherwise?

          If the value is less than 100%, determine whether a node is not available at all, or one or more nodes are partially available.

          • Drill down either to the Nodes screen in Overview or to the “Kubernetes Cluster Overview” in Explore to see the list of nodes and their availability.

          • Check the Kubernetes Node Overview dashboard in Explore to identify the problem that Kubernetes reports.

          Pods Available vs Desired

          The chart shows the latest value returned by sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count)).

          What Is It?

          The chart displays the ratio between available and desired pods, averaged across the selected time window, for all the pods in a given Cluster. The upper bound shows the number of desired pods in the Cluster.

          For instance, the following image shows 42 desired pods are available to use:
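
          The gauge divides the two sums rather than averaging per-namespace ratios. A rough sketch using the 42-pod example (the namespace names and counts are hypothetical):

```python
# Hypothetical (available, desired) pod counts per namespace in one cluster.
namespaces = {"frontend": (20, 20), "backend": (15, 16), "batch": (5, 6)}

available = sum(a for a, _ in namespaces.values())
desired = sum(d for _, d in namespaces.values())

print(desired)                           # upper bound shown on the chart: 42
print(round(100 * available / desired))  # 95
```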

          What to Expect?

          You should typically expect 100%.

          If certain pods take a long time to become available, you might temporarily see a value less than 100%. Pulling images, pod initialization, readiness probes, and so on cause such delays.

          What to Do Otherwise?

          Identify one or more Namespaces that have lower availability. To do so, drill down to the Namespaces screen, then drill down to the Workloads screen to identify the unavailable pods.

          If the number of unavailable pods is considerably higher (the ratio is significantly low), check the status of the Nodes. A Node failure will cause several pods to become unavailable across most of the Namespaces.

          Several factors could cause pods to get stuck in the Pending state:

          • Pods make requests for resources that exceed what’s available across the nodes (the remaining allocatable resources).

          • Pods make requests higher than the availability of every single node. For example, you have 8-core Nodes and you create a pod with a 16-core request. These pods might require reconfiguration and specific setup related to Node affinity and anti-affinity constraints.

          • The Namespace quota is reached before a high resource request is made.

            If a quota is enforced at the Namespace level, you may hit the limit independent of the resource availability across the Nodes.

          CPU Requests vs Allocatable

          The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores)).

          What Is It?

          The chart displays the ratio between CPU requests configured for all the pods in a selected Cluster and allocatable CPUs across all the nodes.

          The upper bound shows the number of allocatable CPU cores across all the nodes in the Cluster.

          For instance, the image below shows that out of 620 available CPU cores across all the nodes (allocatable CPUs), 71% is requested by the pods:

          What to Expect?

          Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

          Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the bound is 90% if you have 10 nodes. Staying below this bound protects you against a node becoming unavailable.
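
          Under the stated assumption of equally sized nodes, the suggested bound can be computed as:

```python
def request_headroom_bound(node_count: int) -> float:
    """Upper bound (%) on total requests that still leaves room to
    reschedule everything if one node becomes unavailable."""
    return (node_count - 1) / node_count * 100

print(round(request_headroom_bound(10)))  # 90
print(round(request_headroom_bound(4)))   # 75
```

          With few nodes the bound drops quickly, so small clusters need proportionally more headroom.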

          What to Do Otherwise?

          A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

          • Applications cannot be scheduled to be run.

          • Pods might not start and remain in a Pending/Unscheduled state.

          To triage, do the following:

          • Drill down to the Nodes screen to get insights into how resources are utilized across all nodes.

          • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

          • Drill down to Explore and refer to the following dashboards:

            • Kubernetes CPU Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

            • Kubernetes Workloads CPU Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

          Can the Value Be Higher than 100%?

          Currently, the ratio accounts only for scheduled pods; pending pods are excluded from the calculation. Because Kubernetes schedules pods only on Nodes with enough allocatable resources to fulfill the request, the ratio cannot be higher than 100%.

          In the case of over-commitment (pods requesting more resources than what’s available), you can expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. This indicates that most of the available resources are being used, and what’s left is not enough to schedule additional pods. Therefore, the Pods Available vs Desired ratio will decrease.

          When your environment has pods that are updated often or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they are at any given time. Consequently, the aggregated ratio across the selected time window becomes higher, and you might see a value above 100%. This artifact is due to how the data engine calculates the aggregated ratio.

          Drill down to Kubernetes Cluster Overview to see the CPU Cores Usage vs Requests vs Allocatable time series to correctly evaluate the trend of the request commitments.

          Listed below are some of the factors that could cause pods to get stuck in a Pending state:

          • Pods make requests that exceed what’s available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

          • Pods make requests that are higher than the availability of every single Node. For example, you have 8-core Nodes and you create a pod with a 16-core request. These pods might require reconfiguration and specific setup related to Node affinity and anti-affinity constraints.

          • The Quota set at the Namespace level is reached before a request is configured. The Requests vs Allocatable ratio may not suggest the problem, but the Pods Available vs Desired ratio would decrease, especially for the specific Namespaces. See the Namespaces screen in Overview.

          Memory Requests vs Allocatable

          The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes)).

          What Is It?

          The chart displays the ratio between memory requests configured for all the pods in the Cluster and allocatable memory available across all the Nodes.

          The upper bound shows the allocatable memory available across all Nodes. The value is expressed in bytes, displayed in a specified unit.

          For instance, the image below shows that out of 29.7 GiB available across all Nodes (allocatable memory), 35% is requested by the pods:

          What to Expect?

          Your resource utilization strategy determines what ratio you can expect. A healthy ratio falls between 50% and 80%.

          Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, 90% if you have 10 nodes. Staying below this bound protects your system against a node becoming unavailable.

          What to Do Otherwise?

          A lower ratio indicates under-utilized resources (and corresponding cost) in your infrastructure. A higher ratio indicates insufficient resources. As a result:

          • Applications cannot be scheduled to be run.

          • Pods might not start and remain in a Pending/Unscheduled state.

          To troubleshoot, do the following:

          • Drill down to the Nodes screen to get insights into how resources are utilized across all the Nodes.

          • Drill down to the Namespaces screen to understand how resources are requested across Namespaces.

          • Drill down to Explore and refer to the following dashboards:

            • Kubernetes Memory Allocation Optimization: Evaluate whether a significant amount of resources are under-utilized in the infrastructure.

            • Kubernetes Workloads Memory Usage and Allocation: Determine whether pods are properly configured and are using resources as expected.

          Can the Value be Higher than 100%?

          The ratio currently accounts only for scheduled pods; pending pods are excluded from the calculation. Because Kubernetes schedules pods only on Nodes with enough allocatable resources to fulfill the request, the ratio cannot be higher than 100%.

          In the case of over-commitment (pods requesting more resources than what’s available), expect a higher Requests vs Allocatable ratio and a lower Pods Available vs Desired ratio. This indicates that most of the available resources have been used and what’s left is not enough to schedule additional pods. Therefore, the Pods Available vs Desired ratio will decrease.

          When your environment has pods that are updated often or that are deleted and created often (for example, testing Clusters), the total requests might appear higher than they are at any given time. Consequently, the aggregated ratio across the selected time window becomes higher, and you might see a value above 100%. This artifact is due to how the data engine calculates the aggregated ratio.

          Drill down to Kubernetes Cluster Overview to see the Memory Requests vs Allocatable time series to correctly evaluate the trend for the request commitments.

          Listed below are some of the factors that could cause your pods to get stuck in a Pending state:

          • Pods make requests that exceed what’s available across the nodes (the remaining allocatable resources). The Requests vs Allocatable ratio is an indicator of this issue.

          • Pods make requests that are higher than the availability of every single Node. For example, you have 8-core nodes and you create a pod with a 16-core request. These pods might require configuration changes and specific setup related to node affinity and anti-affinity factors.

          • The quota set at the Namespace level is reached before a high request is configured. The Requests vs Allocatable ratio might not suggest the problem, but the Pods Available vs Desired ratio would decrease, especially for the affected Namespaces. See the Namespaces screen in Overview.

          Compliance Score

          Docker: The latest value returned by avg(avg(compliance.docker-bench.pass_pct)).

          Kubernetes: The latest value returned by avg(avg(compliance.k8s-bench.pass_pct)).

          What Is It?

          The numbers show the percentage of benchmarks that succeeded in the selected time window, respectively for Docker and Kubernetes entities.

          What to Expect?

          If you do not have Sysdig Secure enabled, or you do not have benchmarks scheduled, then you should expect no data available.

          Otherwise, the higher the score, the more compliant your infrastructure is.

          What to Do Otherwise?

          If the score is lower than expected, drill down to Docker Compliance Report or Kubernetes Compliance Report to see further details about benchmark checks and their results.

          You may also want to use the Benchmarks / Results page in Sysdig Secure to see the history of checks.

          Nodes Data

          This topic discusses the Nodes Overview page and helps you understand its gauge charts and the data displayed on them.

          About Nodes Overview

          A node refers to a worker machine in Kubernetes. A physical machine or VM can represent a node. The Nodes Overview page provides key metrics indicating the health, capacity, and compliance of each node in your cluster.

          In environments where Sysdig Secure is not enabled, Network I/O is shown instead of the Compliance score.

          Interpret the Nodes Data

          This topic gives insight into the metrics displayed on the Nodes Overview page.

          Node Ready Status

          The chart shows the latest value returned by avg(min(kubernetes.node.ready)).

          What Is It?

          The number expresses the Node readiness to accept pods across the Cluster. The numeric availability indicates the percentage of time the Node is reported ready by Kubernetes. For example:

          • 100% is displayed when a Node is ready for the entire time window, say, for the last one hour.

          • 95% is displayed when the Node is ready for 95% of the time window, say, 57 out of 60 minutes.

          The bar chart displays the trend across the selected time window, and each bar represents a time slice. For example, selecting “last 1 hour” displays 6 bars, each indicating a 10-minute time slice. Each bar shows the availability across the time slice (green) and the unavailability (red).

          For instance, the image below indicates the Node has not been ready for the entire last 1-hour time window:

          What to Expect?

          The chart should show a constant 100% at all times.

          What to Do Otherwise?

          If the number is less than 100%, review the status reported by Kubernetes. Drill down to the Kubernetes Node Overview Dashboard in Explore to see details about the Node readiness:

          If the Node Ready Status alternates, as shown in the image, the Node is flapping. Flapping indicates that the kubelet is not healthy. Check the specific conditions reported by Kubernetes, such as network issues and memory pressure, to determine why the Node is not ready.

          Pods Ready vs Allocatable

          The chart reports the latest value of sum(avg(kubernetes.pod.status.ready)) / avg(avg(kubernetes.node.allocatable.pods)).

          What Is It?

          It is the ratio between ready and allocatable pods on the Node, averaged across the selected time window.

          The Clusters page includes a similar chart named Pods Available vs Desired. However, the meaning is different:

          • The Pods Available vs Desired chart for Clusters highlights how many pods you expect and how many are actually available. See IsPodAvailable for a detailed definition.

          • The Pods Ready vs Allocatable chart for Nodes indicates how many pods can be scheduled on each Node and how many are actually ready.

          The upper bound shows the number of pods you can allocate in the node. See node configuration.

          For instance, the image below indicates that you can allocate 110 pods in the Node (default configuration), but only 11 pods are ready:
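
          Using the numbers from that example, the density shown by the gauge is simply:

```python
# Ready pods on the node vs the kubelet's allocatable pod limit
# (110 is the Kubernetes default max-pods setting).
ready_pods = 11
allocatable_pods = 110

print(round(100 * ready_pods / allocatable_pods))  # 10
```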

          What to Expect?

          The ratio does not relate to resource utilization; it measures the pod density on each Node. The more pods you have on a single node, the more effort the kubelet has to put in to manage the pods, the routing mechanism, and Kubernetes overall.

          Provided the allocatable limit is properly set, values lower than 80% indicate a healthy status.

          What to Do Otherwise?

          • Review the default maximum pods configuration of the kubelet to allow more pods, especially if CPU and memory utilization is healthy.

          • Add more nodes to allow more pods to be scheduled.

          • Review kubelet process performance and Node resource utilization in general. A higher ratio indicates high pressure on the operating system and on Kubernetes itself.

          CPU Requests vs Allocatable

          The chart shows the latest value returned by sum(avg(kubernetes.pod.resourceRequests.cpuCores)) / sum(avg(kubernetes.node.allocatable.cpuCores)).

          What Is It?

          The chart shows the ratio between the number of CPU cores requested by the pods scheduled on the Node and the number of cores available to pods. The upper bound shows the CPU cores available to pods, which corresponds to the user-defined configuration for allocatable CPU.

          For instance, the image below shows that the Node has 16 CPU cores available, out of which, 84% are requested by the pods scheduled on the Node:

          What to Expect?

          Expect a value up to 80%.

          Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, 90% if you have 10 nodes. Staying below this bound protects your system against a Node becoming unavailable.

          What to Do Otherwise?

          • A low ratio indicates the Node is underutilized. Drill up to the corresponding cluster in the Clusters page to determine whether the number of pods currently running is lower, or if the pods cannot run for other reasons.

          • A high ratio indicates a potential risk of being unable to schedule additional pods on the Node.

            Drill down to the Kubernetes Node Overview Dashboard to evaluate what Namespaces, Workloads, and pods are running. Additionally, drill up in the Clusters page to evaluate whether you are over-committing the CPU resource. You might not have enough resources to fulfill requests, and consequently, pods might not be able to run on the Node. Consider adding Nodes or replacing Nodes with additional CPU cores.

          Can the Value Be Higher than 100%?

          Kubernetes schedules pods on Nodes where sufficient allocatable resources are available to fulfill the pod request. This means Kubernetes does not allow having a total request higher than the allocatable. Consequently, the ratio cannot be higher than 100%.

          Over-committing (pods requesting more resources than the capacity) results in a high Requests vs Allocatable ratio at the Node level and a low Pods Available vs Desired ratio at the Cluster level. This indicates that most of the allocatable resources are already requested, so what remains is not sufficient to schedule additional pods, and the Pods Available vs Desired ratio decreases as a result.

          Memory Requests vs Allocatable

          The chart highlights the latest value returned by sum(avg(kubernetes.pod.resourceRequests.memBytes)) / sum(avg(kubernetes.node.allocatable.memBytes)).

          What Is It?

          The chart shows the ratio between the bytes of memory requested by the pods scheduled on the Node and the bytes of memory available to pods. The upper bound shows the memory available to pods, which corresponds to the user-defined allocatable memory configuration.

          For instance, the image below indicates the node has 62.8 GiB of memory available, out of which, 37% is requested by the pods scheduled on the Node:

          What to Expect?

          A healthy ratio falls under 80%.

          Assuming all the nodes have the same amount of allocatable resources, a reasonable upper bound is the value of (node_count - 1) / node_count x 100. For example, the ratio is 90% if you have 10 nodes. Keeping the ratio below this bound protects your system against a node becoming unavailable.

          What to Do Otherwise?

          • A low ratio indicates that the Node is underutilized. Drill up to the corresponding cluster in the Clusters page to determine whether the number of pods running is low, or if pods cannot run for other reasons.

          • A high ratio indicates a potential risk of being unable to schedule additional pods on the node.

            • Drill down to the Kubernetes Node Overview dashboard to evaluate what Namespaces, Workloads, and pods are running.

            • Additionally, drill up in the Clusters page to evaluate whether you are over-committing the memory resource. If so, you don’t have enough resources to fulfill requests, and pods might not be able to run. Consider adding nodes or replacing them with nodes that have more memory.

          Can the Value be Higher than 100%?

          Kubernetes schedules pods on nodes where sufficient allocatable resources are available to fulfill the pod request. This means Kubernetes does not allow having a total request higher than the allocatable. Consequently, the ratio cannot be higher than 100%.

          Over-committing (pods requesting more resources than are available) results in a high Requests vs Allocatable ratio at the Node level and a low Pods Available vs Desired ratio at the Cluster level. This indicates that most of the resources are already requested, so what remains is not sufficient to schedule additional pods, and the Pods Available vs Desired ratio decreases as a result.

          Network I/O

          The chart shows the latest value returned by avg(avg(net.bytes.total)).

          What Is It?

          The sparkline shows the trend of network traffic (inbound and outbound) for a Node. The number indicates the most recent rate, expressed in bytes per second.

          For reference, the sparklines show the following number of steps (sampling):

          • Last hour: 6 steps, each for a 10-minute time slice

          • Last 6 hours: 12 steps, each for a 30-minute time slice

          • Last day: 12 steps, each for a 2-hour time slice

          What to Expect?

          This metric depends heavily on the type of applications running on the Node. You should expect some network activity for Kubernetes-related operations.

          Drilling down to the Kubernetes Node Overview Dashboard in Explore will provide additional details, such as network activity across pods.

          3 - Namespaces Data

          This topic discusses the Namespaces Overview page and helps you understand its gauge charts and the data displayed on them.

          About Namespaces Overview

          Namespaces are virtual clusters on a physical cluster. They provide logical separation between the teams and their environments. The Namespaces Overview page provides key metrics indicating the health, capacity, and performance of each Namespace in your cluster.

          Interpret the Namespaces Data

          This topic gives insight into the metrics displayed on the Namespaces Overview screen.

          Pod Restarts

          The chart highlights the latest value returned by avg(timeAvg(kubernetes.pod.restart.rate)).

          What Is It?

          The sparkline shows the trend of pod restarts rate across all the pods in a selected Namespace. The number shows the most recent rate of restarts per second.

          For instance, the image shows a rate of 0.04 restarts per second for the last 2-hour time slice, given that the selected time window is one day. The trend also suggests a non-flat pattern (periodic crashes).

          For reference, the sparklines show the following number of steps (sampling):

          • Last hour: 6 steps, each for a 10-minute time slice

          • Last 6 hours: 12 steps, each for a 30-minute time slice

          • Last day: 12 steps, each for a 2-hour time slice

          What to Expect?

          Expect 0 restarts for any pod.

          What to Do Otherwise?

          A few restarts across the last hour or larger time windows might not indicate a serious problem. In the event of a restart loop, identify the root cause as follows:

          • Drill down to the Workloads page in Overview to identify the Workloads that have been stuck in a restart loop.

          • Drill down to the Kubernetes Namespace Overview to see a detailed trend broken down by pods:

          Pods Available vs Desired

          The chart shows the latest value returned by sum(avg(kubernetes.namespace.pod.available.count)) / sum(avg(kubernetes.namespace.pod.desired.count)).

          What Is It?

          The chart displays the ratio between available and desired pods, averaged across the selected time window, in a given Namespace.

          The upper bound shows the number of desired pods in the namespace.

          For instance, the image below shows that all 42 desired pods are available:

          What to Expect?

          Expect 100% on the chart.

          If certain pods take a significant amount of time to become available due to delays (image pull time, pod initialization, readiness probes), you might temporarily see a ratio lower than 100%.

          What to Do Otherwise?

          • Identify one or more Workloads that have low availability by drilling down to the Workloads page.

          • Once you identify the Workload, drill down to the related dashboard in Explore. For example, Kubernetes Deployment Overview to determine the trend and the state of the pods.

            For instance, in the following image, the ratio is 98% (3.93 / 4 x 100). The decline is due to an update that caused pods to be terminated and consequently to be started with a newer version.
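            The worked ratio above is plain arithmetic; a minimal Python sketch (the helper name is hypothetical):

```python
def availability_pct(available_avg: float, desired: int) -> float:
    """Pods Available vs Desired ratio in percent, using the average
    number of available pods over the selected time window."""
    return available_avg / desired * 100

print(round(availability_pct(3.93, 4)))  # 98
print(availability_pct(4, 4))            # 100.0
```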

          CPU Used vs Requests

          The chart shows the latest value returned by sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores)).

          What Is It?

          The chart shows the ratio between the total CPU usage across all the pods in the Namespace and the total CPU requested by all the pods.

          The upper bound shows the total CPU requested by all the pods. The value is expressed as the number of CPU cores.

          For instance, the image below shows that the pods in a Namespace request 40 CPU cores, of which only 43% are being used (about 17 cores):

          What to Expect?

          The value you see depends on the type of Workloads running in the Namespace.

          Typically, values between 80% and 120% are considered healthy. Values higher than 100% are considered healthy only for a relatively short amount of time.

          For applications whose resource usage is constant (such as background processes), expect the ratio to be close to 100%.

          For “bursty” applications, such as an API server, expect the ratio to be less than 100%. Note that this value is averaged for the selected time window, therefore, a usage spike would be compensated by an idle period.

          What to Do Otherwise?

          A low usage indicates that the application is not properly running (not executing the expected functions) or the Workload configuration is not accurate (requests are too high compared to what the pods actually need).

          A high usage indicates that the application is operating with a heavy load or the workload configuration is not accurate (requests are too low compared to what pods actually need).

          In either case, drill down to the Workloads page to determine the workload that requires a deeper analysis.

          Can the Value Be Higher than 100%?

          Yes, it can.

          • You can configure requests without limits, or requests lower than the limits. In either case, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

          • Consider a Namespace with two Workloads with one pod each. Say, one Workload is configured to request 1 CPU core and uses 1 CPU core (Used vs Requests ratio is 100%). The other Workload is configured without any request and uses 1 CPU core. In this example, the ratio of 2 CPU cores used to 1 CPU core requested at the Namespace level is 200%.
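          The second scenario can be checked with a quick calculation (a sketch with a hypothetical helper, not a Sysdig API):

```python
def used_vs_requests_pct(used_cores, requested_cores):
    """Namespace-level CPU Used vs Requests ratio, in percent.
    Pods without a request add to usage but not to the denominator."""
    return sum(used_cores) / sum(requested_cores) * 100

# Workload A requests 1 core and uses 1 core; Workload B has no request
# and uses 1 core: 2 cores used vs 1 core requested.
print(used_vs_requests_pct([1.0, 1.0], [1.0, 0.0]))  # 200.0
```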

          Memory Used vs Requests

          The chart shows the latest value returned by sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes)).

          What Is It?

          The chart shows the ratio between the total memory usage across all pods of the Namespace and the total memory requested by all pods.

          The upper bound shows the total memory requested by all the pods, expressed in bytes (with an appropriate unit).

          For instance, the image below shows that all the pods in the Namespace request 120 GiB, of which only 24% is being used (about 29 GiB):

          What to Expect?

          It depends on the type of Workloads you run in the Namespace. Typically, values that fall between 80% and 120% are considered healthy.

          Values higher than 100% are considered normal for a relatively short amount of time.

          What to Do Otherwise?

          A low usage indicates the application is not properly running (not executing the expected functions) or the workload configuration is not accurate (high requests compared to what the pods actually need).

          A high usage indicates the application is operating under a heavy load or the Workload configuration is not accurate (requests are too low compared to what the pods actually need).

          Given the configured limits for the Workloads and the memory pressure on the nodes, if the Workloads use more memory than what’s requested they are at risk of eviction. See Exceed a Container’s Limit for more information.

          In both cases, you may want to drill down to the Workloads page to determine which Workload requires a deeper analysis.

          Can the Value Be Higher than 100%?

          Yes, it can.

          • You can configure requests without limits, or requests lower than the limits. In either case, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

          • Consider a Namespace with two Workloads with one pod each. Say, one Workload is configured to request 1 GiB of memory and uses 1 GiB (Used vs Requests ratio is 100%). The other Workload is configured without any request and uses 1 GiB. In this example, the ratio of 2 GiB of memory used to 1 GiB requested at the Namespace level is 200%.

          Network I/O

          The chart shows the latest value returned by avg(avg(net.bytes.total)).

          What Is It?

          The sparkline shows the trend of network traffic (inbound and outbound) for all the pods in the Namespace. The number shows the most recent rate, expressed in bytes per second.

          For reference, the sparklines show the following number of steps (sampling):

          • Last hour: 6 steps, each for a 10-minute time slice

          • Last 6 hours: 12 steps, each for a 30-minute time slice

          • Last day: 12 steps, each for a 2-hour time slice
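          The slice widths above follow from dividing the selected time window by the number of steps; a minimal sketch (helper name is hypothetical):

```python
def slice_width_minutes(window_minutes: int, steps: int) -> float:
    """Width of each sparkline step for a given time window."""
    return window_minutes / steps

print(slice_width_minutes(60, 6))        # 10.0 -> last hour
print(slice_width_minutes(6 * 60, 12))   # 30.0 -> last 6 hours
print(slice_width_minutes(24 * 60, 12))  # 120.0 -> last day (2 hours)
```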

          What to Expect?

          The type of applications running in the Namespace determines this metric. Drilling down to the Kubernetes Namespace Overview Dashboard in Explore provides additional details, such as network activity across pods.

          4 - Workloads Data

          This topic discusses the Workloads Overview page and helps you understand its gauge charts and the data displayed on them.

          About Workloads Overview

          Workloads, in Kubernetes terminology, are your containerized applications. Workloads comprise Deployments, StatefulSets, and DaemonSets within a Namespace.

          In a Cluster, worker nodes run your application workloads, whereas the master node provides the core Kubernetes services and orchestration for application workloads. The Workloads Overview page provides the key metrics indicating health, capacity, and compliance.

          Interpret the Workloads Data

          This topic gives insight into the metrics displayed on the Workloads Overview page.

          Pod Restarts

          The chart displays the latest value returned by sum(timeAvg(kubernetes.pod.restart.rate)).

          What Is It?

          The sparkline shows the trend of Pod Restarts rate across all the pods in a selected Workload. The number shows the most recent rate, expressed in Restarts per Second.

          For instance, the image below shows the trend for the last hour. The number indicates that the rate of pod restarts is less than 0.01 for the last 10 minutes.

          For reference, the sparklines show the following number of steps (sampling):

          • Last hour: 6 steps, each for a 10-minute time slice.

          • Last 6 hours: 12 steps, each for a 30-minute time slice.

          • Last day: 12 steps, each for a 2-hour time slice.

          What to Expect?

          A healthy pod will have 0 restarts at any given time.

          What to Do Otherwise?

          In most cases, a few restarts in the last hour (or larger time windows) do not indicate a serious problem. Drill down to the Kubernetes Overview Dashboard related to the Workload in Explore. For example, Kubernetes StatefulSet Overview provides a detailed trend broken down by pods.

          In this example, the number of restarts is constant (roughly every 5 minutes) and no pods are ready. This might indicate a crash loop back-off.

          Pods Available vs Desired

          The chart shows the latest value returned by sum(avg(kubernetes.deployment.replicas.available)) / sum(avg(kubernetes.deployment.replicas.desired)).

          What Is It?

          The chart displays the ratio between available and desired pods, averaged across the selected time window, for all the pods in a given Workload.

          The upper bound shows the number of desired pods in the Workload.

          For instance, the image below shows that all 42 desired pods are available.

          What to Expect?

          You should typically expect 100%.

          If certain pods take a significant amount of time to become available (image pull time, pod initialization, readiness probe), then you may temporarily see a ratio lower than 100%.

          What to Do Otherwise?

          Determine the Workloads that have low availability by drilling down to the related Dashboard in Explore. For example, the Kubernetes Deployment Overview helps understand the trend and the state of the pods.

          For instance, the image above shows that the ratio is 98% (3.93 / 4 x 100). The slight decline is due to an update that caused pods to be terminated and consequently to be started with a newer version.

          CPU Used vs Requests

          The chart shows the latest value returned by sum(avg(cpu.cores.used)) / sum(avg(kubernetes.pod.resourceRequests.cpuCores)).

          What Is It?

          The chart shows the ratio between the total CPU usage across all pods of a selected Workload and the total CPU requested by all the pods.

          The upper bound shows the total CPU requested by all the pods. The value denotes the number of CPU cores.

          In this image, the pods in the Workload request 40 CPU cores, of which 43% are actually used (about 17 cores).

          What to Expect?

          It depends on the type of workload.

          For applications whose resource usage is constant (such as background processes), expect the ratio to be around 100%.

          For “bursty” applications, such as an API server, expect the ratio to be lower than 100%. Note that the value is averaged for the selected time window, therefore, a usage spike would be compensated by an idle period.

          Generally, values between 80% and 120% are considered normal. Values higher than 100% are deemed normal if observed only for a relatively short time.

          What to Do Otherwise?

          • A low usage indicates that the application is not properly running (not executing the expected functions) or the Workload configuration is not accurate (requests are too high compared to what the pods actually need).

          • A high usage indicates that the application is under heavy load or the Workload configuration is not accurate (requests are too low compared to what the pods actually need).

          In either case, drill down to the Kubernetes Overview Dashboard corresponding to the Workload in Explore. For example, the Kubernetes Deployment Overview Dashboard provides insight into resource usage and configuration.

          Can the Value Be Higher than 100%?

          Yes, it can.

          • Configuring CPU requests without limits or requests lower than limits is permissible. In these cases, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

          • Consider a Workload with two containers. Say, one container is configured to request 1 CPU core and uses 1 CPU core (Used vs Requests ratio is 100%). The other is configured without any request and uses 1 CPU core. In this example, the ratio of 2 CPU cores used to 1 CPU core requested is 200% at the Workload level.

          What Does “No Data” Mean?

          If the Workload is configured with no requests and limits, then the Usage vs Requests ratio cannot be computed. In this case, the chart will show “no data”. Drill down to the Dashboard in Explore to evaluate the actual usage.

          You must always configure requests. Setting requests helps to detect Workloads that require reconfiguration.

          Kubernetes itself might expose Workloads with no requests or limits configured. For example, the kube-system Namespace can have Workloads without requests configured.
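          The “no data” case corresponds to an empty denominator. A minimal sketch of how such a ratio might be guarded (the helper name is hypothetical, not a Sysdig API):

```python
from typing import Optional

def usage_vs_requests_pct(total_used: float, total_requested: float) -> Optional[float]:
    """Return the Used vs Requests ratio in percent, or None ("no data")
    when no requests are configured and the ratio cannot be computed."""
    if total_requested == 0:
        return None  # rendered as "no data" on the chart
    return total_used / total_requested * 100

print(usage_vs_requests_pct(2.0, 0.0))  # None
print(usage_vs_requests_pct(1.0, 2.0))  # 50.0
```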

          Memory Used vs Requests

          The chart shows the latest value returned by sum(avg(memory.bytes.used)) / sum(avg(kubernetes.pod.resourceRequests.memBytes)).

          What Is It?

          The chart shows the ratio between the total memory usage across all the pods in a Workload and the total memory requested by the Workload.

          The upper bound shows the total memory requested by all the pods, expressed in bytes (with an appropriate unit).

          For instance, the image shows that the pods in the selected Workload request 120 GiB, of which 24% is actually used (about 29 GiB).

          What to Expect?

          The type of Workload determines the ratio. Values between 80% and 120% are considered normal. Values higher than 100% are deemed normal if observed only for a relatively short time.

          What to Do Otherwise?

          A low memory usage indicates that the application is not properly running (not executing the expected functions) or the Workload configuration is not accurate (requests are too high compared to what the pods actually need).

          A high memory usage indicates that the application is under heavy load or the Workload configuration is not accurate (requests are too low compared to what the pods actually need).

          Given the configured limits for the Workloads and the memory pressure on the nodes, if the Workloads use more memory than what’s requested they are at risk of eviction. For more information, see Container’s Memory Limit.

          In either case, drill down to the Workloads page to determine the Workload that requires a deeper analysis.

          Can the Value Be Higher than 100%?

          Yes, it can.

          • Configuring memory requests without limits or requests lower than limits is permissible. In these cases, you are allowing the containers to use more resources than requested, typically to handle temporary overloads.

          • Consider a Workload with two containers. Say, one container is configured to request 1 GiB of memory and uses 1 GiB (Used vs Requests ratio is 100%), while the other is configured without any request and uses 1 GiB of memory. In this example, the ratio of 2 GiB of memory used to 1 GiB requested is 200% at the Workload level.

          What Does “No Data” Mean?

          If the Workload is configured with no memory requests and limits, then the Usage vs Requests ratio cannot be computed. In this case, the chart will show “no data”. Drill down to the Dashboard in Explore to evaluate the actual usage.

          You must configure requests. It helps to detect Workloads that require reconfiguration.

          Kubernetes itself might expose Workloads with no requests or limits configured. For example, the kube-system Namespace can have Workloads without requests configured.

          Network I/O

          The chart shows the latest value returned by avg(avg(net.bytes.total)).

          What Is It?

          The sparkline shows the trend of network traffic (inbound and outbound) for the Workload. The number shows the most recent rate, expressed in bytes per second (with an appropriate unit).

          For reference, the sparklines show the following number of steps (sampling):

          • Last hour: 6 steps, each for a 10-minute time slice

          • Last 6 hours: 12 steps, each for a 30-minute time slice

          • Last day: 12 steps, each for a 2-hour time slice

          What to Expect?

          The type of applications running in the Workload determines this metric. Drill down to the Kubernetes Overview Dashboard corresponding to the Workload in Explore. For example, the Kubernetes Deployment Overview Dashboard provides additional details, such as network activity across pods.