Sysdig Documentation

Pre-Defined Dashboards

Sysdig provides a number of pre-defined dashboards to assist users in monitoring their environments and applications. This section outlines the main dashboards that are available out of the box.

Application Dashboards

Dashboard

Description

Use Cases

Elasticsearch

This view lists eight important metrics for node and document counts, shards, indexing time and query latency.

  • Track the node count, as this can impact query times.

HAProxy

This view reports metrics for host CPU use and proxy throughput.

Redis

This view reports seven metrics for host resource usage and application performance.

Cassandra By Node

This view shows how every node in a Cassandra cluster is performing, by mixing key system metrics with Cassandra-specific metrics such as request volume and compactions.

  • Use this view on a group containing the entire Cassandra cluster when you have already identified that there is a problem with a metric (using the "Cassandra Overview" view), and you need to see which node is causing the problem.

  • Spot issues such as imbalances in the amount of data held by each node, nodes going down and generating many hinted handoffs, or disk bottlenecks (by looking at pending compactions).

Cassandra Overview

This view shows how a Cassandra cluster is performing, by mixing key system metrics with Cassandra-specific metrics such as request volume and compactions.

  • Use this view on a group containing the entire Cassandra cluster as a first starting point to troubleshoot the overall health of your database.

  • Inspect typical system metrics to make sure the cluster is not being overloaded.

  • Correlate the information displayed with important advanced Cassandra metrics such as pending compactions or JVM metrics to identify critical problems.

HTTP Top Requests

This view details the top requested URLs to your web server, including the total number of requests, average and maximum times to service the requests, and the amount of traffic contained in the requests and responses.

MongoDB

This view shows how busy the MongoDB service is, which collections are in highest demand and which have the slowest performance.

  • Use this view to spot which collections may benefit from query and index performance tuning.

HTTP

This view provides a basic understanding of the health of your web server by showing the load being put on it and the server's ability to service requests in a timely manner.

  • Gauge the overall busyness of the server.

  • Identify correlations between the Top URLs and Slowest URLs panels to find opportunities to increase performance.

MySQL/PostgreSQL

This view shows the overall load and performance status of your SQL database transactions with metrics for the number of requests and how quickly they are handled.

  • Determine whether performance can be improved.

MySQL/PostgreSQL Top

This view shows the top SQL queries by displaying metrics for the number of queries received and the amount of traffic sent and received for each query.

  • Identify the most requested, highest-traffic, or slowest queries.

Compliance Dashboards

Dashboard

Description

Use Cases

Compliance (Docker)

Provides an overview of the available compliance metrics for Docker.

  • Review the Docker configuration after running CIS Docker benchmark tests.

Compliance (Kubernetes)

Provides an overview of the available compliance metrics for Kubernetes.

  • Review the Kubernetes cluster configuration after running CIS Kubernetes benchmark tests.

Hosts and Containers Dashboards

Dashboard

Description

Use Cases

Overview by Container

Displays resource usage statistics, including CPU, file bytes, memory and network bytes, for containers running within the defined scope.

  • Monitor this view to identify which containers are using disproportionate amounts of resources.

  • Helps determine whether an application should be moved to a more capable host.

Overview by Host

Displays resource usage statistics, including CPU, file bytes, memory and network bytes, for hosts running within the defined scope.

  • Use this view to identify when a host is over- or underutilized within a group of hosts with similar job functions.

Overview by Process

Displays resource usage statistics, including CPU, file bytes, memory and network bytes, for the top processes running within the defined scope.

  • Monitor this view to identify which processes are using disproportionate amounts of resources.

  • Helps determine whether an application should be moved to a more capable host.

Overview by Container Image

The container image overview breaks down resource usage metrics by images within the environment.

Container Limits

The Container Limits dashboard shows CPU and memory limits across the environment, and the percentages currently used.
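
The percentage shown for each limit is, conceptually, current usage divided by the configured limit. The sketch below illustrates that arithmetic under one stated assumption: an unset limit is reported as 0, in which case no percentage is meaningful. The helper name percent_of_limit is hypothetical and not part of any Sysdig API.

```python
from typing import Optional


def percent_of_limit(used: float, limit: float) -> Optional[float]:
    """Return usage as a percentage of the limit (hypothetical helper).

    Assumes an unset limit is reported as 0; in that case None is
    returned instead of dividing by zero.
    """
    if limit <= 0:
        return None
    return 100.0 * used / limit


# Example: a container using 384 MiB against a 512 MiB memory limit.
mib = 1024 * 1024
print(percent_of_limit(384 * mib, 512 * mib))  # 75.0
print(percent_of_limit(384 * mib, 0))          # None (no limit set)
```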

Top Files

The Top Files dashboard displays a table of the most used files across the environment. By default, the column metrics are the total bytes used, the errors encountered, and the total time spent on input/output operations on each file.

Sysdig Secure Summary

The summary dashboard provides a complete overview of the Sysdig Secure environment, including the number of active agents, the number of defined policies and how many have been enabled, and summary policy event information.

Top Processes

Lists the top processes running on the host or hosts.

  • Identify the top consuming processes in an environment where the same process is spawned multiple times.

Sysdig Agent Summary

This view reports the number of Sysdig agents deployed in your environment and their versions.

Top Server Processes

Displays the resource consumption for server-oriented processes only (for example, httpd, java, and ntpd).

  • Use this view to see resource usage for server processes only.

File System

This table view shows directory mount points, file system devices, and capacity and usage information for the file systems mounted on the instance. When groups are selected, metrics are averages for similar file system mount points.

Note

Remotely mounted file systems are not listed by default. To enable them, add the entry 'remotefs.enabled = true' to the /opt/draios/bin/dragent.properties file on each instance.

  • Identify which file systems are filling up or being underutilized.
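
Applying the remote file system setting from the note above across many instances can be scripted. The following is a minimal sketch that assumes the file uses simple `key = value` lines, as the note's entry suggests. The helper name enable_remotefs is hypothetical; it only edits the file (restarting the agent, if required, is not shown), and writing under /opt typically requires root privileges.

```python
from pathlib import Path

ENTRY = "remotefs.enabled = true"
DEFAULT_PATH = Path("/opt/draios/bin/dragent.properties")


def enable_remotefs(path: Path = DEFAULT_PATH) -> bool:
    """Ensure the file sets remotefs.enabled to true (hypothetical helper).

    Any existing remotefs.enabled line is rewritten; otherwise the entry is
    appended. Returns True if the file was changed, False if it already
    contained the setting.
    """
    lines = path.read_text().splitlines() if path.exists() else []
    found = changed = False
    for i, line in enumerate(lines):
        key, sep, value = line.partition("=")
        if sep and key.strip() == "remotefs.enabled":
            found = True
            if value.strip() != "true":
                lines[i] = ENTRY  # overwrite e.g. "remotefs.enabled = false"
                changed = True
    if not found:
        lines.append(ENTRY)
        changed = True
    if changed:
        path.write_text("\n".join(lines) + "\n")
    return changed
```

Rewriting an existing key instead of always appending avoids leaving two conflicting remotefs.enabled entries in the same file.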

Network Dashboards

Dashboard

Description

Use Cases

Connections Table

The connections table dashboard displays a full list of the environment’s local and remote endpoints, and all network traffic resource statistics relevant to those endpoints.

  • Use this view to quickly find the top talkers on the network for the host under review.

Overview

The Network Overview dashboard provides a broad overview of network traffic for the environment, including total input and output, as well as traffic broken down by host, application, and process.

Response Times vs Resource Usage

This dashboard maps various usage statistics and response times, including memory and CPU usage, network response times, and total network and file bytes, across the specified time period.

  • Use this view to identify which resources affect response performance the most. Increase those resources as necessary, then check whether response times improve.

Top Ports

The top ports dashboard displays statistics broken down by the port, including the number of connections to each port, and the incoming, outgoing, and total bytes.

Kubernetes Resource Usage Dashboards

The Kubernetes * Health dashboards break down resource and performance metrics by various logical entities to allow for in-depth analysis and to help identify and isolate critical issues. Each dashboard is built around the Golden Signals approach to monitoring: Latency, Traffic, Errors, and Saturation. Resource utilization metrics are oriented toward health and performance, covering aspects such as CPU, memory, network, and storage usage by Kubernetes object, while kube-state-metrics report the status or count of objects. By pairing kube-state-metrics with resource utilization metrics, each dashboard provides a comprehensive picture of what is happening in your Kubernetes environment.

Dashboard

Description

Use Cases

Kubernetes Cluster Overview

Provides an overview of your Kubernetes cluster.

  • Identify performance bottlenecks.

  • Locate logical entities that are consuming too many cluster resources, or that are rapidly trending upwards towards unsustainable levels.

  • Dive deeper into specific entities to identify the root cause of problems.

  • Track usage percentages over time to better estimate the capacity needed for expansion.

For example:

  • A deployment with no available pods indicates that the corresponding app is not serving requests. Seeing this condition on a dashboard lets you visualize the metrics and act quickly to find and resolve the issue.

  • A number of available pods that drops and stays below the desired number indicates that your application performance is degraded or that the application is not running at the required redundancy. With these metrics on the dashboard, you can quickly gauge how severely your app's user experience is affected.

  • Running fewer replicas than desired for an extended period is a symptom of entities not working properly, such as unavailable nodes or resources, Kubernetes or Docker Engine failures, broken Docker images, and so on. A deployment object with no replicas could mean that the app is down.

  • A continuous loop of pod restarts (CrashLoopBackOff) might be caused by missing dependencies, unmet requirements, or insufficient resources. In CrashLoopBackOff, pods never reach the ready status and are therefore counted as unavailable and down.
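
The examples above all compare available replicas with desired replicas. As an illustration only, the sketch below classifies a deployment from those two counts and flags a shortfall that persists across several samples. The enum, its category names, and both functions are hypothetical and do not correspond to Sysdig alert definitions or metric names.

```python
from enum import Enum
from typing import Sequence, Tuple


class DeploymentHealth(Enum):
    """Hypothetical health categories derived from replica counts."""
    DOWN = "no available pods; the app is likely not serving requests"
    DEGRADED = "fewer available pods than desired; reduced redundancy"
    HEALTHY = "available pods meet or exceed the desired count"


def classify(available: int, desired: int) -> DeploymentHealth:
    """Classify a single (available, desired) snapshot of a deployment."""
    if desired > 0 and available == 0:
        return DeploymentHealth.DOWN
    if available < desired:
        return DeploymentHealth.DEGRADED
    return DeploymentHealth.HEALTHY


def persistently_degraded(samples: Sequence[Tuple[int, int]], min_samples: int) -> bool:
    """Return True if each of the last `min_samples` samples is below desired.

    A short dip may just be a rollout in progress; a shortfall that lasts
    for several consecutive samples matches the symptom described above.
    """
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")
    recent = list(samples)[-min_samples:]
    return len(recent) == min_samples and all(a < d for a, d in recent)


print(classify(0, 3).name)  # DOWN
print(classify(2, 3).name)  # DEGRADED
print(persistently_degraded([(3, 3), (2, 3), (2, 3), (1, 3)], 3))  # True
```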

Kubernetes Deployment Overview

Highlights whether each deployment has a sufficient number of available pods and resources, and indicates the number of pods that are running, desired, or updated.

Kubernetes Namespace Overview

Displays metrics such as resource requests and resource limits at the namespace level; identifies the performance of Kubernetes entities such as pods, deployments, DaemonSets, StatefulSets, and jobs; and checks compliance with ReplicaSet specs. Highlights the number of services, deployments, ReplicaSets, and jobs per namespace.

Kubernetes Node Overview

Highlights the number of nodes that are ready, unavailable, or out of disk, and the number of nodes under memory, disk, or network pressure; compares allocatable capacity with requested capacity on each node; and shows the pod resources of a node that are available for scheduling, along with the available capacity to serve the pods running on the nodes.
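
The Kubernetes scheduler places pods based on their resource requests relative to a node's allocatable capacity, so the comparison this dashboard makes can be sketched as simple arithmetic. The class below is a hypothetical model with made-up example values, not a Sysdig or Kubernetes API; it computes how much of one resource is still free for scheduling and what share of allocatable capacity is already requested.

```python
from dataclasses import dataclass


@dataclass
class NodeResource:
    """One resource (for example CPU millicores) on one node; hypothetical model."""
    allocatable: float  # capacity the node offers to pods
    requested: float    # sum of the requests of the pods scheduled on the node

    def schedulable(self) -> float:
        """Capacity still free for new pod requests (never negative)."""
        return max(self.allocatable - self.requested, 0.0)

    def requested_percent(self) -> float:
        """Share of allocatable capacity already claimed by requests."""
        if self.allocatable <= 0:
            return 0.0
        return 100.0 * self.requested / self.allocatable


cpu = NodeResource(allocatable=4000.0, requested=3000.0)  # millicores
print(cpu.schedulable(), cpu.requested_percent())  # 1000.0 75.0
```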

Kubernetes Pod Overview

Helps identify potential bottlenecks by graphing the number of container restarts, the number of pods waiting to be scheduled, resource utilization of containers within each pod and available capacity to serve pod requests, the number of available pods compared to the desired pods, and the number of pods in available state and ready to serve requests.

Kubernetes StatefulSet Overview

Overview of the StatefulSet objects in your environment.

Kubernetes DaemonSet Overview

Overview of DaemonSet objects.

Kubernetes Job Overview

Overview of all jobs and their performance information.

Kubernetes ReplicaSet Overview

Provides details such as the number of pods per ReplicaSet, the desired number of pods per ReplicaSet, and the number of pods per ReplicaSet that are in a ready state.

Workloads CPU Usage and Allocation

Displays CPU utilization of your workloads.

Workloads Memory Usage and Allocation

Displays memory utilization of your workloads.

CPU Allocation Optimization

Highlights opportunities to optimize CPU allocation.

Memory Allocation Optimization

Highlights opportunities to optimize memory allocation.

Kubernetes Health Overview

Provides a comprehensive overview of the performance of the entire Kubernetes environment, broken down by logical entities and by underlying resource availability and usage. This dashboard breaks down resource and performance kube-state-metrics by logical Kubernetes entities such as pods, namespaces, deployments, ReplicaSets, and containers.

  • Use these three dashboards to provide a high-level overview of all aspects of the Kubernetes environment's performance and resource saturation status.

  • Set high-level alerts to narrow down areas of concern, before moving to the more in-depth dashboards.

  • Quickly identify major performance issues within each type of entity.

Kubernetes Cluster and Node Capacity

Provides a comprehensive overview of the performance of the hosts or nodes that form the Kubernetes cluster, including CPU, memory, and file system usage, and network traffic.

Before analyzing the dashboard, consider the following guidelines related to resource usage:

  • If Resource Limits is undefined for a container, Kubernetes does not default to a value.

  • If Resource Requests is unspecified for a container, Kubernetes defaults it to Limits if Limits is explicitly specified, otherwise to an implementation-defined value. Limits do not default to any value.

  • If neither Resource Limits nor Resource Requests is specified, kube-state-metrics (and hence Sysdig Monitor) reports zero, regardless of any value Kubernetes may have defaulted. Therefore, only user-defined requests are reported by the kubernetes.pod.resourceRequests.memByte metric.

  • The memory used by a container (the value returned by memory.used.bytes) can be greater than the memory requested by a pod (the value returned by kubernetes.pod.resourceRequests.memByte). This is permissible in Kubernetes because the Requests value determines the minimum amount of resources required, not the maximum.

For these reasons:

  • In some cases, the value of Used Resources will be more than that of Resource Requests and Resource Limits, and the value of Resource Requests could be more than that of Resource Limits.

  • The relation kubernetes.pod.resourceRequests.memByte <= memory.used.bytes <= kubernetes.pod.resourceLimits.memByte therefore does not always hold.
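
The reporting rules above can be modeled in a few lines to see why the relation between requests, usage, and limits can fail. This is a hedged sketch, not Sysdig or kube-state-metrics code: it assumes that any request or limit without a user-defined value is reported as 0 (extending the rule above to each field individually) and does not model Kubernetes defaulting requests to limits. The function names are hypothetical.

```python
from typing import Optional

MIB = 1024 * 1024


def reported(value: Optional[int]) -> int:
    """Value as reported for a request or limit: 0 when it was not set."""
    return value if value is not None else 0


def relation_holds(request: Optional[int], used: int, limit: Optional[int]) -> bool:
    """Check requests <= used <= limits using the reported (0-if-unset) values."""
    return reported(request) <= used <= reported(limit)


# Request and limit both set, usage in between: the relation holds.
print(relation_holds(256 * MIB, 300 * MIB, 512 * MIB))  # True
# Limit unset (reported as 0): used exceeds the reported limit, and the
# reported request (256 MiB) is larger than the reported limit.
print(relation_holds(256 * MIB, 300 * MIB, None))       # False
# Usage below the request is also allowed, so the relation fails here too.
print(relation_holds(256 * MIB, 100 * MIB, 512 * MIB))  # False
```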

Kubernetes State Dashboards

The Kubernetes * State dashboards provide insights into the state of a Kubernetes environment and help ensure container-based services are scheduled and running as expected.

Dashboard

Description

Use Cases

Kubernetes State Overview

(Deprecated in the 3.0.0 release.)

Provides an overview of the state of the Kubernetes environment. Lists the number of Kubernetes objects and determines whether each deployment has a sufficient number of available pods and containers.

  • Monitor the state of nodes, pods, and jobs; check compliance with ReplicaSet specs; identify resource requests and limits.

  • Track usage percentages over time to better estimate the capacity needed for expansion.

  • Quickly identify major performance issues within each type of entity.

  • Identify whether there are enough available pods compared to the desired pods.

  • Determine the number of pods that are in available status, ready to serve requests.

  • Monitor the pod resources of a node that are available for scheduling.

  • Identify the allocatable capacity with respect to the requested capacity on the node.

  • Determine the desired number of pods per ReplicaSet.

Kubernetes Daemonset State

(Deprecated in the 3.0.0 release.)

Highlights the list of pods that are ready, scheduled, unscheduled, and desired by each DaemonSet.

Kubernetes Namespace State

(Deprecated in the 3.0.0 release.)

Displays the number of available Kubernetes objects at the namespace level; identifies potential bottlenecks by giving the number of pod restarts, and a summary of pod status and pod capacity.

Kubernetes Resource Quota State

Provides an overview of resource limits and requests, and the number of replication controllers, services, service ports, service load balancers, ConfigMaps, and secrets.

Kubernetes Pod State

(Deprecated in the 3.0.0 release.)

Highlights the number of pods that are ready, the number of containers per pod, and the total number of nodes. The pod capacity summary lists the state of pods and corresponding resource usage. Resource usage is color-coded to identify potential problems.

Kubernetes Stateful State

(Deprecated in the 3.0.0 release.)

Displays the number of ready pods and containers per StatefulSet. It highlights the number of pod restarts, pods per StatefulSet, and the desired number of pods. The pod capacity summary lists the state of pods and corresponding resource usage in each StatefulSet. Resource usage is color-coded to identify potential problems.

Kubernetes Nodes State

(Deprecated in the 3.0.0 release.)

Highlights the number of nodes that are ready, unavailable, or out of disk, and the number of nodes under memory, disk, or network pressure; displays allocatable resource capacity on the node to serve the pods; and provides a summary of pod capacity.

Kubernetes Deployment State

(Deprecated in the 3.0.0 release.)

Indicates the number of pods and replicas that are running, desired, available, paused, unavailable, or updated in each deployment. Summarizes available and desired resource capacity for each deployment, and associated pods and namespaces.

Services Dashboards

Dashboard

Description

Use Cases

Overview by Service

The Overview by Service dashboard displays the size, performance, and limitations of each service running in the container image.

Service Overview

The Service Overview dashboard outlines a single service: the resources it is using, response times, container and request counts, and how the response times measure up against the resource utilization.

Topology Dashboards

Dashboard

Description

Use Cases

CPU Usage

The CPU Usage dashboard uses the cpu.used.percent metric to show how CPU usage is spread across the entire environment.

  • Identify which instances are communicating but do not have the Sysdig Monitor agent installed.

  • Spot busy hosts, which are color-coded when their CPU usage is elevated.

Network Traffic

The Network Traffic dashboard displays how network bandwidth is spread out across the environment.

  • Identify which instances are communicating but do not have the Sysdig Monitor agent installed.

Response Times

The Response Times dashboard uses the net.request.time.in metric to display the average network traffic response times between processes within the environment.