Advisor

Advisor brings your metrics, alerts, and events into a focused and curated view to help you operate and troubleshoot Kubernetes infrastructure.

Advisor is available only to SaaS users. It is not currently available for on-prem environments.

Advisor presents your infrastructure grouped by cluster, namespace, workload, and pod. You cannot currently configure a custom grouping. Depending on the selection, you will see different curated views and you can switch between the following:

  • Advisories
  • Triggered alerts
  • Events from Kubernetes, container engines, and custom user events
  • Cluster usage and capacity
  • Key golden signals (requests, latency, errors) derived from system calls
  • Kubernetes metrics about the health and status of Kubernetes objects
  • Container live logs
  • Process and network telemetry (CPU, memory, network connections, etc.)
  • Monitoring Integrations

Advisor displays metrics for the last hour of collected data. To see historical values for a metric, drill down to a related dashboard or explore the metric using the Explore UI.

Advisories

Advisories evaluate the thousands of data points being sent by the Sysdig agent, and display a prioritized view of key problems in your infrastructure that affect the health and availability of your clusters and the workloads running on them.

When you select an advisory, Advisor surfaces information related to the issue, such as metrics, events, live logs, and remediation guidance, so that you can pinpoint and resolve problems faster. Following SRE best practices, advisories are not necessarily symptoms of a problem; they are often causes that you may not want to be alerted on.

Example Issues Detected

Problem           Description
CrashLoopBackOff  A pod repeatedly starts, crashes, and restarts. This can cause applications to be degraded or unavailable.
Container Error   A persistent application error is causing containers to be terminated. An application error, or exit code 1, means the container was terminated due to an application problem.
CPU Throttling    Containers are hitting their CPU limit and being throttled. CPU throttling does not kill the container, but it starves the container of CPU, which slows the application down.
OOM Kill          When a container reaches its memory limit, it is terminated with an OOMKilled status, or exit code 137. This can lead to application instability or unavailability.
Image Pull Error  A container is failing to start because it cannot pull its image.
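
The CPU Throttling and OOM Kill entries both refer to limits set on a container. As a rough illustration, the sketch below shows where those limits live in a Kubernetes Pod spec. The pod name, image, and all values are hypothetical and are not taken from Sysdig guidance.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                      # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/example-app:1.0   # hypothetical image
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m        # CPU demand above this limit is throttled
          memory: 512Mi    # exceeding this limit gets the container OOM-killed (exit code 137)
```

Raising a limit is only one possible remediation; whether it is the right one depends on why the workload is hitting the limit in the first place.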

Advisories are automatically resolved when the problem is no longer detected. You cannot customize the Advisories evaluated. These are fully managed by Sysdig.

Live Logs

Advisor can display live logs for a container, which is the equivalent of running kubectl logs. This is useful for troubleshooting application errors or problems such as pods in a CrashLoopBackOff state.

When you select a pod, a Logs tab appears. If the pod has multiple containers, you can select the container whose logs you want to view. Once requested, logs are streamed for 3 minutes before the session closes automatically; you can restart streaming if necessary.

Live logs are tailed on demand and are not persisted. After a session is closed, they are no longer accessible.

Manage Access to Live Logs

By default, live logs are available to users within the scope of their Sysdig Team. Use Custom Roles to manage live logs permissions.

Configure Agent for Live Logs

Live logs are enabled by default in agent 12.7.0 and newer. Agent 12.6.0 supports live logs, but you must enable them manually by setting enabled: true under live_logs. Older versions of the Sysdig Agent do not support live logs.
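
For agent 12.6.0, the manual opt-in might look like the following dragent.yaml fragment. This is a minimal sketch that assumes the same live_logs key used for disabling the feature; verify it against your agent version before relying on it.

```yaml
# dragent.yaml fragment (assumed layout): opt in to live logs on agent 12.6.0
live_logs:
  enabled: true
```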

Live logs can be enabled or disabled within the agent configuration.

To turn live logs off globally for a cluster, add the following in the dragent.yaml file:

live_logs:
  enabled: false

If using Helm, this is configured via sysdig.settings. For example:

sysdig:
  # Advanced settings. Any option in here will be directly translated into dragent.yaml in the Configmap
  settings:
    live_logs:
      enabled: false

Agent Errors

Live Logs reports the following agent errors:

Error Code  Cause
401         The kubelet doesn't have bearer token authorization enabled.
403         The agent ClusterRole, sysdig-agent, doesn't have the nodes/proxy permission.
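
As a hedged illustration of what addressing these two errors could involve on the Kubernetes side, the sketch below shows a kubelet setting that enables bearer token authentication and an RBAC rule that grants access to the kubelet proxy subresource (named nodes/proxy in Kubernetes RBAC). Both documents are assumptions for illustration: how the kubelet is configured varies by distribution, the verb list is a guess, and the real sysdig-agent ClusterRole contains other rules that are omitted here.

```yaml
# 401: enable bearer token (webhook) authentication on the kubelet.
# Assumes the kubelet is configured through a KubeletConfiguration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  webhook:
    enabled: true
---
# 403: grant the agent's ClusterRole access to the kubelet proxy subresource.
# Illustrative excerpt only; the verb list is an assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sysdig-agent
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
```

If the agent was installed with Helm, the chart normally manages this ClusterRole, so upgrading the chart may be preferable to editing the role by hand.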