Troubleshoot Cluster Shield
For an overview of Cluster Shield and installation instructions, see Install Shield on Kubernetes.
API Slowness in EKS
Overview
If you experience slowness with the Kubernetes API server in an Elastic Kubernetes Service (EKS) cluster, and you have Cluster Shield with the Audit feature enabled, the issue could be related to connectivity problems between the EKS control plane and the audit webhook endpoint.
API slowness occurs because every API call to the Kubernetes API server must be validated by admission controllers before processing. These controllers enforce security and compliance checks on incoming requests.
If the admission controllers are unreachable—often due to networking issues preventing the EKS control plane from reaching the webhook endpoints—the API server waits while attempting to contact them. This waiting leads to retries and timeouts, significantly slowing down API response times. To prevent this, ensure that the admission controllers are accessible and properly connected.
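How long each request can stall is governed by the timeoutSeconds and failurePolicy fields of the webhook configuration. The following sketch uses the standard Kubernetes admissionregistration.k8s.io/v1 schema for illustration only; the names, namespace, and rules are placeholders, not the exact objects installed by the Cluster Shield chart.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: audit-webhook-example            # placeholder name
webhooks:
  - name: audit.example.sysdig.com       # placeholder webhook name
    clientConfig:
      service:
        name: cluster-shield-example     # placeholder service name
        namespace: example-namespace     # placeholder namespace
        port: 6443                       # the Audit port the control plane must be able to reach (6443 by default)
    failurePolicy: Ignore                # requests are still admitted if the webhook cannot be reached...
    timeoutSeconds: 10                   # ...but only after waiting up to this many seconds per request
    sideEffects: None
    admissionReviewVersions: ["v1"]
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["*"]
        resources: ["*/*"]
When the webhook endpoint is reachable, the call returns almost immediately and the timeout never comes into play.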
Solution
To ensure proper connectivity and prevent slowness, do the following:
Allow API Server Connectivity to Pods:
For Cluster Shield’s Audit feature to function efficiently, the Kubernetes API server must be able to connect to the webhook endpoint.
In EKS, where a custom Container Network Interface (CNI) may block direct communication, ensure that the necessary ports are open and accessible:
- Audit uses port 6443 by default.
- Admission Control uses port 8443 by default.
To customize these ports, see CNI on EKS.
Update Security Group Rules:
- Update the inbound rules for the Security Group associated with your EKS worker nodes to allow TCP traffic on the Audit port. 6443 is the default port used by Cluster Shield Audit.
- Ensure that the source of this traffic is the EKS cluster’s control plane security group.
This configuration allows the API server to reach the audit webhook endpoint without unnecessary delays.
Security Group Inbound Rule Requirements:
- Protocol: TCP
- Port: The port you specified for Cluster Shield Audit. 6443 is the default port.
- Source: The EKS cluster’s control plane security group
If you have also enabled the Admission Control feature of Cluster Shield, ensure that TCP traffic on the admission controller port is allowed as well. 8443 is the default admission controller port.
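If you manage your security groups with infrastructure as code, the following CloudFormation sketch shows what these inbound rules could look like. The security group IDs are placeholders for your worker node and control plane security groups, the same rules can equally be added through the AWS console or CLI, and the ports should be adjusted if you customized them (see CNI on EKS).
Resources:
  ClusterShieldAuditIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: sg-0123456789abcdef0                # placeholder: worker node security group
      IpProtocol: tcp
      FromPort: 6443                               # Cluster Shield Audit port (default)
      ToPort: 6443
      SourceSecurityGroupId: sg-0fedcba9876543210  # placeholder: EKS control plane security group
      Description: Allow the EKS control plane to reach the Cluster Shield audit webhook
  ClusterShieldAdmissionIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: sg-0123456789abcdef0                # placeholder: worker node security group
      IpProtocol: tcp
      FromPort: 8443                               # Cluster Shield Admission Control port (default)
      ToPort: 8443
      SourceSecurityGroupId: sg-0fedcba9876543210  # placeholder: EKS control plane security group
      Description: Allow the EKS control plane to reach the Cluster Shield admission webhook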
CNI on EKS
At times, you may need to change the default ports for Cluster Shield’s Audit and Admission Controller.
For instance, when using a custom Container Network Interface (CNI) on EKS, the API server may not be able to reach the webhook endpoint. This occurs because the EKS control plane cannot be configured to run on a custom CNI, so it cannot reach pods on the custom pod network directly; running Cluster Shield on the host network works around this.
To resolve this issue, when installing Cluster Shield via Helm, apply the following configurations:
cluster:
  host_network: true
features:
  admission_control:
    enabled: true
    http_port: 6000 # Or any other open and unused port > 1024
  detections:
    kubernetes_audit:
      enabled: true
      http_port: 5000 # Or any other open and unused port > 1024
Update the inbound rules in the EKS worker nodes’ security group to allow TCP communication from the EKS cluster security group on the ports you specified. In this example, the ports on which you allow TCP communication are 5000 and 6000.
Failure Creating/Updating Control Plane Nodes
Overview
Some Kubernetes distributions have started using Cluster API and allow kubeadm to talk directly to the local node API, making it unnecessary to wait for the CNI to come up before provisioning the node.
As a consequence, resources start being created right away, and each request reaches out to the ValidatingWebhook for approval.
The ValidatingWebhook is implemented as a Deployment: it does not need to run on every node or consume resources on each of them, but this means the CNI must be available for a new node to contact it.
The ValidatingWebhook should be configured to ignore failures, but due to concurrent timeouts, node provisioning might still fail in some cases.
See the related KB article for vSphere Kubernetes here.
Solution 1 - Temporarily Disable the ValidatingWebhook
Before provisioning new Control plane nodes (which includes upgrading them), you can temporarily disable the ValidatingWebhook and re-enable it afterward.
1. Set features.detections.kubernetes_audit.enabled to false:
   features:
     detections:
       kubernetes_audit:
         enabled: false
2. Provision the new node and wait for it to be ready.
3. Revert the change made at step 1: set features.detections.kubernetes_audit.enabled to true:
   features:
     detections:
       kubernetes_audit:
         enabled: true
This is a temporary solution that must be applied every time you provision new nodes that are part of the control plane. It creates a complete blind spot for detections based on Kubernetes actions, but only for a short time.
Solution 2 - Exclude Sensitive Namespaces from the ValidatingWebhook
You can tune the ValidatingWebhook so that it is bypassed for the components required for node provisioning. This can be done either by ignoring namespaces or by refining the ValidatingWebhook rules.
Which resources to bypass, and how, depends on the specific distribution. To determine them, inspect the logs and see which resources fail to be provisioned.
To ignore namespaces, use the features.detections.kubernetes_audit.excluded_namespaces attribute in the Shield chart and add the namespaces you want to exclude. For instance, to exclude kube-system:
features:
  detections:
    kubernetes_audit:
      excluded_namespaces:
        - kube-system
You can also customize the ValidatingWebhook rule. For instance, to exclude Cluster-scoped resources:
features:
  detections:
    kubernetes_audit:
      webhook_rules:
        - apiGroups:
            - ""
            - apps
            - autoscaling
            - batch
            - networking.k8s.io
            - rbac.authorization.k8s.io
            - extensions
          apiVersions:
            - '*'
          operations:
            - '*'
          resources:
            - '*/*'
          scope: Namespaced
This is a longer-term solution, but it requires more effort and tuning to set up. It also creates a persistent blind spot, even if a narrower one.