Troubleshoot

Troubleshoot your AWS cloud connection and the cloud connector component itself.

General Troubleshooting

Find more troubleshooting options in the Terraform AWS module source repository.

Resolve 409 Conflict Error

This error may occur if the specified cloud account has already been onboarded to Sysdig.

Solution:

Import the existing cloud account into Terraform by running the following command, replacing CLOUD_ACCOUNT_ID with the ID of the account:

terraform import module.cloud_bench.sysdig_secure_cloud_account.cloud_account CLOUD_ACCOUNT_ID

Resolve Permissions Error/Access Denied

This error may occur if your current AWS authentication session does not have the required permissions to create certain resources.

Solution:

Ensure you are authenticated to AWS using a user or role with the required permissions.
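Before retrying, it can help to confirm which identity your session actually resolves to. The following is a minimal sketch, assuming you have the boto3 library installed and pass in an STS client. The helper name `describe_caller` is hypothetical and is not part of any Sysdig tooling.

```python
def describe_caller(sts_client) -> str:
    """Return a one-line summary of the identity behind the current AWS session.

    `sts_client` is expected to behave like boto3.client("sts"); the
    get_caller_identity call returns the caller's ARN and account ID.
    """
    identity = sts_client.get_caller_identity()
    return f"{identity['Arn']} (account {identity['Account']})"
```

A typical call would be `print(describe_caller(boto3.client("sts")))`. If the reported user or role is not the one you expected, switch credentials before running Terraform again.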

Tune Threat Detection

Note: the default ECS setup is meant to manage a low-to-medium load of AWS CloudTrail events. For use cases where the event load is high, consider scaling up the footprint according to the usage metrics.

To scale the Sysdig cloud component properly, look at usage metrics such as CPU and memory for the ECS service.

In this example, the CPU has little work to do and memory usage stays mostly under 25%, so there is no need to scale. If scaling were needed, you would update the Task Definition in ECS.

To check the current task size, open the task that is currently running.

In this example, the task uses half a GB of RAM and a quarter of a vCPU. To scale vertically, change these values and create a new task definition revision, which the service then deploys.

For horizontal scaling, update the number of replicas of the ECS service, or update the service to increase its Number of Tasks.

Verify the effect of any scaling change against both the cloud connector component metrics and the SQS ingestion metrics. Adjust the CPU, RAM, and replica values until CPU and RAM usage, message age, and message delay are at acceptable levels.
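The tuning loop described above can be sketched as a small decision helper. This is only an illustration: the thresholds, the `UsageSnapshot` type, and the rule that saturation suggests vertical scaling while a growing backlog suggests horizontal scaling are all assumptions, not values published by Sysdig.

```python
from dataclasses import dataclass


@dataclass
class UsageSnapshot:
    """Metrics averaged over the observation window (illustrative names)."""
    cpu_percent: float           # ECS service CPU utilization, 0-100
    memory_percent: float        # ECS service memory utilization, 0-100
    oldest_message_age_s: float  # age of the oldest SQS message, in seconds


# Hypothetical thresholds; tune them to your own workload.
HIGH_UTILIZATION = 80.0
MAX_MESSAGE_AGE_S = 300.0


def recommend_scaling(snapshot: UsageSnapshot) -> str:
    """Return 'vertical', 'horizontal', or 'none' for the given snapshot."""
    saturated = (
        snapshot.cpu_percent >= HIGH_UTILIZATION
        or snapshot.memory_percent >= HIGH_UTILIZATION
    )
    backlog = snapshot.oldest_message_age_s >= MAX_MESSAGE_AGE_S
    if saturated:
        # The task itself is out of CPU or memory: give it a larger size.
        return "vertical"
    if backlog:
        # Tasks have headroom but the queue is falling behind: add tasks.
        return "horizontal"
    return "none"
```

With the figures from the example in this section (low CPU, memory under 25%, no queue backlog), the helper returns `none`.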

Troubleshoot Cloud Connector in AWS Console

When using the cloud-connector-based installation (that is, for CIEM installs), you may want to troubleshoot issues directly in the AWS console. This section describes some common problems associated with shipping CloudTrail logs for CIEM.

The CIEM workflow for AWS cloud requires collecting CloudTrail events from cloud accounts, inspecting those events to understand the current security posture, and producing best-effort recommendations to tighten the overall security across the cloud landscape.

The workflows rely on a “cloud connector” agent to ship the CloudTrail logs from cloud accounts to the Sysdig platform. If the cloud connector becomes unavailable, restoring it can involve different stages of inspecting the health and tuning the configuration to keep it running.

Check Cloud Connector Health

Follow these steps:

  1. Navigate to Amazon Elastic Container Service -> Clusters.

  2. Choose the cluster where the cloud connector service is deployed.

  3. Select the cloud connector service and check the health metrics.

  4. Choose different time windows and inspect the resource consumption trend.

    In this example, the health metrics show moderate resource consumption because the event processing rate is relatively low. However, the event rate is not always steady; it can rise sharply, and the default configuration (intended for low-to-medium load) might then be insufficient for stable processing.

    In such a situation, when the active instance(s) cannot handle the current load, check the current resource configuration and consider scaling it up.

  5. Inspect the task to see the current resource configuration.

    When a single instance is insufficient (even after multiple scale-up attempts), try scaling out by adjusting the deployment configuration and forcing a new deployment. Contact the Sysdig support team for assistance.
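The scale-out step can also be done programmatically instead of through the console. The sketch below assumes boto3 is available and uses the ECS `update_service` call with its `desiredCount` and `forceNewDeployment` parameters. The helper name `scale_out` and the cluster and service names in the usage note are hypothetical.

```python
def scale_out(ecs_client, cluster: str, service: str, desired_count: int) -> dict:
    """Set the number of running tasks and force a new deployment.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    if desired_count < 1:
        raise ValueError("desired_count must be at least 1")
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,  # number of task replicas to run
        forceNewDeployment=True,     # restart tasks even if the definition is unchanged
    )
```

For example, `scale_out(boto3.client("ecs"), "my-cluster", "cloud-connector", 2)` would request two tasks, assuming those are the names used in your deployment. If the service is managed by Terraform, prefer changing the module inputs so that the state does not drift.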

Check Heartbeat Signals

The Sysdig Secure CIEM server (also called “cloudsec server”) relies on heartbeats from the cloud connector to monitor the communication channel between the cloud connector and the events pipeline. All the dispatched events from the cloud connector are indexed in the remote storage via the events pipeline. A heartbeat is a small packet of data sent from the cloud connector to the cloudsec server regularly, by default every 5 minutes, via HTTPS.

When the cloud connector consistently fails to send a heartbeat, the cloud account status displayed on the Identity and Access Management page(s) eventually becomes disconnected. We recommend checking the ECS task log to see whether the cloud connector is sending the heartbeat signal.

Navigate to Amazon Elastic Container Service -> Clusters -> [Service Name] -> Tasks -> [Task ID] -> Logs.

A healthy heartbeat signal request and reply look like this:

The first log line records the heartbeat signal sent from the cloud connector, and the second records the reply received from the cloudsec server. Contact the Sysdig support team if you don’t see these heartbeat traces for an extended period.
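To judge what counts as an "extended period", one option is to compare the timestamps of the heartbeat log lines against the default 5-minute interval. The following sketch assumes you have already extracted those timestamps from the task log yourself; it does not parse the log. The helper name and the tolerance of two intervals are assumptions.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # default interval stated above


def find_heartbeat_gaps(timestamps, window_start, now, tolerance=2):
    """Return (start, end) pairs where no heartbeat was seen for too long.

    A gap is reported when the time between two consecutive points exceeds
    `tolerance` heartbeat intervals. The window boundaries are included so
    that a log with no heartbeats at all is reported as a single gap.
    """
    limit = HEARTBEAT_INTERVAL * tolerance
    points = [window_start] + sorted(timestamps) + [now]
    gaps = []
    for earlier, later in zip(points, points[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps
```

An empty result means that heartbeats arrived regularly across the whole window. Any returned pair marks a stretch worth correlating with the task's health metrics.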

Check That the CIEM Feature Is Enabled

If the CIEM feature is toggled off, the collected events will not be processed or dispatched to the remote storage. An active cloud connector can detect the feature status and report the current status in the logs, which looks like this:

If the message instead states that the feature is disabled, you can enable it via the Sysdig Secure API or contact Sysdig support for assistance.