Calico

Metrics, Dashboards, Alerts and more for Calico Integration in Sysdig Monitor.

This integration is disabled by default. Please refer to Enable and Disable Integrations to enable it in your account.

Versions supported: 3.23.3

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 22 metrics.

Timeseries generated: 838 Timeseries

List of Alerts

Alert | Description | Format
[Calico-Node] Dataplane Updates Are Failing and Retrying | The update actions for dataplane are failing and retrying several times | Prometheus
[Calico-Node] IP Set Command Failures | Encountered a number of ipset command failures | Prometheus
[Calico-Node] IP Tables Restore Failures | Encountered a number of iptables restore failures | Prometheus
[Calico-Node] IP Tables Save Failures | Encountered a number of iptables save failures | Prometheus
[Calico-Node] Errors While Logging | Encountered a number of errors while logging | Prometheus
[Calico-Node] Latency Increase in Datastore OnUpdate Call | The duration of datastore OnUpdate calls is increasing | Prometheus
[Calico-Node] Latency Increase in Dataplane Update | Increased response time for dataplane updates | Prometheus
[Calico-Node] Latency Increase in Acquire Iptables Lock | Increased response time for acquiring the iptables lock | Prometheus
[Calico-Node] Latency Increase While Listing All the Interfaces during a Resync | Increased response time for interface listing during a resync | Prometheus
[Calico-Node] Latency Increase in Interface Resync | Increased response time for interface resync | Prometheus
[Calico-Node] Fork/Exec Child Processes Results in High Latency | Increased response time for Fork/Exec child processes | Prometheus

List of Dashboards

Calico

The dashboard provides information on the Calico integration.

List of Metrics

Metric name
felix_calc_graph_update_time_seconds
felix_cluster_num_hosts
felix_cluster_num_policies
felix_cluster_num_profiles
felix_exec_time_micros
felix_int_dataplane_addr_msg_batch_size
felix_int_dataplane_apply_time_seconds
felix_int_dataplane_failures
felix_int_dataplane_iface_msg_batch_size
felix_int_dataplane_msg_batch_size
felix_ipset_calls
felix_ipset_errors
felix_ipset_lines_executed
felix_iptables_lines_executed
felix_iptables_lock_acquire_secs
felix_iptables_restore_calls
felix_iptables_restore_errors
felix_iptables_save_calls
felix_iptables_save_errors
felix_log_errors
felix_route_table_list_seconds
felix_route_table_per_iface_sync_seconds

Prerequisites

Verify Calico-Node Pods

Once you configure Calico on your cluster, you should see the calico-node pods deployed on your nodes. The calico-node daemonset deploys one pod per node. If deployment takes too long, verify your nodes’ labels: describe your nodes and check whether the projectcalico.org/ds-ready=true label exists. If this label is missing, add it to your nodes using the following command:

kubectl label nodes <node-name> projectcalico.org/ds-ready=true
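
Before labeling, it can help to see which nodes are actually missing the label. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name list_unlabeled_nodes is hypothetical, and the selector '!key' is the standard Kubernetes set-based form for "label not present".

```shell
# Hypothetical helper: list nodes that do NOT carry the ds-ready label.
# The set-based selector '!key' matches objects on which the label is absent.
list_unlabeled_nodes() {
  kubectl get nodes -l '!projectcalico.org/ds-ready' -o name
}

# Usage (requires cluster access):
# list_unlabeled_nodes
```

An empty result suggests that every node already has the label, so a slow rollout probably has another cause.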

Enable Calico Prometheus Metrics

Calico can expose Prometheus metrics natively, but this option is not always enabled.

You can use the following command to turn Prometheus metrics on:

kubectl patch felixconfiguration default --type merge --patch '{"spec":{"prometheusMetricsEnabled": true}}'

You should see output like the following:

felixconfiguration.projectcalico.org/default patched
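
To confirm that metrics are being served after the patch, you can scrape a node directly and count the felix_ metrics in the response. This is a sketch under assumptions: port 9091 is taken from the agent job in this document, <node-ip> is a placeholder, and the helper name count_felix_metrics is hypothetical.

```shell
# Hypothetical helper: count distinct felix_* metric names in Prometheus
# text exposition format read from stdin.
count_felix_metrics() {
  # Drop "# HELP" / "# TYPE" comment lines, cut each line at the first "{"
  # or space to keep only the metric name, then de-duplicate felix_ names.
  grep -v '^#' | sed -E 's/[{ ].*$//' | grep '^felix_' | sort -u | wc -l | tr -d ' '
}

# Usage (run from a host that can reach the node; <node-ip> is a placeholder):
# curl -s http://<node-ip>:9091/metrics | count_felix_metrics
```

A result of 0 suggests that metrics are not yet enabled or that the endpoint is not reachable from where the command runs.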

Installation

Installing an exporter is not required for this integration.

Monitoring and Troubleshooting Calico

Here are some interesting metrics and queries to monitor and troubleshoot Calico.

About the Calico User

Hosts

A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.

  • A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.

  • A host endpoint for one specific real interface is configured by interfaceName: <interface-name>, for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.

Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.

Profiles

Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.

Each Calico endpoint or host endpoint can be assigned to zero or more profiles.

Policies

If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.

Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift left DevOps environment.

Errors

Dataplane Updates Failures and Retries

The dataplane is the foundation of Calico. Calico supports three dataplanes: Linux eBPF, standard Linux, and Windows HNS. The dataplane is responsible for Calico's core functions: base networking, network policy, and IP address management. Watching for dataplane errors is therefore central to monitoring Calico.

rate(felix_int_dataplane_failures[5m])

Ipset Command Failures

IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.

For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.

For that reason, you need to monitor failures of this kind of command in Calico.

rate(felix_ipset_errors[5m])
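
The IP set example described above can be sketched as a handful of commands. The helper below only prints them so that they can be reviewed before anyone runs them as root. The set name blocklist, the function name print_blocklist_cmds, and the address ranges (reserved documentation ranges) are illustrative assumptions; this is not how Calico itself manages its sets.

```shell
# Hypothetical helper: print (do not run) the commands that build an IP set
# from CIDR ranges and reference it from a single iptables rule.
# $1 = set name, remaining arguments = CIDR ranges.
print_blocklist_cmds() {
  set_name="$1"; shift
  echo "ipset create $set_name hash:net"
  for cidr in "$@"; do
    echo "ipset add $set_name $cidr"
  done
  # One rule covers every range in the set; later changes touch only the set.
  echo "iptables -I INPUT -m set --match-set $set_name src -j DROP"
}

# Example:
# print_blocklist_cmds blocklist 192.0.2.0/24 198.51.100.0/24
```

Adding or removing a range later means changing only the set, while the iptables rule stays the same.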

Iptables Save Failures and Iptables Restore Failures

The actual iptables rules are created and customized on the command line with the command iptables for IPv4 and ip6tables for IPv6.

These can be saved in a file with the command iptables-save for IPv4.

Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables

These files can be loaded again with the command iptables-restore for IPv4.

Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables

Calico relies on these commands to program its rules in the standard Linux dataplane, so monitoring their failures is very important.

rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])

Latency

The most useful way to report on latency is to alert on quantiles.

Calico's latency metrics do not provide histogram buckets. Instead, they are exposed as summaries with precomputed quantiles, which are selected through the quantile label. For latency metrics, Calico provides the quantile labels 0.5, 0.9 and 0.99.
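
Because the quantiles arrive as labels on the scraped series, you can read one directly from the raw metrics output when debugging. The helper below is a minimal sketch: the function name summary_quantile is hypothetical, and the endpoint in the usage comment assumes port 9091 as in the agent job in this document.

```shell
# Hypothetical helper: print the value of a summary metric at a given quantile
# from Prometheus text exposition format read from stdin.
# $1 = metric name, $2 = quantile (for example 0.99).
summary_quantile() {
  awk -v m="$1" -v q="$2" '
    # Match lines such as: name{quantile="0.99"} 0.004
    index($0, m "{") == 1 && index($0, "quantile=\"" q "\"") > 0 { print $NF }
  '
}

# Usage (<node-ip> is a placeholder):
# curl -s http://<node-ip>:9091/metrics | summary_quantile felix_int_dataplane_apply_time_seconds 0.99
```

The closing quote is part of the match, so asking for 0.9 does not also return the 0.99 series.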

Latency in Datastore OnUpdate Call

# Latency of datastore OnUpdate calls
felix_calc_graph_update_time_seconds{quantile="0.99"}

# Latency of dataplane updates
felix_int_dataplane_apply_time_seconds{quantile="0.99"}

# Latency to acquire the iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}

Saturation

Batch size is the main indicator of saturation in Calico. Three kinds of batches are available, and each can be analyzed by quantile.

# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}

# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}

# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}

Traffic

Traffic is one of the four golden signals. For Calico, this means monitoring its core network operations. The ipset and iptables commands are the lowest-level interactions in Calico: it issues them whenever it creates, deletes, or updates a network policy.

# Number of ipset commands executed.
rate(felix_ipset_calls[5m])

# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])

# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])

# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])

# Number of iptables-save calls.
rate(felix_iptables_save_calls[5m])

Agent Configuration

The default agent jobs for this integration are as follows:

- job_name: 'calico-node-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-node);(.{0}$)
    replacement: calico
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9091
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (felix_calc_graph_update_time_seconds|felix_cluster_num_hosts|felix_cluster_num_policies|felix_cluster_num_profiles|felix_exec_time_micros|felix_int_dataplane_addr_msg_batch_size|felix_int_dataplane_apply_time_seconds|felix_int_dataplane_failures|felix_int_dataplane_iface_msg_batch_size|felix_int_dataplane_msg_batch_size|felix_ipset_calls|felix_ipset_errors|felix_ipset_lines_executed|felix_iptables_lines_executed|felix_iptables_lock_acquire_secs|felix_iptables_restore_calls|felix_iptables_restore_errors|felix_iptables_save_calls|felix_iptables_save_errors|felix_log_errors|felix_route_table_list_seconds|felix_route_table_per_iface_sync_seconds)
    action: keep
- job_name: 'calico-controller-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    separator: ;
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-kube-controllers);(.{0}$)
    replacement: calico-controller
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico-controller"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9094
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
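
Both jobs rewrite __address__ so that the scrape targets a fixed port (9091 for calico-node, 9094 for calico-kube-controllers) regardless of the port discovered on the pod. The sketch below imitates that relabel rule with sed so the effect can be tried on sample addresses. It is an approximation: the regex is translated to POSIX extended syntax with explicit anchors, since Prometheus anchors relabel regexes implicitly, and the function name rewrite_port is hypothetical.

```shell
# Hypothetical helper: mimic the __address__ relabel rule by keeping the host
# and replacing (or appending) the port.
# $1 = address (host or host:port), $2 = target port.
rewrite_port() {
  # ^ and $ are explicit here because sed does not anchor patterns implicitly.
  printf '%s\n' "$1" | sed -E "s/^([^:]+)(:[0-9]+)?\$/\1:$2/"
}

# Examples:
# rewrite_port 10.0.0.5:8080 9091
# rewrite_port 10.0.0.5 9094
```

In both examples the host part is preserved and the port is forced to the second argument.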