Calico

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

List of Alerts:

| Alert | Description | Format |
|---|---|---|
| [Calico-Node] Dataplane Updates Are Failing and Retrying | The update actions for the dataplane are failing and retrying several times | Prometheus |
| [Calico-Node] IP Set Command Failures | Encountered a number of ipset command failures | Prometheus |
| [Calico-Node] IP Tables Restore Failures | Encountered a number of iptables restore failures | Prometheus |
| [Calico-Node] IP Tables Save Failures | Encountered a number of iptables save failures | Prometheus |
| [Calico-Node] Errors While Logging | Encountered a number of errors while logging | Prometheus |
| [Calico-Node] Latency Increase in Datastore OnUpdate Call | The duration of datastore OnUpdate calls is increasing | Prometheus |
| [Calico-Node] Latency Increase in Dataplane Update | Increased response time for dataplane updates | Prometheus |
| [Calico-Node] Latency Increase in Acquire Iptables Lock | Increased response time for acquiring the iptables lock | Prometheus |
| [Calico-Node] Latency Increase While Listing All the Interfaces during a Resync | Increased response time for interface listing during a resync | Prometheus |
| [Calico-Node] Latency Increase in Interface Resync | Increased response time for interface resync | Prometheus |
| [Calico-Node] Fork/Exec Child Processes Results in High Latency | Increased response time for Fork/Exec child processes | Prometheus |

List of Dashboards:

  • Calico

List of Metrics:

  • felix_calc_graph_update_time_seconds
  • felix_cluster_num_hosts
  • felix_cluster_num_policies
  • felix_cluster_num_profiles
  • felix_exec_time_micros
  • felix_int_dataplane_addr_msg_batch_size
  • felix_int_dataplane_apply_time_seconds
  • felix_int_dataplane_failures
  • felix_int_dataplane_iface_msg_batch_size
  • felix_int_dataplane_msg_batch_size
  • felix_ipset_calls
  • felix_ipset_errors
  • felix_ipset_lines_executed
  • felix_iptables_lines_executed
  • felix_iptables_lock_acquire_secs
  • felix_iptables_restore_calls
  • felix_iptables_restore_errors
  • felix_iptables_save_calls
  • felix_iptables_save_errors
  • felix_log_errors
  • felix_route_table_list_seconds
  • felix_route_table_per_iface_sync_seconds
  • go_build_info
  • go_gc_duration_seconds
  • go_gc_duration_seconds_count
  • go_gc_duration_seconds_sum
  • go_goroutines
  • go_memstats_buck_hash_sys_bytes
  • go_memstats_gc_sys_bytes
  • go_memstats_heap_alloc_bytes
  • go_memstats_heap_idle_bytes
  • go_memstats_heap_inuse_bytes
  • go_memstats_heap_released_bytes
  • go_memstats_heap_sys_bytes
  • go_memstats_lookups_total
  • go_memstats_mallocs_total
  • go_memstats_mcache_inuse_bytes
  • go_memstats_mcache_sys_bytes
  • go_memstats_mspan_inuse_bytes
  • go_memstats_mspan_sys_bytes
  • go_memstats_next_gc_bytes
  • go_memstats_stack_inuse_bytes
  • go_memstats_stack_sys_bytes
  • go_memstats_sys_bytes
  • go_threads
  • process_cpu_seconds_total
  • process_max_fds
  • process_open_fds

Monitoring and Troubleshooting Calico

Here are some interesting metrics and queries to monitor and troubleshoot Calico.

About the Calico User

Hosts

A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.

  • A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.

  • A host endpoint for one specific real interface is configured by interfaceName: <name-of-that-interface>, for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.

Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.
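
As a rough illustration of the fields described above, the following sketch builds HostEndpoint manifests as Python dictionaries and models, in simplified form, which interfaces each one would select. The node name, interface names, IP addresses, labels, profile name, and both helper functions are hypothetical. The `selected_interfaces` helper only mirrors the rules stated in this section and is not Calico's implementation.

```python
import json


def host_endpoint(name, node, interface_name="", expected_ips=(), labels=None, profiles=()):
    """Build a HostEndpoint manifest (projectcalico.org/v3) as a plain dict."""
    spec = {"node": node}
    if interface_name:
        spec["interfaceName"] = interface_name
    if expected_ips:
        spec["expectedIPs"] = list(expected_ips)
    if profiles:
        spec["profiles"] = list(profiles)
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "HostEndpoint",
        "metadata": {"name": name, "labels": dict(labels or {})},
        "spec": spec,
    }


def selected_interfaces(endpoint, host_interfaces):
    """Simplified model of the selection rules described in this section.

    host_interfaces maps an interface name to the list of its IP addresses.
    """
    spec = endpoint["spec"]
    name = spec.get("interfaceName", "")
    if name == "*":
        # Wildcard: every interface on the host.
        return sorted(host_interfaces)
    if name:
        # One specific interface, selected by name.
        return [name] if name in host_interfaces else []
    # Empty interfaceName: select by matching one of the expected IPs.
    expected = set(spec.get("expectedIPs", []))
    return sorted(i for i, ips in host_interfaces.items() if expected & set(ips))


# Hypothetical host with two interfaces (documentation address ranges).
interfaces = {"eth0": ["192.0.2.10"], "eth1": ["198.51.100.7"]}

all_ifaces = host_endpoint("node1-all", "node1", interface_name="*")
by_name = host_endpoint("node1-eth0", "node1", interface_name="eth0",
                        labels={"role": "worker"}, profiles=["prod-hosts"])
by_ip = host_endpoint("node1-by-ip", "node1", expected_ips=["198.51.100.7"])

print(json.dumps(by_name, indent=2))
```

The three endpoints correspond to the three configuration styles above: the wildcard, a named interface, and an empty interfaceName combined with expectedIPs.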

Profiles

Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.

Each Calico endpoint or host endpoint can be assigned to zero or more profiles.

Policies

If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.

Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift-left DevOps environment.

Errors

Dataplane Updates Failures and Retries

The dataplane is the foundation of Calico. Calico supports three dataplanes: Linux eBPF, standard Linux, and Windows HNS. The dataplane is responsible for Calico's core functions: base networking, network policy, and IP address management. Watching for dataplane errors is therefore a cornerstone of Calico monitoring.

rate(felix_int_dataplane_failures[5m])
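
The query above applies rate() to a counter, which yields a per-second rate of increase over the window. The sketch below approximates that calculation on a handful of made-up samples to show how the result should be read. The function name and the sample values are hypothetical, and unlike the real rate() function this version does not extrapolate to the window boundaries.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples is a list of (timestamp_seconds, counter_value) pairs in time order.
    A drop in value is treated as a counter reset (for example after a restart),
    so the new value is counted as the increase since the reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up samples of a failure counter scraped every 60 seconds over 5 minutes.
# The counter restarts from zero between t=180 and t=240.
window = [(0, 10), (60, 10), (120, 13), (180, 13), (240, 2), (300, 5)]

print(simple_rate(window))
```

A result above zero means the counter grew during the window, which for an error counter such as this one means new failures occurred.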

Ipset Command Failures

IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.

For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.
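
The idea in the paragraph above can be sketched as a toy model: one fixed rule refers to a set, and only the set changes. The `IPSet` class, the `firewall_verdict` function, and the address ranges (taken from the documentation ranges) are hypothetical and exist only for illustration; real IP sets live in the kernel and are managed with the ipset tool, as noted in the comments.

```python
import ipaddress


class IPSet:
    """Toy stand-in for a kernel IP set that holds network ranges."""

    def __init__(self):
        self.networks = set()

    def add(self, cidr):
        self.networks.add(ipaddress.ip_network(cidr))

    def remove(self, cidr):
        self.networks.discard(ipaddress.ip_network(cidr))

    def contains(self, address):
        ip = ipaddress.ip_address(address)
        return any(ip in net for net in self.networks)


def firewall_verdict(source_ip, blocklist):
    """The single 'rule': drop traffic whose source is in the set."""
    return "DROP" if blocklist.contains(source_ip) else "ACCEPT"


# Roughly equivalent commands on a real host (run as root):
#   ipset create blocklist hash:net
#   ipset add blocklist 203.0.113.0/24
#   iptables -I INPUT -m set --match-set blocklist src -j DROP
blocklist = IPSet()
blocklist.add("203.0.113.0/24")

before = firewall_verdict("198.51.100.9", blocklist)

# Blocking another range only changes the set; the rule itself is untouched.
blocklist.add("198.51.100.0/24")
after = firewall_verdict("198.51.100.9", blocklist)

print(before, after)
```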

For that reason, we need to monitor failures of this kind of command in Calico.

rate(felix_ipset_errors[5m])

Iptables Save Failures and Iptables Restore Failures

The actual iptables rules are created and customized on the command line with the command iptables for IPv4 and ip6tables for IPv6.

These can be saved in a file with the command iptables-save for IPv4.

Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables

These files can be loaded again with the command iptables-restore for IPv4.

Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables

Managing these rules is central to what Calico does, so monitoring failures of these operations is very important.

rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])

Latency

The most useful way to report on latency is to alert on quantiles.

Calico does not expose histogram buckets for its latency metrics; it exports them as summaries with the quantile labels 0.5, 0.9, and 0.99.
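
To make the shape of these metrics concrete, the sketch below parses a small sample in the Prometheus text exposition format and collects the value reported for each quantile. The sample values are made up, the `summary_quantiles` helper is hypothetical, and the parser is deliberately minimal: it assumes quantile is the only label on each line.

```python
import re

# Made-up scrape output for one latency summary; real values will differ.
SAMPLE = """\
# TYPE felix_int_dataplane_apply_time_seconds summary
felix_int_dataplane_apply_time_seconds{quantile="0.5"} 0.004
felix_int_dataplane_apply_time_seconds{quantile="0.9"} 0.012
felix_int_dataplane_apply_time_seconds{quantile="0.99"} 0.25
felix_int_dataplane_apply_time_seconds_sum 3.5
felix_int_dataplane_apply_time_seconds_count 420
"""

# Matches lines such as: metric_name{quantile="0.99"} 0.25
QUANTILE_LINE = re.compile(r'^(?P<name>\w+)\{quantile="(?P<q>[^"]+)"\}\s+(?P<value>\S+)$')


def summary_quantiles(text, metric):
    """Return {quantile: value} for one summary metric in exposition text."""
    result = {}
    for line in text.splitlines():
        match = QUANTILE_LINE.match(line.strip())
        if match and match.group("name") == metric:
            result[match.group("q")] = float(match.group("value"))
    return result


quantiles = summary_quantiles(SAMPLE, "felix_int_dataplane_apply_time_seconds")
print(quantiles)
```

Because the quantiles are computed inside each Calico process, they are read directly by label rather than derived from buckets at query time.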

Latency in Datastore and Dataplane Operations

# Latency of datastore OnUpdate calls
felix_calc_graph_update_time_seconds{quantile="0.99"}

# Latency of dataplane updates
felix_int_dataplane_apply_time_seconds{quantile="0.99"}

# Latency of acquiring the iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}

# Latency of listing all the interfaces during a resync
felix_route_table_list_seconds{quantile="0.99"}

Saturation

Batch size is the main indicator of saturation in Calico. Three kinds of batches can be analyzed, each of them by quantile.

# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}

# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}

# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}

Traffic

Traffic is one of the four golden signals. For Calico, this means monitoring its core network operations. Ipset and iptables commands are the lowest-level interactions in Calico, and Calico issues them whenever it creates, updates, or destroys a network policy.

# Number of ipset commands executed.
rate(felix_ipset_calls[5m])

# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])

# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])

# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])

# Number of iptables-save calls.
rate(felix_iptables_save_calls[5m])
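
The queries in this section can also be evaluated outside a dashboard. The following sketch assumes a Prometheus-compatible HTTP API that serves instant queries at /api/v1/query; the base URL, the helper names, and the canned response are hypothetical, and a hosted service may use a different path and require authentication. The example only builds the request URL and parses a response body, without making a network call.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against a Prometheus-compatible API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-vector query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


# Hypothetical endpoint; replace with the API address of your own setup.
url = instant_query_url("http://prometheus.example:9090/", "rate(felix_ipset_calls[5m])")
print(url)

# Canned response with made-up values, in the shape of the Prometheus HTTP API.
canned = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node1:9091"}, "value": [1700000000, "0.25"]},
        ],
    },
})
series = parse_instant_vector(canned)
print(series)
```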