Calico
This integration is disabled by default. Please refer to Enable and Disable Integrations to enable it in your account.
Versions supported: 3.23.3
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 22 metrics.
Timeseries generated: 838 Timeseries
List of Alerts
Alert | Description | Format |
---|---|---|
[Calico-Node] Dataplane Updates Are Failing and Retrying | The update actions for dataplane are failing and retrying several times | Prometheus |
[Calico-Node] IP Set Command Failures | Encountered a number of ipset command failures | Prometheus |
[Calico-Node] IP Tables Restore Failures | Encountered a number of iptables-restore failures | Prometheus |
[Calico-Node] IP Tables Save Failures | Encountered a number of iptables-save failures | Prometheus |
[Calico-Node] Errors While Logging | Encountered a number of errors while logging | Prometheus |
[Calico-Node] Latency Increase in Datastore OnUpdate Call | The duration of datastore OnUpdate calls is increasing | Prometheus |
[Calico-Node] Latency Increase in Dataplane Update | Increased response time for dataplane updates | Prometheus |
[Calico-Node] Latency Increase in Acquire Iptables Lock | Increased response time for acquiring the iptables lock | Prometheus |
[Calico-Node] Latency Increase While Listing All the Interfaces during a Resync | Increased response time for interface listing during a resync | Prometheus |
[Calico-Node] Latency Increase in Interface Resync | Increased response time for interface resync | Prometheus |
[Calico-Node] Fork/Exec Child Processes Results in High Latency | Increased response time for Fork/Exec child processes | Prometheus |
List of Dashboards
Calico
The dashboard provides information on the Calico integration.
List of Metrics
Metric name |
---|
felix_calc_graph_update_time_seconds |
felix_cluster_num_hosts |
felix_cluster_num_policies |
felix_cluster_num_profiles |
felix_exec_time_micros |
felix_int_dataplane_addr_msg_batch_size |
felix_int_dataplane_apply_time_seconds |
felix_int_dataplane_failures |
felix_int_dataplane_iface_msg_batch_size |
felix_int_dataplane_msg_batch_size |
felix_ipset_calls |
felix_ipset_errors |
felix_ipset_lines_executed |
felix_iptables_lines_executed |
felix_iptables_lock_acquire_secs |
felix_iptables_restore_calls |
felix_iptables_restore_errors |
felix_iptables_save_calls |
felix_iptables_save_errors |
felix_log_errors |
felix_route_table_list_seconds |
felix_route_table_per_iface_sync_seconds |
Prerequisites
Verify Calico-Node Pods
Once you configure Calico on your cluster, you should see the calico-node pods deployed on your nodes. The calico-node DaemonSet deploys one pod per node. If deployment takes too long, verify your nodes’ labels.
Describe your nodes and check whether the projectcalico.org/ds-ready=true label exists. If this label is missing, add it to your nodes using the following command:
kubectl label nodes <node-name> projectcalico.org/ds-ready=true
Enable Calico Prometheus Metrics
Calico can expose Prometheus metrics natively, but this option is not always enabled.
Use the following command to turn Prometheus metrics on:
kubectl patch felixconfiguration default --type merge --patch '{"spec":{"prometheusMetricsEnabled": true}}'
You should see output like the following:
felixconfiguration.projectcalico.org/default patched
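Once the patch is applied, each calico-node pod should serve Felix metrics in the Prometheus text format. The agent job later on this page scrapes port 9091, so that port is assumed here. The following Python sketch shows one way to check a scrape for felix_ metrics. The function name and the sample payload are illustrative and are not part of Calico or Sysdig.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_NAME = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")

def felix_metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names starting with 'felix_'."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _NAME.match(line)
        if match and match.group(1).startswith("felix_"):
            names.add(match.group(1))
    return sorted(names)

# Illustrative payload. A real one would come from the pod on port 9091
# (port taken from the agent job on this page; the /metrics path is the
# usual Prometheus default and is an assumption here).
SAMPLE = """\
# TYPE felix_cluster_num_hosts gauge
felix_cluster_num_hosts 3
felix_ipset_errors 0
go_goroutines 42
"""

print(felix_metric_names(SAMPLE))
```

If the list comes back empty for a real scrape, the felixconfiguration patch has probably not taken effect on that node yet.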
Installation
Installing an exporter is not required for this integration.
Monitoring and Troubleshooting Calico
Here are some interesting metrics and queries to monitor and troubleshoot Calico.
About the Calico User
Hosts
A host endpoint resource (HostEndpoint) represents one or more real or virtual interfaces attached to a host that is running Calico. It enforces Calico policy on the traffic that is entering or leaving the host’s default network namespace through those interfaces.
A host endpoint with interfaceName: * represents all of a host’s real or virtual interfaces.
A host endpoint for one specific real interface is configured by setting interfaceName to the name of that interface, for example interfaceName: eth0, or by leaving interfaceName empty and including one of the interface’s IPs in expectedIPs.
Each host endpoint may include a set of labels and list of profiles that Calico will use to apply policy to the interface.
Profiles
Profiles provide a way to group multiple endpoints so that they inherit a shared set of labels. For historic reasons, Profiles can also include policy rules, but that feature is deprecated in favor of the much more flexible NetworkPolicy and GlobalNetworkPolicy resources.
Each Calico endpoint or host endpoint can be assigned to zero or more profiles.
Policies
If you are new to Kubernetes, start with “Kubernetes policy” and learn the basics of enforcing policy for pod traffic. The good news is, Kubernetes and Calico policies are very similar and work alongside each other – so managing both types is easy.
Kubernetes network policy lets developers secure access to and from their applications using the same simple language they use to deploy them. Developers can focus on their applications without understanding low-level networking concepts. Enabling developers to easily secure their applications using network policies supports a shift left DevOps environment.
Errors
Dataplane Updates Failures and Retries
The dataplane is the foundation of Calico. Calico supports three dataplanes: Linux eBPF, standard Linux, and Windows HNS. The dataplane is responsible for the core parts of Calico: base networking, network policy, and IP address management. Watching for dataplane errors is therefore a cornerstone of Calico monitoring.
rate(felix_int_dataplane_failures[5m])
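The query applies rate() to a counter, so it reports how fast failures accumulate rather than their absolute total. As a rough mental model only, the per-second rate over a window can be approximated from raw samples as in the Python sketch below. Prometheus additionally extrapolates to the window boundaries, which this sketch omits. The function name and the sample values are hypothetical.

```python
def approx_rate(samples):
    """Approximate the per-second increase of a counter.

    samples: list of (timestamp_seconds, value) pairs in time order.
    A drop in value is treated as a counter reset (for example, a
    process restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return None  # not enough data points in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Hypothetical scrapes taken one minute apart over a 5-minute window:
# the counter grows from 10 to 16, that is, 6 failures in 300 seconds.
window = [(0, 10), (60, 10), (120, 12), (180, 13), (240, 13), (300, 16)]
print(approx_rate(window))
```

A value that stays above zero across several windows means dataplane updates keep failing and being retried, which is the condition the corresponding alert describes.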
Ipset Command Failures
IP sets are stored collections of IP addresses, network ranges, MAC addresses, port numbers, and network interface names. The iptables tool can leverage IP sets for more efficient rule matching.
For example, let’s say you want to drop traffic that originates from one of several IP address ranges that you know to be malicious. Instead of configuring rules for each range in iptables directly, you can create an IP set and then reference that set in an iptables rule. This makes your rule sets dynamic and therefore easier to configure; whenever you need to add or swap out network identifiers that are handled by the firewall, you simply change the IP set.
For that reason, you need to monitor failures of this kind of command in Calico.
rate(felix_ipset_errors[5m])
Iptables Save Failures and Iptables Restore Failures
The actual iptables rules are created and customized on the command line with the iptables command for IPv4 and the ip6tables command for IPv6.
These rules can be saved to a file with the iptables-save command for IPv4:
Debian/Ubuntu: iptables-save > /etc/iptables/rules.v4
RHEL/CentOS: iptables-save > /etc/sysconfig/iptables
These files can be loaded again with the iptables-restore command for IPv4:
Debian/Ubuntu: iptables-restore < /etc/iptables/rules.v4
RHEL/CentOS: iptables-restore < /etc/sysconfig/iptables
Calico depends on these operations, so monitoring their failures is very important.
rate(felix_iptables_save_errors[5m])
rate(felix_iptables_restore_errors[5m])
Latency
The most useful way to report on latency is to alert on quantiles.
Calico metrics do not provide buckets. Instead, they summarize the distribution with specific labels. For latency metrics, Calico provides the quantile labels 0.5, 0.9, and 0.99.
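To make the quantile labels concrete, the Python sketch below parses a few exposition lines and collects the value reported for each quantile. The sample values are invented for illustration and the helper name is hypothetical. Only the metric name and the quantile labels come from this page.

```python
import re

# One summary sample line: name{quantile="0.99"} value
# Simplification: assumes quantile is the only label on the line.
_LINE = re.compile(r'^(\w+)\{quantile="([^"]+)"\}\s+(\S+)$')

def quantiles(exposition_text, metric):
    """Map quantile -> value for one summary metric."""
    result = {}
    for line in exposition_text.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group(1) == metric:
            result[float(match.group(2))] = float(match.group(3))
    return result

# Invented sample values, in seconds.
SAMPLE = """\
felix_int_dataplane_apply_time_seconds{quantile="0.5"} 0.004
felix_int_dataplane_apply_time_seconds{quantile="0.9"} 0.012
felix_int_dataplane_apply_time_seconds{quantile="0.99"} 0.250
felix_int_dataplane_apply_time_seconds_sum 12.5
felix_int_dataplane_apply_time_seconds_count 1840
"""

q = quantiles(SAMPLE, "felix_int_dataplane_apply_time_seconds")
print(q[0.99])
```

A large gap between the 0.5 and 0.99 values suggests that a small share of calls is much slower than the typical one, which is why the queries on this page focus on the 0.99 quantile.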
Latency in Datastore OnUpdate Call
# Latency in datastore OnUpdate calls
felix_calc_graph_update_time_seconds{quantile="0.99"}
# Latency on dataplane update
felix_int_dataplane_apply_time_seconds{quantile="0.99"}
# Latency on acquire iptables lock
felix_iptables_lock_acquire_secs{quantile="0.99"}
# Latency to list all the interfaces during a resync
felix_route_table_list_seconds{quantile="0.99"}
Saturation
Batch size is the way to monitor saturation in Calico. There are three kinds of batches, and each can be analyzed by quantile.
# Number of messages processed in each batch
felix_int_dataplane_msg_batch_size{quantile="0.99"}
# Interface state messages processed in each batch
felix_int_dataplane_iface_msg_batch_size{quantile="0.99"}
# Interface address messages processed in each batch
felix_int_dataplane_addr_msg_batch_size{quantile="0.99"}
Traffic
Traffic is one of the four golden signals to monitor. For Calico, this means monitoring the core networking operations.
ipset and iptables commands are the lowest-level interactions in Calico. Calico runs them whenever it needs to create, destroy, or update a network policy.
# Number of ipset commands executed.
rate(felix_ipset_calls[5m])
# Number of ipset operations executed.
rate(felix_ipset_lines_executed[5m])
# Number of iptables rule updates executed.
rate(felix_iptables_lines_executed[5m])
# Number of iptables-restore calls.
rate(felix_iptables_restore_calls[5m])
# Number of iptables-save calls.
rate(felix_iptables_save_calls[5m])
Agent Configuration
The default agent jobs for this integration are as follows:
- job_name: 'calico-node-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-node);(.{0}$)
    replacement: calico
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9091
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (felix_calc_graph_update_time_seconds|felix_cluster_num_hosts|felix_cluster_num_policies|felix_cluster_num_profiles|felix_exec_time_micros|felix_int_dataplane_addr_msg_batch_size|felix_int_dataplane_apply_time_seconds|felix_int_dataplane_failures|felix_int_dataplane_iface_msg_batch_size|felix_int_dataplane_msg_batch_size|felix_ipset_calls|felix_ipset_errors|felix_ipset_lines_executed|felix_iptables_lines_executed|felix_iptables_lock_acquire_secs|felix_iptables_restore_calls|felix_iptables_restore_errors|felix_iptables_save_calls|felix_iptables_save_errors|felix_log_errors|felix_route_table_list_seconds|felix_route_table_per_iface_sync_seconds)
    action: keep
- job_name: 'calico-controller-default'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    separator: ;
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (calico-kube-controllers);(.{0}$)
    replacement: calico-controller
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "calico-controller"
  - action: replace
    source_labels: [__address__]
    regex: ([^:]+)(?::\d+)?
    replacement: $1:9094
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
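To see what two of the relabel rules in these jobs do, the following Python sketch applies them to example inputs. It is only an approximation: Prometheus uses RE2 and anchors the whole regex, which re.fullmatch imitates here, and the $1 replacement is written as \1 in Python. The relabel_replace helper and the example addresses are hypothetical.

```python
import re

def relabel_replace(value, regex, replacement):
    """Mimic a Prometheus 'replace' action on one joined source value.

    Returns the new label value, or None when the regex does not match
    (Prometheus then leaves the target label unchanged).
    """
    match = re.fullmatch(regex, value)
    return match.expand(replacement) if match else None

# Rule 1: force the scrape port to 9091, whatever port was discovered.
ADDR = r"([^:]+)(?::\d+)?"
print(relabel_replace("10.0.0.5:8080", ADDR, r"\1:9091"))
print(relabel_replace("10.0.0.5", ADDR, r"\1:9091"))

# Rule 2: source labels are joined with ';'. The pod is tagged 'calico'
# only when the container is calico-node AND the integration_type
# annotation is empty ('.{0}$' matches an empty string only).
TYPE = r"(calico-node);(.{0}$)"
print(relabel_replace("calico-node;", TYPE, "calico"))
print(relabel_replace("calico-node;custom", TYPE, "calico"))
```

The second rule explains why a pod that already carries an integration_type annotation is left alone: the regex matches only when that annotation is empty, so an explicit annotation takes precedence over the default.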