Istio

Metrics, Dashboards, Alerts and more for Istio Integration in Sysdig Monitor.

This integration is enabled by default.

Versions supported: 1.19

This integration is out-of-the-box, so it doesn’t require any exporter.

This integration has 28 metrics.

Timeseries generated: For a ten-node cluster, the number of timeseries can vary between 1K and 10K due to the presence of high cardinality metrics. The number of workloads in your cluster, the number of Istio policies, and the Istio network rules applied determine this number.

List of Alerts

Alert | Description | Format
[Istio-Citadel] CSR without success | Some Certificate Signing Requests (CSRs) were not completed successfully | Prometheus
[Istio-Pilot] Inbound listener rules conflicts | There are conflicts among inbound listener rules | Prometheus
[Istio-Pilot] Endpoint found in unready state | Endpoint found in unready state | Prometheus
[Istio] Unstable requests for sidecar injections | Sidecar injection requests are failing | Prometheus
[Istiod] Istiod Uptime issue | Istiod uptime is taking more time than usual | Prometheus

List of Dashboards

Istio Control Plane

The dashboard provides information on the Istio Control Plane, Pilot, Galley, Mixer and Citadel.

List of Metrics

Metric name
citadel_server_csr_count
citadel_server_success_cert_issuance_count
galley_validation_failed
galley_validation_passed
istiod_uptime_seconds
pilot_conflict_inbound_listener
pilot_conflict_outbound_listener_http_over_current_tcp
pilot_conflict_outbound_listener_tcp_over_current_http
pilot_conflict_outbound_listener_tcp_over_current_tcp
pilot_endpoint_not_ready
pilot_services
pilot_total_xds_internal_errors
pilot_total_xds_rejects
pilot_virt_services
pilot_xds
pilot_xds_cds_reject
pilot_xds_config_size_bytes_bucket
pilot_xds_eds_reject
pilot_xds_lds_reject
pilot_xds_push_context_errors
pilot_xds_push_time_bucket
pilot_xds_pushes
pilot_xds_rds_reject
pilot_xds_send_time_bucket
pilot_xds_write_timeout
sidecar_injection_failure_total
sidecar_injection_requests_total
sidecar_injection_success_total

Prerequisites

None.

Installation

Installing an exporter is not required for this integration.

Monitoring and Troubleshooting Istio

This document summarizes the alarms and dashboards for Istio services. Istio services are built on network rules, so the alarms and dashboards monitor problems related to traffic and connections between source and destination.

Alarms

Most of the alarms associated with Istio configuration report problems with the Pilot or Citadel server. These servers are responsible for important Istio configuration.

Citadel controls authentication and identity management between services, and manages certificates in every workload.

Pilot accepts the traffic behavior rules provided by the control plane and converts them into configurations applied by Envoy, based on how configuration aspects are managed locally. In short, Pilot is responsible for distributing proxy configuration to every workload.

CSR Without Success

Alarms are defined to notify you of faulty Certificate Signing Requests (CSRs). In order to collect that information, the following metrics are used:

  • citadel_server_csr_count
  • citadel_server_success_cert_issuance_count

rate(citadel_server_csr_count[5m]) - rate(citadel_server_success_cert_issuance_count[5m]) > 0
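The alarm expression above can be read as simple arithmetic. The following Python sketch approximates it with made-up counter samples and a hypothetical helper named per_second_rate. It is an illustration only: it ignores counter resets and the extrapolation that Prometheus applies at the edges of the window.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter over a window.

    Simplified stand-in for PromQL rate(): no counter-reset handling
    and no extrapolation to the window boundaries.
    """
    return (last - first) / window_seconds


# Hypothetical counter samples taken 5 minutes (300 seconds) apart.
WINDOW_SECONDS = 300

# citadel_server_csr_count grew by 60 in the window.
csr_rate = per_second_rate(first=1200, last=1260, window_seconds=WINDOW_SECONDS)

# citadel_server_success_cert_issuance_count grew by only 54.
issued_rate = per_second_rate(first=1190, last=1244, window_seconds=WINDOW_SECONDS)

# The alarm condition: more CSRs arrived than certificates were issued.
unanswered_rate = csr_rate - issued_rate
alarm_fires = unanswered_rate > 0
```

With these sample values, six requests in the window did not result in an issued certificate, so the condition is true and the alarm would fire.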

What is a CSR: A certificate signing request (CSR) is one of the first steps towards getting your own SSL/TLS certificate. Generated on the same server you plan to install the certificate on, the CSR contains information such as common name, organization, and country. The Certificate Authority (CA) uses the CSR to create your certificate. The CSR also contains the public key that will be included in your certificate and is signed with the corresponding private key.

Inbound Listener Rules Conflicts

Istio works with networking rules and configures IP addresses, ports, sockets, and so on to send or receive traffic. The term listeners refers to these configurable values. Be aware of possible errors or conflicts with these rules.

pilot_conflict_inbound_listener > 0

Endpoint Found in Unready State

To have a stable platform, you need to verify that all endpoints in your network are working correctly. Use the following alarm to collect that information:

pilot_endpoint_not_ready > 0

Unstable Requests for Sidecar Injections

Istio configures a sidecar container in every pod and uses this sidecar as the frontend server for all the requests that go to or from that workload. To check whether sidecar injection is working properly, use the following query:

rate(sidecar_injection_requests_total[5m]) - rate(sidecar_injection_success_total[5m]) > 0

Dashboards

Traffic

Traffic is the first golden signal to gather. Because Istio itself provides traffic management, the information it reports is detailed. Istio has three different parts that you can monitor, each with its own metrics: the control plane, Envoy, and the service itself.

This example shows how to gather information about Istio service traffic.

Use the istio_requests_total metric with the relevant labels to collect a wide range of information on different panels.

Client Request Volume and Server Request Volume

The istio_requests_total metric shows the total request traffic from both sides of the connection, using the reporter label.

The reporter label identifies the reporter of the request. It is set to destination if the report is from an Istio proxy server. It is set to source if the report is from an Istio proxy client or a gateway.

sum (irate(istio_requests_total{reporter="source"}[5m]))
sum (irate(istio_requests_total{reporter="destination"}[5m]))

Incoming Request by Source/Destination and Response Code

This dashboard shows the requests received by both source and destination using the reporter label. The following query segments the HTTP codes with the response_code label.

sum(irate(istio_requests_total{reporter="source"}[5m])) by (source_workload, source_workload_namespace, response_code)
sum(irate(istio_requests_total{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, response_code)

Client/Server Success Rate (non-5xx responses)

The following queries build a dashboard that monitors the percentage of traffic that did not end in a server error (5xx). The reporter label is used to segment on both source and destination.

100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="destination"}[5m])))
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="source",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="source"}[5m])))
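The success-rate calculation can be sketched outside PromQL as well. The following Python example uses invented per-series request rates and a hypothetical function named success_rate_by_service. It mirrors the division in the queries above under the assumption that each input row corresponds to one istio_requests_total series after irate() has been applied.

```python
from collections import defaultdict


def success_rate_by_service(samples):
    """Return the percentage of non-5xx traffic per destination service.

    samples: iterable of (destination_service_name, response_code, requests_per_second).
    """
    non_5xx = defaultdict(float)
    total = defaultdict(float)
    for service, response_code, rate in samples:
        total[service] += rate
        # Equivalent to the PromQL matcher response_code!~"5.*":
        # the regex is fully anchored, so it excludes codes starting with "5".
        if not response_code.startswith("5"):
            non_5xx[service] += rate
    # Services without traffic are skipped to avoid dividing by zero.
    return {s: 100 * non_5xx[s] / total[s] for s in total if total[s] > 0}


# Hypothetical request rates (requests per second) per service and response code.
samples = [
    ("checkout", "200", 9.0),
    ("checkout", "503", 1.0),
    ("cart", "200", 4.0),
    ("cart", "404", 1.0),
]
rates = success_rate_by_service(samples)
```

In this sample data, checkout has a 90% success rate, while cart stays at 100% because 4xx responses are not counted as server errors.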

Errors

The errors summarized in these dashboards are related to HTTP traffic managed by Istio proxies.

4xx Response Code by Source/Destination

The following queries build a dashboard that reports all client error (4xx) responses. They use the reporter label on both source and destination. The trailing or absent(...) - 1 clause returns 0 when no matching series exists, so the panel shows zero instead of no data.

sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"4.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"4.*"}) -1

5xx Response Code by Source/Destination

The following queries build a dashboard that shows all server error (5xx) responses. The queries use the reporter label on both source and destination.

sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"5.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"5.*"}) -1

Latency and Saturation

Latency and saturation are reported together on these dashboards because they are derived from request duration and from request and response size.

Client/Server Request Duration

The following queries build a dashboard that shows request duration at a given quantile. The metric is recorded in milliseconds, so dividing by 1000 converts the result to seconds.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by (le, source_service_name)) / 1000
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000
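To make the quantile queries above easier to interpret, the following Python sketch shows a simplified version of the linear interpolation that histogram_quantile performs over cumulative buckets. The function, the bucket bounds, and the counts are illustrative assumptions. The sketch assumes positive bucket bounds and does not cover every edge case that Prometheus handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with a math.inf bucket, like the le label of a *_bucket metric.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets in milliseconds.
duration_buckets_ms = [(50, 20), (100, 60), (250, 90), (math.inf, 100)]

p50_ms = histogram_quantile(0.50, duration_buckets_ms)
p50_seconds = p50_ms / 1000  # same conversion as the "/ 1000" in the queries
p95_ms = histogram_quantile(0.95, duration_buckets_ms)
```

With these buckets, the median is interpolated inside the 50 to 100 ms bucket, while the 0.95 quantile falls in the +Inf bucket and is reported as the highest finite bound. The result is therefore only as precise as the bucket layout allows.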

Incoming Request Size by Source/Destination

The following queries build a dashboard that shows request size at a given quantile.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))

Response Size By Source/Destination

The following queries build a dashboard that shows response size at a given quantile.

Note: quantiles can be modified.

histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))

Agent Configuration

The default agent job for this integration is as follows:

- job_name: 'istiod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (discovery);(.{0}$)
    replacement: istiod
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "istiod"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (citadel_server_csr_count|citadel_server_success_cert_issuance_count|galley_validation_failed|galley_validation_passed|istiod_uptime_seconds|pilot_conflict_inbound_listener|pilot_conflict_outbound_listener_http_over_current_tcp|pilot_conflict_outbound_listener_tcp_over_current_http|pilot_conflict_outbound_listener_tcp_over_current_tcp|pilot_endpoint_not_ready|pilot_services|pilot_total_xds_internal_errors|pilot_total_xds_rejects|pilot_virt_services|pilot_xds|pilot_xds_cds_reject|pilot_xds_config_size_bytes_bucket|pilot_xds_eds_reject|pilot_xds_lds_reject|pilot_xds_push_context_errors|pilot_xds_push_time_bucket|pilot_xds_pushes|pilot_xds_rds_reject|pilot_xds_send_time_bucket|pilot_xds_write_timeout|sidecar_injection_failure_total|sidecar_injection_requests_total|sidecar_injection_success_total|istio_build|istio_request_bytes_bucket|istio_request_duration_milliseconds_bucket|istio_requests_total|istio_response_bytes_bucket|istio_tcp_received_bytes_total|istio_tcp_sent_bytes_total|pilot_proxy_convergence_time_bucket)
    action: keep
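The replace rules in this job follow the usual Prometheus relabeling pattern: the source label values are joined with a semicolon, the regex must match the whole joined string, and $1, $2 in the replacement refer to capture groups. The following Python sketch imitates that behavior for two of the rules above. The helper relabel_replace and the sample addresses are hypothetical, and Python's re module stands in for the RE2 engine that Prometheus uses.

```python
import re


def relabel_replace(source_values, regex, replacement, separator=";"):
    """Imitate a relabel rule with action: replace.

    Returns the new target label value, or None when the regex does not
    match (in which case the target label is left unchanged).
    """
    joined = separator.join(source_values)
    match = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    if match is None:
        return None
    # Translate $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)


ADDRESS_REGEX = r"([^:]+)(?::\d+)?;(\d+)"

# __address__ plus the prometheus.io/port annotation: the port is swapped in.
with_port = relabel_replace(["10.0.0.5:8080", "15014"], ADDRESS_REGEX, "$1:$2")

# No port annotation: the rule does not match, so __address__ stays as it was.
without_annotation = relabel_replace(["10.0.0.5:8080", ""], ADDRESS_REGEX, "$1:$2")

# A container named "discovery" with an empty integration-type annotation
# is labeled as istiod; an explicit annotation value is never overwritten.
TYPE_REGEX = r"(discovery);(.{0}$)"
defaulted = relabel_replace(["discovery", ""], TYPE_REGEX, "istiod")
kept = relabel_replace(["discovery", "custom"], TYPE_REGEX, "istiod")
```

Here with_port becomes 10.0.0.5:15014 and defaulted becomes istiod, while the two non-matching cases return None, which corresponds to leaving the label untouched.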