Istio
This integration is enabled by default.
Versions supported: 1.19
This integration is out-of-the-box, so it doesn’t require any exporter.
This integration has 28 metrics.
Timeseries generated: For a ten-node cluster, the number of timeseries can vary between 1K and 10K due to the presence of high cardinality metrics. The number of workloads in your cluster, the number of Istio policies, and the Istio network rules applied determine this number.
List of Alerts
Alert | Description | Format |
---|---|---|
[Istio-Citadel] CSR without success | Some Certificate Signing Requests (CSRs) did not complete successfully | Prometheus |
[Istio-Pilot] Inbound listener rules conflicts | There are conflicts with inbound listener rules | Prometheus |
[Istio-Pilot] Endpoint found in unready state | Endpoint found in unready state | Prometheus |
[Istio] Unstable requests for sidecar injections | Sidecar injection requests are failing | Prometheus |
[Istiod] Istiod Uptime issue | Istiod UpTime is taking more time than usual | Prometheus |
List of Dashboards
Istio Control Plane
The dashboard provides information on the Istio Control Plane, Pilot, Galley, Mixer and Citadel.
List of Metrics
Metric name |
---|
citadel_server_csr_count |
citadel_server_success_cert_issuance_count |
galley_validation_failed |
galley_validation_passed |
istiod_uptime_seconds |
pilot_conflict_inbound_listener |
pilot_conflict_outbound_listener_http_over_current_tcp |
pilot_conflict_outbound_listener_tcp_over_current_http |
pilot_conflict_outbound_listener_tcp_over_current_tcp |
pilot_endpoint_not_ready |
pilot_services |
pilot_total_xds_internal_errors |
pilot_total_xds_rejects |
pilot_virt_services |
pilot_xds |
pilot_xds_cds_reject |
pilot_xds_config_size_bytes_bucket |
pilot_xds_eds_reject |
pilot_xds_lds_reject |
pilot_xds_push_context_errors |
pilot_xds_push_time_bucket |
pilot_xds_pushes |
pilot_xds_rds_reject |
pilot_xds_send_time_bucket |
pilot_xds_write_timeout |
sidecar_injection_failure_total |
sidecar_injection_requests_total |
sidecar_injection_success_total |
Prerequisites
None.
Installation
Installing an exporter is not required for this integration.
Monitoring and Troubleshooting Istio
This section summarizes the alarms and dashboards for Istio services. Istio services are built on network rules, so the alarms and dashboards monitor problems related to traffic and connections between source and destination.
Alarms
Most of the alarms associated with Istio configuration report problems with the Pilot or Citadel server. These servers are responsible for important Istio configuration.
Citadel controls authentication and identity management between services, and manages certificates in every workload.
Pilot accepts the traffic behavior rules provided by the control plane and converts them into configurations applied by Envoy, based on how configuration aspects are managed locally. Basically, Pilot is responsible for iptables configuration in every workload.
CSR Without Success
This alarm notifies you of failed Certificate Signing Requests (CSRs). It uses the following metrics:
citadel_server_csr_count
citadel_server_success_cert_issuance_count
rate(citadel_server_csr_count[5m]) - rate(citadel_server_success_cert_issuance_count[5m]) > 0
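As a rough illustration, the expression above could be wrapped in a Prometheus-format alerting rule along the lines of the sketch below. The group name, alert name, `for` duration, severity label, and summary text are assumptions chosen for the example, not values taken from the integration.

```yaml
# Sketch of a Prometheus alerting rule; all names and durations are illustrative.
groups:
  - name: istio-citadel-example                  # hypothetical group name
    rules:
      - alert: IstioCitadelCsrWithoutSuccess     # hypothetical alert name
        # Fires when CSRs arrive faster than certificates are issued.
        expr: |
          rate(citadel_server_csr_count[5m]) -
          rate(citadel_server_success_cert_issuance_count[5m]) > 0
        for: 10m                                 # assumed hold time to reduce flapping
        labels:
          severity: warning                      # assumed severity
        annotations:
          summary: Some Certificate Signing Requests did not complete successfully
```

A longer `for` value trades slower notification for fewer alerts on short-lived gaps between requests and issuances.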
What is a CSR: A Certificate Signing Request (CSR) is one of the first steps towards getting your own SSL/TLS certificate. Generated on the same server you plan to install the certificate on, the CSR contains information such as common name, organization, and country. The Certificate Authority (CA) uses the CSR to create your certificate. The CSR also contains the public key that will be included in your certificate and is signed with the corresponding private key.
Inbound Listener Rules Conflicts
Istio works with networking rules and configures IP addresses, ports, sockets, and so on to send or receive traffic. The term listeners refers to these configurable values. Be aware of possible errors or conflicts with these rules.
pilot_conflict_inbound_listener > 0
Endpoint Found in Unready State
To keep the platform stable, verify that all endpoints in your network are working properly. Use the following alarm to collect that information:
pilot_endpoint_not_ready > 0
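The two Pilot checks above (inbound listener conflicts and unready endpoints) are both simple gauge thresholds, so they could share one rule group. The following is a minimal sketch in Prometheus rule format; the group name, alert names, `for` durations, and severity labels are hypothetical.

```yaml
# Sketch only: names, durations, and severities are assumptions.
groups:
  - name: istio-pilot-example                      # hypothetical group name
    rules:
      - alert: IstioPilotInboundListenerConflict   # hypothetical alert name
        # Any conflicting inbound listener reported by Pilot.
        expr: pilot_conflict_inbound_listener > 0
        for: 5m                                    # assumed hold time
        labels:
          severity: warning                        # assumed severity
      - alert: IstioPilotEndpointNotReady          # hypothetical alert name
        # Any endpoint that Pilot reports as not ready.
        expr: pilot_endpoint_not_ready > 0
        for: 5m                                    # assumed hold time
        labels:
          severity: warning                        # assumed severity
```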
Unstable Requests for Sidecar Injections
Istio configures a sidecar container in every pod and uses this sidecar as the frontend server for all requests that go to or from that workload. To check whether sidecar injection is working properly, use the following query:
rate(sidecar_injection_requests_total[5m]) - rate(sidecar_injection_success_total[5m]) > 0
Dashboards
Traffic
Traffic is the first golden signal to gather. Because Istio manages traffic itself, the information it provides is detailed. Istio has three parts that you can monitor, each with its own metrics: the control plane, Envoy, and the service itself.
This example shows how to gather information about Istio service traffic.
Use the istio_requests_total metric with the relevant labels to collect a wide range of information on different panels.
Client Request Volume and Server Request Volume
The istio_requests_total metric shows the total request traffic from both sides of the connection, using the reporter label.
The reporter label identifies the reporter of the request. It is set to destination if the report is from an Istio server proxy, and to source if the report is from an Istio client proxy or a gateway.
sum (irate(istio_requests_total{reporter="source"}[5m]))
sum (irate(istio_requests_total{reporter="destination"}[5m]))
Incoming Request by Source/Destination and Response Code
This dashboard shows the requests reported by both source and destination, using the reporter label. The following queries segment the traffic by HTTP status code with the response_code label.
sum(irate(istio_requests_total{reporter="source"}[5m])) by (source_workload, source_workload_namespace, response_code)
sum(irate(istio_requests_total{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, response_code)
Client/Server Success Rate (non-5xx responses)
The following queries build a dashboard that shows the percentage of requests that did not return a 5xx (server error) response. The reporter label is used to segment on both source and destination.
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="destination"}[5m])))
100*(sum by (destination_service_name)(irate(istio_requests_total{reporter="source",response_code!~"5.*"}[5m])) / sum by (destination_service_name)(irate(istio_requests_total{reporter="source"}[5m])))
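The same success-rate expression could also drive an alert. The sketch below uses the server-side (destination) query and fires when the rate drops below a threshold; the 95 percent threshold, the names, the `for` duration, and the severity label are hypothetical values picked for illustration.

```yaml
# Sketch only: threshold, names, and duration are assumptions.
groups:
  - name: istio-success-rate-example             # hypothetical group name
    rules:
      - alert: IstioServerSuccessRateLow         # hypothetical alert name
        # Percentage of non-5xx responses per destination service.
        expr: |
          100 * (
            sum by (destination_service_name) (irate(istio_requests_total{reporter="destination",response_code!~"5.*"}[5m]))
            /
            sum by (destination_service_name) (irate(istio_requests_total{reporter="destination"}[5m]))
          ) < 95
        for: 10m                                 # assumed hold time
        labels:
          severity: warning                      # assumed severity
```

Because the expression keeps the destination_service_name label, the alert fires once per affected service rather than once for the whole mesh.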
Errors
The errors summarized in these dashboards relate to HTTP traffic managed by Istio proxies.
4xx Response Code by Source/Destination
The following queries build a dashboard that reports all client error (4xx) responses. They use the reporter label for both source and destination.
sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"4.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"4.*"}) -1
5xx Response Code by Source/Destination
The following queries build a dashboard that shows all server error (5xx) responses. They use the reporter label for both source and destination.
sum by (response_code)(irate(istio_requests_total{reporter="source", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="source", response_code=~"5.*"}) -1
sum by (response_code)(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) or absent(istio_requests_total{reporter="destination", response_code=~"5.*"}) -1
Latency and Saturation
Latency and saturation are reported together on these dashboards because both relate to request duration and payload size.
Client/Server Request Duration
The following queries build a dashboard that shows request duration using quantiles. The result is divided by 1000 to convert milliseconds to seconds.
Note: The quantile can be modified.
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m])) by (le, source_service_name)) / 1000
histogram_quantile(0.50, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000
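To illustrate changing the quantile, the sketch below expresses the server-side query at the 90th and 99th percentiles as Prometheus recording rules. It assumes your backend supports recording rules; the group name and the rule names are hypothetical and only follow the common level:metric:operation naming pattern.

```yaml
# Sketch only: rule names are assumptions; only the quantile differs from the query above.
groups:
  - name: istio-latency-example                            # hypothetical group name
    rules:
      # 90th percentile server-side request duration, in seconds.
      - record: istio:request_duration_seconds_server:p90  # hypothetical rule name
        expr: |
          histogram_quantile(0.90, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000
      # 99th percentile server-side request duration, in seconds.
      - record: istio:request_duration_seconds_server:p99  # hypothetical rule name
        expr: |
          histogram_quantile(0.99, sum(irate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])) by (le, destination_service_name)) / 1000
```

Higher quantiles surface tail latency that the median hides, at the cost of noisier values on low-traffic services.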
Incoming Request Size by Source/Destination
The following queries build a dashboard that shows request size using quantiles.
Note: The quantile can be modified.
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))
Response Size By Source/Destination
The following queries build a dashboard that shows response size using quantiles.
Note: The quantile can be modified.
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="source"}[5m])) by (source_workload, source_workload_namespace, le))
histogram_quantile(0.50, sum(irate(istio_response_bytes_bucket{reporter="destination"}[5m])) by (destination_workload, destination_workload_namespace, le))
Agent Configuration
The default agent job for this integration is as follows:
- job_name: 'istiod'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (discovery);(.{0}$)
    replacement: istiod
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "istiod"
  - action: replace
    source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (citadel_server_csr_count|citadel_server_success_cert_issuance_count|galley_validation_failed|galley_validation_passed|istiod_uptime_seconds|pilot_conflict_inbound_listener|pilot_conflict_outbound_listener_http_over_current_tcp|pilot_conflict_outbound_listener_tcp_over_current_http|pilot_conflict_outbound_listener_tcp_over_current_tcp|pilot_endpoint_not_ready|pilot_services|pilot_total_xds_internal_errors|pilot_total_xds_rejects|pilot_virt_services|pilot_xds|pilot_xds_cds_reject|pilot_xds_config_size_bytes_bucket|pilot_xds_eds_reject|pilot_xds_lds_reject|pilot_xds_push_context_errors|pilot_xds_push_time_bucket|pilot_xds_pushes|pilot_xds_rds_reject|pilot_xds_send_time_bucket|pilot_xds_write_timeout|sidecar_injection_failure_total|sidecar_injection_requests_total|sidecar_injection_success_total|istio_build|istio_request_bytes_bucket|istio_request_duration_milliseconds_bucket|istio_requests_total|istio_response_bytes_bucket|istio_tcp_received_bytes_total|istio_tcp_sent_bytes_total|pilot_proxy_convergence_time_bucket)
    action: keep
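The relabel rules in this job read several pod annotations. The fragment below is a sketch of pod metadata that the job would keep and scrape. The annotation keys are inferred from the `__meta_kubernetes_pod_annotation_*` label names in the job (Prometheus replaces unsupported characters in annotation names with underscores), and port 15014 is an assumption about where istiod exposes its metrics in your installation.

```yaml
# Fragment of a pod template; only the annotations matter for this job.
metadata:
  annotations:
    # Keeps the pod in the istiod job even if its container is not named "discovery".
    promcat.sysdig.com/integration_type: "istiod"
    # Scheme used for scraping; the job accepts http or https.
    prometheus.io/scheme: "http"
    # Port that replaces the one in __address__ (assumed istiod monitoring port).
    prometheus.io/port: "15014"
    # Uncomment to make the job drop this pod.
    # promcat.sysdig.com/omit: "true"
```

Pods whose container is named discovery and that carry no integration_type annotation are matched automatically by the job, so the first annotation is only needed in other cases.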