Cassandra
This integration is enabled by default.
Versions supported: > v3.x
This integration uses a sidecar exporter that is available with a UBI or scratch base image.
This integration has 30 metrics.
Timeseries generated: The JMX-Exporter generates ~850 timeseries (the exact number depends on the number of keyspaces and tables).
List of Alerts
Alert | Description | Format |
---|---|---|
[Cassandra] Compaction Task Pending | There are many Cassandra compaction tasks pending. | Prometheus |
[Cassandra] Commitlog Pending Tasks | There are many Cassandra Commitlog tasks pending. | Prometheus |
[Cassandra] Compaction Executor Blocked Tasks | There are many Cassandra compaction executor blocked tasks. | Prometheus |
[Cassandra] Flush Writer Blocked Tasks | There are many Cassandra flush writer blocked tasks. | Prometheus |
[Cassandra] Storage Exceptions | There are storage exceptions in Cassandra node. | Prometheus |
[Cassandra] High Tombstones Scanned | There is a high number of tombstones scanned. | Prometheus |
[Cassandra] JVM Heap Memory | High JVM Heap Memory. | Prometheus |
List of Dashboards
Cassandra
The dashboard provides information on the status of Cassandra.
List of Metrics
Metric name |
---|
cassandra_bufferpool_misses_total |
cassandra_bufferpool_size_total |
cassandra_client_connected_clients |
cassandra_client_request_read_latency |
cassandra_client_request_read_timeouts |
cassandra_client_request_read_unavailables |
cassandra_client_request_write_latency |
cassandra_client_request_write_timeouts |
cassandra_client_request_write_unavailables |
cassandra_commitlog_completed_tasks |
cassandra_commitlog_pending_tasks |
cassandra_commitlog_total_size |
cassandra_compaction_compacted_bytes_total |
cassandra_compaction_completed_tasks |
cassandra_compaction_pending_tasks |
cassandra_cql_prepared_statements_executed_total |
cassandra_cql_regular_statements_executed_total |
cassandra_dropped_messages_mutation |
cassandra_dropped_messages_read |
cassandra_jvm_gc_collection_count |
cassandra_jvm_gc_duration_seconds |
cassandra_jvm_memory_usage_max_bytes |
cassandra_jvm_memory_usage_used_bytes |
cassandra_storage_internal_exceptions_total |
cassandra_storage_load_bytes_total |
cassandra_table_read_requests_per_second |
cassandra_table_tombstoned_scanned |
cassandra_table_total_disk_space_used |
cassandra_table_write_requests_per_second |
cassandra_threadpool_blocked_tasks_total |
Prerequisites
Create ConfigMap for the JMX-Exporter
The JMX-Exporter requires a ConfigMap with the Cassandra JMX configuration, which you can install with a single Helm command. The following example is for a Cassandra cluster that exposes JMX port 7199 and is deployed in the ‘cassandra’ namespace (modify the JMX port and the namespace as needed):
helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts
helm repo update
helm -n cassandra install cassandra-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=7199 --set integrationType=cassandra --set onlyCreateJMXConfigMap=true
Installation
An automated wizard is present in the Monitoring Integrations in Sysdig Monitor. Expert users can also use the Helm chart for installation: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/jmx-exporter
Monitoring and Troubleshooting Cassandra
Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.
General Stats
Node Down
Let’s compare the expected number of nodes with the number of nodes that are actually up and running. If the numbers differ, there might be a problem.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0
Dropped Messages
Dropped Messages Mutation
If there are dropped mutation messages, writes or reads are probably failing due to timeouts.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)
Dropped Messages Read
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)
Buffer Pool
Buffer Pool Size
The buffer pool is allocated off-heap, in addition to the memory allocated for the heap, and memory is allocated only when needed. Check whether the miss rate is high.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)
Buffer Pool Misses
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)
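The misses metric is a running total, so its raw value says little about the current miss rate. The following is a minimal sketch of a miss-rate query. It assumes `cassandra_bufferpool_misses_total` behaves as a monotonically increasing counter, and it reuses the `$__interval` variable from the other rate queries on this page:

```promql
# Buffer pool misses per second, per workload
# (assumption: the metric is a monotonically increasing counter)
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_bufferpool_misses_total[$__interval]))
```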
CQL Statements
CQL Prepared Statements
Use prepared statements (query with bound variables) as they are more secure and can be cached.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))
CQL Regular Statements
This value should be as low as possible if you are looking for good performance.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
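To see how much of the CQL traffic already uses prepared statements, the two rates above can be combined into a single ratio. This is only a sketch built from the two metrics listed on this page. The result is a value between 0 and 1, and it is NaN for intervals in which no statements were executed:

```promql
# Fraction of executed CQL statements that were prepared statements (0 to 1)
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))
/
(
  sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))
  +
  sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
)
```

A value close to 1 means that almost all statements are prepared.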
Connected Clients
The number of current client connections in each node.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)
Client Request Latency
Write Latency
95th percentile client request write latency.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})
Read Latency
95th percentile client request read latency.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})
Unavailable Exceptions
Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.
Read Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))
Write Unavailable Exceptions
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))
Client Request Timeouts
Write / read request timeouts in Cassandra nodes. If there are timeouts:
1. Check whether the ‘read_request_timeout_in_ms’ value in cassandra.yaml is too low.
2. Check for tombstones, which can degrade performance. The following query shows the tombstones scanned per table:
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
Client Request Read Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))
Client Request Write Timeout
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))
Threadpool Blocked Tasks
Compaction Blocked Tasks
Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))
Flush Writer Blocked Tasks
The flush writer setting defines the number of parallel writes to disk. The number of blocked tasks should be near 0. If you are using SSD disks, check that your “memtable_flush_writers” value matches your number of cores.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))
Compactions
Pending Compactions
Compactions that are queued. This value should be as low as possible. If it exceeds 50, you may start to see CPU and memory pressure.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)
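To list only the workloads that are above that level, a comparison can be appended to the query. This is a sketch that uses the value of 50 mentioned above as an illustrative threshold. It is not necessarily the threshold used by the bundled alert:

```promql
# Workloads with more than 50 pending compaction tasks
# (50 is the guideline value from the text; tune it for your cluster)
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks) > 50
```

The query returns no series while every workload stays at or below the threshold.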
Total Size Compacted
Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))
Commit Log
Commit Log Pending Tasks
This value should be under 15-20 for performance purposes.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)
Storage
Storage Exceptions
Look carefully at this value as any storage error over 0 is critical for Cassandra.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)
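Because the metric is a running total, a workload that hit a single exception long ago keeps reporting a non-zero value. The following sketch looks only at new exceptions within the interval. It assumes `cassandra_storage_internal_exceptions_total` behaves as a counter:

```promql
# Workloads with new storage exceptions in the selected interval
# (assumption: the metric is a monotonically increasing counter)
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(increase(cassandra_storage_internal_exceptions_total[$__interval])) > 0
```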
JVM and GC
JVM Heap Usage
If you want to tune your Heap memory you can use this query.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})
If you want to know the maximum heap memory you can use this query.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})
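The two heap queries above can be divided to obtain heap utilization as a ratio, which is often easier to read than absolute byte values. This is a minimal sketch. It assumes the exporter reports a positive maximum for the `Heap` area, since a JVM can report -1 when no maximum is defined:

```promql
# Heap utilization per workload, from 0 to 1
# (assumption: the max_bytes series is positive for area="Heap")
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})
/
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})
```

Both sides are aggregated over the same labels, so the division matches the series one to one.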
JVM NonHeap Usage
Use this query for NonHeap memory.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})
GC Info
If there is memory pressure the max GC duration will start increasing.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)
Keyspaces and Tables
Keyspace Size
This query gives you the total disk space used across all keyspaces.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_table_total_disk_space_used)
Table Size
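This query gives you the disk space used by each table.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_total_disk_space_used)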
Table Highest Increase Size
Useful for identifying which tables are growing fastest.
topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))
Tombstones Scanned
Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.
A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
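To focus on the worst offenders, the same query can be filtered and ranked. This sketch uses the value of 1000 mentioned above as an illustrative threshold and compares it with whatever value the exporter reports for `cassandra_table_tombstoned_scanned`:

```promql
# Up to 10 tables with the most tombstones scanned, limited to values above 1000
# (1000 is the guideline value from the text; tune it for your workload)
topk(10, sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned) > 1000)
```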
Agent Configuration
The default agent job for this integration is as follows:
- job_name: 'cassandra-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - source_labels: [__meta_kubernetes_pod_phase]
    action: keep
    regex: Running
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (cassandra-exporter);(.{0}$)
    replacement: cassandra
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "cassandra"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (cassandra_bufferpool_misses_total|cassandra_bufferpool_size_total|cassandra_client_connected_clients|cassandra_client_request_read_latency|cassandra_client_request_read_timeouts|cassandra_client_request_read_unavailables|cassandra_client_request_write_latency|cassandra_client_request_write_timeouts|cassandra_client_request_write_unavailables|cassandra_commitlog_completed_tasks|cassandra_commitlog_pending_tasks|cassandra_commitlog_total_size|cassandra_compaction_compacted_bytes_total|cassandra_compaction_completed_tasks|cassandra_compaction_pending_tasks|cassandra_cql_prepared_statements_executed_total|cassandra_cql_regular_statements_executed_total|cassandra_dropped_messages_mutation|cassandra_dropped_messages_read|cassandra_jvm_gc_collection_count|cassandra_jvm_gc_duration_seconds|cassandra_jvm_memory_usage_max_bytes|cassandra_jvm_memory_usage_used_bytes|cassandra_storage_internal_exceptions_total|cassandra_storage_load_bytes_total|cassandra_table_read_requests_per_second|cassandra_table_tombstoned_scanned|cassandra_table_total_disk_space_used|cassandra_table_write_requests_per_second|cassandra_threadpool_blocked_tasks_total)
    action: keep