Cassandra

Metrics, Dashboards, Alerts and more for Cassandra Integration in Sysdig Monitor.

This integration is enabled by default.

Versions supported: > v3.x

This integration uses a sidecar exporter that is available with either a UBI or a scratch base image.

This integration has 30 metrics.

Timeseries generated: The JMX-Exporter generates ~850 timeseries (the exact number depends on the number of keyspaces and tables).

List of Alerts

| Alert | Description | Format |
|---|---|---|
| [Cassandra] Compaction Task Pending | There are many Cassandra compaction tasks pending. | Prometheus |
| [Cassandra] Commitlog Pending Tasks | There are many Cassandra Commitlog tasks pending. | Prometheus |
| [Cassandra] Compaction Executor Blocked Tasks | There are many Cassandra compaction executor blocked tasks. | Prometheus |
| [Cassandra] Flush Writer Blocked Tasks | There are many Cassandra flush writer blocked tasks. | Prometheus |
| [Cassandra] Storage Exceptions | There are storage exceptions in Cassandra node. | Prometheus |
| [Cassandra] High Tombstones Scanned | There is a high number of tombstones scanned. | Prometheus |
| [Cassandra] JVM Heap Memory | High JVM Heap Memory. | Prometheus |

List of Dashboards

Cassandra

The dashboard provides information on the status of Cassandra.

List of Metrics

Metric name
cassandra_bufferpool_misses_total
cassandra_bufferpool_size_total
cassandra_client_connected_clients
cassandra_client_request_read_latency
cassandra_client_request_read_timeouts
cassandra_client_request_read_unavailables
cassandra_client_request_write_latency
cassandra_client_request_write_timeouts
cassandra_client_request_write_unavailables
cassandra_commitlog_completed_tasks
cassandra_commitlog_pending_tasks
cassandra_commitlog_total_size
cassandra_compaction_compacted_bytes_total
cassandra_compaction_completed_tasks
cassandra_compaction_pending_tasks
cassandra_cql_prepared_statements_executed_total
cassandra_cql_regular_statements_executed_total
cassandra_dropped_messages_mutation
cassandra_dropped_messages_read
cassandra_jvm_gc_collection_count
cassandra_jvm_gc_duration_seconds
cassandra_jvm_memory_usage_max_bytes
cassandra_jvm_memory_usage_used_bytes
cassandra_storage_internal_exceptions_total
cassandra_storage_load_bytes_total
cassandra_table_read_requests_per_second
cassandra_table_tombstoned_scanned
cassandra_table_total_disk_space_used
cassandra_table_write_requests_per_second
cassandra_threadpool_blocked_tasks_total

Preparing the Integration

Create ConfigMap for the JMX-Exporter

The JMX-Exporter requires a ConfigMap with the Cassandra JMX configuration, which can be installed with a single Helm command. The following example is for a Cassandra cluster that exposes JMX port 7199 and is deployed in the ‘cassandra’ namespace (modify the JMX port and the namespace as needed):

helm repo add promcat-charts https://sysdiglabs.github.io/integrations-charts 
helm repo update
helm -n cassandra install cassandra-jmx-exporter promcat-charts/jmx-exporter --set jmx_port=7199 --set integrationType=cassandra --set onlyCreateJMXConfigMap=true

Installing

An automated wizard is available under Monitoring Integrations in Sysdig Monitor. Expert users can also use the Helm chart directly: https://github.com/sysdiglabs/integrations-charts/tree/main/charts/jmx-exporter

Monitoring and Troubleshooting Cassandra

Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.

General Stats

Node Down

Let’s compare the expected number of nodes with the number of nodes that are actually up and running. If the two numbers differ, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0
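
As an illustration only, the arithmetic of the query above can be mimicked in a few lines of Python. The sample values and the helper names (`sum_by`, `workloads_missing_nodes`) are hypothetical; in practice the evaluation happens in PromQL, not in client code.

```python
from collections import defaultdict

def sum_by(samples):
    """Sum values per (cluster, namespace, workload) key, like `sum by (...)`."""
    totals = defaultdict(float)
    for key, value in samples:
        totals[key] += value
    return dict(totals)

def workloads_missing_nodes(desired, ready):
    """Return desired - ready per workload, keeping only results > 0."""
    d, r = sum_by(desired), sum_by(ready)
    # Like a PromQL binary operation, only keys present on both sides are kept.
    return {k: d[k] - r[k] for k in d if k in r and d[k] - r[k] > 0}

# Hypothetical samples: one workload is short by a node, the other is healthy.
desired = [(("prod", "cassandra", "cassandra-a"), 3), (("prod", "cassandra", "cassandra-b"), 3)]
ready = [(("prod", "cassandra", "cassandra-a"), 2), (("prod", "cassandra", "cassandra-b"), 3)]
missing = workloads_missing_nodes(desired, ready)
```

An empty result means every workload has as many ready replicas as desired, which is why the query can be used directly as an alert condition.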

Dropped Messages

Dropped Messages Mutation

If there are dropped mutation messages then we probably have write/read failures due to timeouts.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)

Dropped Messages Read

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)

Buffer Pool

Buffer Pool Size

This buffer is allocated as off-heap in addition to the memory allocated for heap. Memory is allocated when needed. Check if miss rate is high.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)

Buffer Pool Misses

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)

CQL Statements

CQL Prepared Statements

Use prepared statements (query with bound variables) as they are more secure and can be cached.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))

CQL Regular Statements

This value should be as low as possible if you are looking for good performance.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
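
Both queries above wrap a counter in `rate()`. The following is a deliberately simplified Python sketch of what that means: the per-second increase over a window, with counter resets counted from zero. The function name `simple_rate` and the sample data are hypothetical, and the real PromQL `rate()` additionally extrapolates to the window boundaries, so its results can differ slightly.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), ordered by time.
    Returns None when there are fewer than two samples.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop in value is treated as a counter reset (e.g. a node restart).
        increase += value if value < prev else value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None

# Hypothetical counter scraped every 30 seconds.
steady = simple_rate([(0, 100), (30, 130), (60, 160)])
with_reset = simple_rate([(0, 100), (30, 10), (60, 40)])
```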

Connected Clients

The number of current client connections in each node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)

Client Request Latency

Write Latency

95th percentile client request write latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})

Read Latency

95th percentile client request read latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})

Unavailable Exceptions

Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.

Read Unavailable Exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))

Write Unavailable Exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))

Client Request Timeouts

Write / read request timeouts in Cassandra nodes. If there are timeouts, check for:

1. The ‘read_request_timeout_in_ms’ value in cassandra.yaml, in case it is too low.
2. Tombstones, which can degrade performance. The following query shows the tombstones scanned per table:

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)

Client Request Read Timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))

Client Request Write Timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))

Threadpool Blocked Tasks

Compaction Blocked Tasks

Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))

Flush Writer Blocked Tasks

The flush writers setting defines the number of parallel writes to disk. The number of blocked tasks should be near 0. If you are using SSD disks, check that your “memtable_flush_writers” value matches your number of cores.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))

Compactions

Pending Compactions

Compactions that are queued. This value should be as low as possible. If it exceeds 50, you may start to see CPU and memory pressure.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)

Total Size Compacted

Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))

Commit Log

Commit Log Pending Tasks

This value should be under 15-20 for performance purposes.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)

Storage

Storage Exceptions

Look carefully at this value as any storage error over 0 is critical for Cassandra.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)

JVM and GC

JVM Heap Usage

If you want to tune your Heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})

If you want to know the maximum heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})

JVM NonHeap Usage

Use this query for NonHeap memory.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})

GC Info

If there is memory pressure the max GC duration will start increasing.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)

Keyspaces and Tables

Keyspace Size

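This query gives you the disk space used by each keyspace.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace)(cassandra_table_total_disk_space_used)

Table Size

This query gives you the disk space used by each table.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_total_disk_space_used)

Table Highest Increase Size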

This query is useful for identifying which tables are growing too fast.

topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))

Tombstones Scanned

Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.

A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
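
The rules of thumb mentioned throughout this section (pending compactions above 50, commit log pending tasks above 15-20, any storage exception, more than 1000 tombstones scanned) can be collected into one small check. The Python sketch below is only an illustration: the `THRESHOLDS` table and the `exceeded` helper are hypothetical, the upper bound of 20 is an assumption for the commit log range, and these are not necessarily the thresholds used by the shipped alerts.

```python
# Rules of thumb taken from the guidance in this section (assumed upper bounds).
THRESHOLDS = {
    "cassandra_compaction_pending_tasks": 50,
    "cassandra_commitlog_pending_tasks": 20,
    "cassandra_storage_internal_exceptions_total": 0,
    "cassandra_table_tombstoned_scanned": 1000,
}

def exceeded(values):
    """Return the sorted names of metrics whose value is above its threshold."""
    return sorted(
        name
        for name, value in values.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    )

# Hypothetical current values for one workload.
current = {
    "cassandra_compaction_pending_tasks": 75,
    "cassandra_commitlog_pending_tasks": 3,
    "cassandra_storage_internal_exceptions_total": 0,
    "cassandra_table_tombstoned_scanned": 1500,
}
flagged = exceeded(current)
```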

Agent Configuration

This is the default agent job for this integration:

- job_name: 'cassandra-default'
  tls_config:
    insecure_skip_verify: true
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: keep
    source_labels: [__meta_kubernetes_pod_host_ip]
    regex: __HOSTIPS__
  - action: drop
    source_labels: [__meta_kubernetes_pod_annotation_promcat_sysdig_com_omit]
    regex: true
  - action: replace
    source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    target_label: __scheme__
    regex: (https?)
  - action: replace
    source_labels:
    - __meta_kubernetes_pod_container_name
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: (cassandra-exporter);(.{0}$)
    replacement: cassandra
    target_label: __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
  - action: keep
    source_labels:
    - __meta_kubernetes_pod_annotation_promcat_sysdig_com_integration_type
    regex: "cassandra"
  - action: replace
    source_labels: [__meta_kubernetes_pod_uid]
    target_label: sysdig_k8s_pod_uid
  - action: replace
    source_labels: [__meta_kubernetes_pod_container_name]
    target_label: sysdig_k8s_pod_container_name
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: (cassandra_bufferpool_misses_total|cassandra_bufferpool_size_total|cassandra_client_connected_clients|cassandra_client_request_read_latency|cassandra_client_request_read_timeouts|cassandra_client_request_read_unavailables|cassandra_client_request_write_latency|cassandra_client_request_write_timeouts|cassandra_client_request_write_unavailables|cassandra_commitlog_completed_tasks|cassandra_commitlog_pending_tasks|cassandra_commitlog_total_size|cassandra_compaction_compacted_bytes_total|cassandra_compaction_completed_tasks|cassandra_compaction_pending_tasks|cassandra_cql_prepared_statements_executed_total|cassandra_cql_regular_statements_executed_total|cassandra_dropped_messages_mutation|cassandra_dropped_messages_read|cassandra_jvm_gc_collection_count|cassandra_jvm_gc_duration_seconds|cassandra_jvm_memory_usage_max_bytes|cassandra_jvm_memory_usage_used_bytes|cassandra_storage_internal_exceptions_total|cassandra_storage_load_bytes_total|cassandra_table_read_requests_per_second|cassandra_table_tombstoned_scanned|cassandra_table_total_disk_space_used|cassandra_table_write_requests_per_second|cassandra_threadpool_blocked_tasks_total)
    action: keep
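
To see which pods this job keeps, it helps to recall that Prometheus joins the values of `source_labels` with `;` by default and matches the regex against the whole joined string. The Python sketch below mimics the integration-type `replace` rule and the following `keep` rule from the job above. The function names are hypothetical, and Python's `re` module stands in for the RE2 engine that Prometheus uses, which behaves the same way for these simple patterns.

```python
import re

# Same pattern as in the job: container name, separator, then an empty annotation.
INTEGRATION_TYPE_RE = re.compile(r"(cassandra-exporter);(.{0}$)")

def resolve_integration_type(container_name, annotation=""):
    """Mimic the `replace` rule: default the annotation to 'cassandra'."""
    joined = ";".join([container_name, annotation])
    if INTEGRATION_TYPE_RE.fullmatch(joined):
        return "cassandra"
    # No match: the replace action leaves the existing value untouched.
    return annotation

def is_kept(container_name, annotation=""):
    """Mimic the `keep` rule on the resulting integration type."""
    value = resolve_integration_type(container_name, annotation)
    return re.fullmatch("cassandra", value) is not None

kept_by_container_name = is_kept("cassandra-exporter")
kept_by_annotation = is_kept("my-app", "cassandra")
dropped_other_type = is_kept("cassandra-exporter", "redis")
dropped_unannotated = is_kept("cassandra")
```

In other words, a pod is scraped either because it is explicitly annotated with the `cassandra` integration type, or because it has a container named `cassandra-exporter` and no integration-type annotation at all.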