Cassandra

This integration is enabled by default.

List of Alerts:

Alert | Description | Format
[Cassandra] Compaction Task Pending | There are many Cassandra compaction tasks pending. | Prometheus
[Cassandra] Commitlog Pending Tasks | There are many Cassandra Commitlog tasks pending. | Prometheus
[Cassandra] Compaction Executor Blocked Tasks | There are many Cassandra compaction executor blocked tasks. | Prometheus
[Cassandra] Flush Writer Blocked Tasks | There are many Cassandra flush writer blocked tasks. | Prometheus
[Cassandra] Storage Exceptions | There are storage exceptions in Cassandra node. | Prometheus
[Cassandra] High Tombstones Scanned | There is a high number of tombstones scanned. | Prometheus
[Cassandra] JVM Heap Memory | High JVM Heap Memory. | Prometheus

List of Dashboards:

  • Cassandra

List of Metrics:

  • cassandra_bufferpool_misses_total
  • cassandra_bufferpool_size_total
  • cassandra_client_connected_clients
  • cassandra_client_request_read_latency
  • cassandra_client_request_read_timeouts
  • cassandra_client_request_read_unavailables
  • cassandra_client_request_write_latency
  • cassandra_client_request_write_timeouts
  • cassandra_client_request_write_unavailables
  • cassandra_commitlog_completed_tasks
  • cassandra_commitlog_pending_tasks
  • cassandra_commitlog_total_size
  • cassandra_compaction_compacted_bytes_total
  • cassandra_compaction_completed_tasks
  • cassandra_compaction_pending_tasks
  • cassandra_cql_prepared_statements_executed_total
  • cassandra_cql_regular_statements_executed_total
  • cassandra_dropped_messages_mutation
  • cassandra_dropped_messages_read
  • cassandra_jvm_gc_collection_count
  • cassandra_jvm_gc_duration_seconds
  • cassandra_jvm_memory_usage_max_bytes
  • cassandra_jvm_memory_usage_used_bytes
  • cassandra_storage_internal_exceptions_total
  • cassandra_storage_load_bytes_total
  • cassandra_table_read_requests_per_second
  • cassandra_table_tombstoned_scanned
  • cassandra_table_total_disk_space_used
  • cassandra_table_write_requests_per_second
  • cassandra_threadpool_blocked_tasks_total

Monitoring and troubleshooting Cassandra

Here are some interesting metrics and queries to monitor and troubleshoot Cassandra.

General stats

Node Down:

Compare the expected number of nodes with the number of nodes that are actually up and running. If the two numbers differ, there might be a problem.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_desired)
-
sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(kube_workload_status_ready)
> 0

Dropped Messages

Dropped messages mutation

If there are dropped mutation messages then we probably have write/read failures due to timeouts.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_mutation)

Dropped messages read

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_dropped_messages_read)

Buffer Pool

Buffer Pool size

The buffer pool is allocated off-heap, in addition to the memory allocated for the heap. Memory is allocated when needed. Check whether the miss rate is high.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_size_total)

Buffer pool misses

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_bufferpool_misses_total)
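The query above returns the raw miss count. To watch the miss rate instead, a minimal sketch follows, assuming cassandra_bufferpool_misses_total is a monotonically increasing counter (as its _total suffix suggests) and reusing the $__interval variable from the other queries on this page:

```promql
# Per-second buffer pool miss rate per workload.
# Assumption: the metric behaves as a counter, so rate() is meaningful.
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  rate(cassandra_bufferpool_misses_total[$__interval])
)
```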

CQL Statements

CQL Prepared statements

Use prepared statements (query with bound variables) as they are more secure and can be cached.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_prepared_statements_executed_total[$__interval]))

CQL Regular statements

This value should be as low as possible if you are looking for good performance.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_cql_regular_statements_executed_total[$__interval]))
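To see how much of the CQL traffic already uses prepared statements, the two rates can be combined into a single ratio. This is only a sketch built from the two counters above; a value close to 1 would mean that almost all statements are prepared:

```promql
# Share of executed CQL statements that were prepared (0 to 1).
# Both sides aggregate to the same labels, so the vectors match one-to-one.
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  rate(cassandra_cql_prepared_statements_executed_total[$__interval])
)
/
(
  sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
    rate(cassandra_cql_prepared_statements_executed_total[$__interval])
  )
  +
  sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
    rate(cassandra_cql_regular_statements_executed_total[$__interval])
  )
)
```

The result is empty for workloads where neither kind of statement was executed during the interval, since the division is then 0/0.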

Connected clients

The number of current client connections in each node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_connected_clients)

Client Request Latency

Write Latency

95th percentile client request write latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_write_latency{quantile="0.95"})

Read Latency

95th percentile client request read latency.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_client_request_read_latency{quantile="0.95"})
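Note that sum adds the 95th percentile values of every series in the workload together, so the result grows with the number of nodes. If the goal is to see the slowest node instead, one possible variant (an assumption about intent, not part of the original dashboard) is to aggregate with max:

```promql
# Worst-case p95 read latency across the series of each workload.
# Swap in cassandra_client_request_write_latency for the write side.
max by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  cassandra_client_request_read_latency{quantile="0.95"}
)
```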

Unavailable Exceptions

Number of exceptions encountered in regular reads / writes. This number should be near 0 in a healthy cluster.

Read unavailable exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_unavailables[$__interval]))

Write unavailable exceptions

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_unavailables[$__interval]))

Client Request timeouts

Write / read request timeouts in Cassandra nodes. If there are timeouts, check for:

1.- The ‘read_request_timeout_in_ms’ value in cassandra.yaml, in case it is too low.
2.- Tombstones, which can degrade performance. The query below shows the tombstones scanned per table.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)

Client request read timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_read_timeouts[$__interval]))

Client request write timeout

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_client_request_write_timeouts[$__interval]))

Threadpool blocked tasks

Compaction blocked tasks

Pending compactions that are blocked. This metric could deviate from “pending compactions” which includes an estimate of tasks that these pending tasks might create after completion.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="CompactionExecutor"}[$__interval]))

Flush writer blocked tasks

The flush writer setting defines the number of parallel writes to disk. The number of blocked tasks should be near 0. If you are using SSD disks, check that your “memtable_flush_writers” value matches your number of cores.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_threadpool_blocked_tasks_total{pool="MemtableFlushWriter"}[$__interval]))

Compactions

Pending Compactions

Compactions that are queued. This value should be as low as possible. If it exceeds 50, you can start to see CPU and memory pressure.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_compaction_pending_tasks)
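The threshold of 50 mentioned above can be applied directly in the query, so that only workloads over the limit are returned. A minimal sketch; the threshold is the one from the text and may need tuning for your cluster:

```promql
# Return only workloads with more than 50 pending compaction tasks.
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  cassandra_compaction_pending_tasks
) > 50
```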

Total Size compacted

Cassandra triggers minor compactions automatically so the compacted size should be low unless you trigger a major compaction across the node.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(rate(cassandra_compaction_compacted_bytes_total[$__interval]))

Commit Log

Commit Log pending tasks

This value should be under 15-20 for performance purposes.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_commitlog_pending_tasks)

Storage

Storage Exceptions

Look carefully at this value as any storage error over 0 is critical for Cassandra.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_storage_internal_exceptions_total)
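If cassandra_storage_internal_exceptions_total is a counter, its raw value stays above 0 after the first exception until the node restarts. To catch only new exceptions, a sketch like the following could be used; both the counter assumption and the one-hour window are illustrative choices, not values from the integration:

```promql
# Workloads that logged at least one new storage exception in the last hour.
# Assumption: the metric is a counter; 1h is an arbitrary example window.
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  increase(cassandra_storage_internal_exceptions_total[1h])
) > 0
```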

JVM and GC

JVM Heap Usage

If you want to tune your Heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="Heap"})

If you want to know the maximum heap memory you can use this query.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_max_bytes{area="Heap"})
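Dividing the two queries above gives heap usage as a fraction of the maximum, which is usually easier to read than absolute bytes. A minimal sketch using only the two metrics shown:

```promql
# Heap usage as a fraction of the configured maximum (0 to 1).
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  cassandra_jvm_memory_usage_used_bytes{area="Heap"}
)
/
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name) (
  cassandra_jvm_memory_usage_max_bytes{area="Heap"}
)
```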

JVM NonHeap usage

Use this query for NonHeap memory.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_memory_usage_used_bytes{area="NonHeap"})

GC Info

If there is memory pressure the max GC duration will start increasing.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name)(cassandra_jvm_gc_duration_seconds)

Keyspaces and Tables

Keyspace Size

This query gives you the disk space used by each keyspace.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace)(cassandra_table_total_disk_space_used)

Table Size

This query gives you information of all tables.
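No query accompanies this description in the text. A plausible form, assuming it follows the same pattern as the other table-level queries on this page and groups by the keyspace and table labels:

```promql
# Disk space used per table (assumed form; the original query is missing).
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, keyspace, table) (
  cassandra_table_total_disk_space_used
)
```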

Table highest increase size

Useful for finding the tables that are growing fastest.

topk(10,sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(delta(cassandra_table_total_disk_space_used[$__interval])))

Tombstones scanned

Cassandra does not delete data from disk at once. Instead, it writes a tombstone with a value that indicates the data has been deleted.

A high value (more than 1000) can cause GC pauses, latency and read failures. Sometimes you need to issue a manual compaction from nodetool.

sum by (kube_cluster_name,kube_namespace_name, kube_workload_name,keyspace,table)(cassandra_table_tombstoned_scanned)
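To list only the tables above the threshold of 1000 mentioned earlier, the comparison can be added to the query. A minimal sketch; the threshold comes from the text and may need adjusting for your workload:

```promql
# Tables whose scanned tombstones exceed 1000.
sum by (kube_cluster_name, kube_namespace_name, kube_workload_name, keyspace, table) (
  cassandra_table_tombstoned_scanned
) > 1000
```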