Windows

Metrics, Dashboards, Alerts and more for Windows Integration in Sysdig Monitor.

This integration is disabled by default. Please contact Sysdig Support to enable it in your account.

This integration has 77 metrics.

List of Alerts

| Alert | Description | Format |
|---|---|---|
| [Windows] High CPU Usage | The CPU of the Windows instance reached 95% of use | Prometheus |
| [Windows] High Disk Usage | Disk full over 95% in instance {{$labels.instance}} | Prometheus |
| [Windows] High Physical Memory Usage | High physical memory usage in instance | Prometheus |
| [Windows] High Network Inbound Errors | High inbound network error rate in instance | Prometheus |
| [Windows] High Network Outbound Errors | High outbound network error rate in instance | Prometheus |
| [Windows] Increase of Disk writes time | Increase of Disk writes time | Prometheus |
| [Windows] Queue of Writes and reads Disk operations is growing | The queue for writes and reads disk operations is growing | Prometheus |
| [Windows] High percent of swap space used | The swap space has reached high amount of used | Prometheus |
| [Windows] Network bandwidth is reaching its limit | Network Bandwidth use is reaching its limit | Prometheus |
| [Windows] High number of transitions virtual addresses into disk | The rate at which pages transition to resident memory without being written to disk has reached problematic limit | Prometheus |

List of Dashboards

Windows Host Overview

The dashboard provides information about the Windows host.

Windows Process Overview

The dashboard provides information about the Windows processes.

Windows Services Overview

The dashboard provides information about the Windows services.

Windows Node Overview (Legacy)

The dashboard provides information about the Windows nodes (legacy).

List of Metrics

Metric name
windows_cpu_core_frequency_mhz
windows_cpu_time_total
windows_cs_physical_memory_bytes
windows_logical_disk_free_bytes
windows_logical_disk_read_bytes_total
windows_logical_disk_reads_total
windows_logical_disk_requests_queued
windows_logical_disk_size_bytes
windows_logical_disk_split_ios_total
windows_logical_disk_write_bytes_total
windows_logical_disk_write_seconds_total
windows_logical_disk_writes_total
windows_memory_transition_faults_total
windows_net_bytes_received_total
windows_net_bytes_sent_total
windows_net_bytes_total
windows_net_current_bandwidth_bytes
windows_net_packets_outbound_discarded_total
windows_net_packets_outbound_errors
windows_net_packets_outbound_errors_total
windows_net_packets_received_discarded_total
windows_net_packets_received_errors
windows_net_packets_received_errors_total
windows_net_packets_received_total
windows_net_packets_sent_total
windows_os_paging_free_bytes
windows_os_paging_limit_bytes
windows_os_physical_memory_free_bytes
windows_os_processes
windows_os_users
windows_os_virtual_memory_bytes
windows_os_virtual_memory_free_bytes
windows_process_cpu_time_total
windows_process_io_bytes_total
windows_process_io_operations_total
windows_process_threads
windows_process_working_set_bytes
windows_service_info
windows_service_start_mode
windows_service_state
windows_service_status
windows_system_context_switches_total
windows_system_processor_queue_length
windows_system_system_up_time
windows_system_threads
wmi_cpu_core_frequency_mhz
wmi_cpu_time_total
wmi_cs_physical_memory_bytes
wmi_logical_disk_free_bytes
wmi_logical_disk_read_bytes_total
wmi_logical_disk_reads_total
wmi_logical_disk_requests_queued
wmi_logical_disk_size_bytes
wmi_logical_disk_split_ios_total
wmi_logical_disk_write_bytes_total
wmi_logical_disk_writes_total
wmi_net_bytes_received_total
wmi_net_bytes_sent_total
wmi_net_bytes_total
wmi_net_current_bandwidth
wmi_net_packets_outbound_discarded
wmi_net_packets_outbound_errors
wmi_net_packets_received_discarded
wmi_net_packets_received_errors
wmi_net_packets_received_total
wmi_net_packets_sent_total
wmi_os_paging_free_bytes
wmi_os_paging_limit_bytes
wmi_os_physical_memory_free_bytes
wmi_os_processes
wmi_os_users
wmi_os_virtual_memory_bytes
wmi_os_virtual_memory_free_bytes
wmi_system_context_switches_total
wmi_system_processor_queue_length
wmi_system_system_up_time
wmi_system_threads

Preparing the Integration

Enable Windows Prometheus Metrics

To collect metrics from Windows VMs, you need to install the Windows exporter and the Prometheus server, which you run in agent mode.

Windows exporter

This component connects to WMI and exposes Windows metrics in Prometheus metric format.

To install this exporter:

  1. Download the latest version from the Windows Exporter repository
  2. Configure the exporter
  3. Run the exporter with $>.\exporter.exe --config.file=config.yaml

Exporter configuration

You can configure Windows exporter using the config.yaml file as follows:

# Configuration and more info
# https://github.com/prometheus-community/windows_exporter

collectors:
  enabled: cpu,cs,logical_disk,net,os,service,system,textfile,process
collector:
  textfile:
    directory: C:\Path\metrics\
#  service:
#    services-where: "Name='windows_exporter'"
log:
  level: warn

Prometheus Agent

This component collects metrics from the Windows exporter and forwards them to Prometheus Server Remote Write endpoint.

To install the agent:

  1. Download the latest version from the Prometheus repository
  2. Configure the agent
  3. Run the agent with $>.\prometheus.exe --enable-feature=agent

Agent Configuration

You can configure the Prometheus agent using the prometheus.yaml file as follows:

global:
  scrape_interval: 10s # Set the scrape interval to every 10 seconds. Default is every 1 minute.
  
remote_write:
    - url: "https://api.sysdigcloud.com/prometheus/remote/write"
      bearer_token: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      proxy_url: "https://proxy.url:port" # Set the correct proxy url

scrape_configs:
  - job_name: "windows_exporter"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9182"]
    metric_relabel_configs:
      - source_labels: [instance]
        target_label: instance
        regex: '(.*)'
        replacement: 'windows-vm-demo'
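
The metric_relabel_configs rule above overwrites the instance label of every scraped series. As a rough illustration of how such a rule behaves, here is a minimal Python sketch of the default replace action. The function name apply_replace_rule and the sample labels are hypothetical, and the sketch assumes a literal replacement string with no capture group references, which matches the configuration above.

```python
import re


def apply_replace_rule(labels, source_labels, target_label, regex,
                       replacement, separator=";"):
    """Simplified model of a relabel rule using the default 'replace' action."""
    # The source label values are joined with the separator, and the
    # regex must match the whole joined string.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    if re.fullmatch(regex, joined) is None:
        return dict(labels)  # no match: labels are left unchanged
    updated = dict(labels)
    # Assumption: the replacement is a literal string without $1-style
    # capture group references, as in the configuration above.
    updated[target_label] = replacement
    return updated


# Hypothetical series labels as scraped from the exporter.
sample = {"__name__": "windows_os_processes", "instance": "localhost:9182"}
result = apply_replace_rule(sample, ["instance"], "instance",
                            "(.*)", "windows-vm-demo")
print(result["instance"])  # prints "windows-vm-demo"
```

Because the regex '(.*)' matches any value, every series ends up with the same instance label, so each VM should use its own replacement value.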

Installing

The installation of an exporter is not required for this integration.

Monitoring and Troubleshooting Windows

This document describes important metrics and queries that you can use to monitor and troubleshoot Windows.

Windows Host Monitoring

CPU

Because CPU usage is critical, track how CPU time is split across modes. With 100 * avg by (mode) (rate(windows_cpu_time_total[5m])) you can identify which mode consumes the most processor time. One tip for this visualization is to watch the idle mode: a low idle share means the CPU is busy in the other modes.

For large machines with many cores, you can use the 100 * sum by (core) (rate(windows_cpu_time_total{mode != 'idle'}[5m])) query to check for peaks on each core and verify that the cores share the load evenly.
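
To make the arithmetic behind these queries concrete, the following Python sketch approximates 100 * rate(...) per mode from two counter samples. The function name and the sample values are hypothetical, and the sketch ignores counter resets and the extrapolation that the real rate() function performs.

```python
def cpu_percent_by_mode(first, second, elapsed_seconds):
    """Approximate 100 * rate(counter) per mode from two counter samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {
        mode: 100.0 * (second[mode] - first[mode]) / elapsed_seconds
        for mode in first
        if mode in second
    }


# Hypothetical counter values (seconds of CPU time) for one core,
# sampled 300 seconds apart.
t0 = {"idle": 1000.0, "user": 200.0, "privileged": 100.0}
t1 = {"idle": 1240.0, "user": 245.0, "privileged": 115.0}

usage = cpu_percent_by_mode(t0, t1, 300)
busy = 100.0 - usage["idle"]
print(usage)  # {'idle': 80.0, 'user': 15.0, 'privileged': 5.0}
print(busy)   # 20.0
```

In this example the core spent 80% of the window idle, so the non-idle modes account for the remaining 20%.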

Memory

Use the following queries to determine memory consumption in your Windows host:

  • 100* (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes

  • windows_os_physical_memory_free_bytes

Additionally, you can use the following expression to alert when memory utilization exceeds a defined threshold (95% in this example):

100 * (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes  > 95
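
As a worked example of the expression above, this Python sketch computes the same percentage from a total and a free byte count and compares it with the threshold. The function name and the 16 GiB host are hypothetical values chosen for illustration.

```python
def memory_used_percent(total_bytes, free_bytes):
    """Percentage of physical memory in use: 100 * (total - free) / total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


GIB = 1024 ** 3
THRESHOLD = 95.0  # same threshold as the alert expression above

# Hypothetical host with 16 GiB of physical memory and 0.5 GiB free.
used = memory_used_percent(16 * GIB, GIB // 2)
print(used)              # 96.875
print(used > THRESHOLD)  # True
```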

Disk

You can monitor disk capacity with windows_logical_disk_free_bytes and windows_logical_disk_size_bytes.

The following query returns the disks that are more than 95% full:

100 * (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes  > 95

Another factor to consider when you measure disk usage is IOPS. To monitor the write operations, use this query:

rate(windows_logical_disk_writes_total[5m])

Network

You can monitor the network error rate for inbound and outbound packets with the following queries:

100 * rate(windows_net_packets_received_errors[5m]) / (rate(windows_net_packets_received_errors[5m]) + rate(windows_net_packets_received_total[5m])>0)  > 75

100 * rate(windows_net_packets_outbound_errors[5m]) / (rate(windows_net_packets_outbound_errors[5m]) + rate(windows_net_packets_sent_total[5m])>0)  > 75
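
In both queries the > 0 filter on the denominator drops interfaces with no traffic, which avoids dividing by zero. The Python sketch below mirrors that logic for a single interface. The function name and the sample rates are hypothetical.

```python
def error_percent(error_rate, packet_rate):
    """Errors as a percentage of all packets, or None for an idle interface."""
    denominator = error_rate + packet_rate
    if denominator <= 0:
        # Mirrors the `> 0` filter: an interface without traffic
        # produces no result instead of a division by zero.
        return None
    return 100.0 * error_rate / denominator


# Hypothetical per-second rates for three interfaces.
print(error_percent(0.0, 0.0))  # None
print(error_percent(3.0, 1.0))  # 75.0
print(error_percent(9.0, 1.0))  # 90.0
```

Only the last interface would match the > 75 condition, because the comparison is strict.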

Windows Process Monitoring

You can track the resources that each process consumes with the metric windows_process_cpu_time_total for CPU and the metric windows_process_working_set_bytes for memory.

You can track input and output operations by process with the metric windows_process_io_operations_total. This metric helps you identify processes that can overload your system.

Windows Service Monitoring

You can monitor the status and health of the services in your environment.

Use this query to count the running services, aggregated by status:

count by (status,instance)((windows_service_status > 0) * on (name) group_left(state) (windows_service_state{state=~"running"} > 0))
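
The join in this query can be hard to read, so here is a minimal Python sketch of what it computes: it keeps the status series with a value above 0, matches them by name against the services whose running state is above 0, and counts the result by status and instance. The function name, the service names, and the (labels, value) representation are hypothetical, and the sketch assumes each service name appears once on the state side.

```python
from collections import Counter


def count_running_by_status(status_series, state_series):
    """Count services in the running state, grouped by (status, instance)."""
    # Names of services whose "running" state series has a value above 0.
    running = {
        labels["name"]
        for labels, value in state_series
        if labels["state"] == "running" and value > 0
    }
    counts = Counter()
    for labels, value in status_series:
        # Keep status series above 0 whose name matches a running service.
        if value > 0 and labels["name"] in running:
            counts[(labels["status"], labels["instance"])] += 1
    return dict(counts)


vm = "windows-vm-demo"
# Hypothetical series, each written as (labels, value).
status_series = [
    ({"name": "spooler", "status": "ok", "instance": vm}, 1),
    ({"name": "spooler", "status": "error", "instance": vm}, 0),
    ({"name": "w32time", "status": "ok", "instance": vm}, 1),
    ({"name": "wuauserv", "status": "error", "instance": vm}, 1),
    ({"name": "bits", "status": "ok", "instance": vm}, 1),
]
state_series = [
    ({"name": "spooler", "state": "running", "instance": vm}, 1),
    ({"name": "w32time", "state": "running", "instance": vm}, 1),
    ({"name": "wuauserv", "state": "running", "instance": vm}, 1),
    ({"name": "bits", "state": "running", "instance": vm}, 0),
    ({"name": "bits", "state": "stopped", "instance": vm}, 1),
]

counts = count_running_by_status(status_series, state_series)
print(counts)
```

Here two running services report ok and one reports error, while bits is excluded because it is not running.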

To identify behavior that is critical for your infrastructure, you need to know the properties of your services: state, start mode, and status.

For state, focus on stopped and running. For start mode, the values are auto, manual, and disabled. For status, the values to watch are ok and error.

With those properties defined, you can count the services that report an error status with the following query:

count(windows_service_status{status=~"error"} > 0)

You can also list the services that are disabled with the following query:

sum by(name,instance) (windows_service_start_mode{start_mode=~"disabled"} > 0)

Agent Configuration

This integration has no default agent job.