Windows

Metrics, Dashboards, Alerts and more for Windows Integration in Sysdig Monitor.
This integration is disabled by default. See Enable and Disable Integrations to enable it in your account.

This integration has 77 metrics.

List of Alerts

Alert | Description | Format
[Windows] High CPU Usage | The CPU of the Windows instance reached 95% of use | Prometheus
[Windows] High Disk Usage | Disk full over 95% in instance {{$labels.instance}} | Prometheus
[Windows] High Physical Memory Usage | High physical memory usage in instance | Prometheus
[Windows] High Network Inbound Errors | High inbound network error rate in instance | Prometheus
[Windows] High Network Outbound Errors | High outbound network error rate in instance | Prometheus
[Windows] Increase of Disk writes time | Increase of Disk writes time | Prometheus
[Windows] Queue of Writes and reads Disk operations is growing | The queue for writes and reads disk operations is growing | Prometheus
[Windows] High percent of swap space used | The swap space has reached high amount of used | Prometheus
[Windows] Network bandwidth is reaching its limit | Network bandwidth use is reaching its limit | Prometheus
[Windows] High number of transitions virtual addresses into disk | The rate at which pages transition to resident memory without being written to disk has reached problematic limit | Prometheus

List of Dashboards

Windows Host Overview

The dashboard provides information about the Windows host.

Windows Process Overview

The dashboard provides information about the Windows processes.

Windows Services Overview

The dashboard provides information about the Windows services.

Windows Node Overview (Legacy)

The dashboard provides information about the Windows nodes (legacy).

List of Metrics

Metric name
windows_cpu_core_frequency_mhz
windows_cpu_time_total
windows_cs_physical_memory_bytes
windows_logical_disk_free_bytes
windows_logical_disk_read_bytes_total
windows_logical_disk_reads_total
windows_logical_disk_requests_queued
windows_logical_disk_size_bytes
windows_logical_disk_split_ios_total
windows_logical_disk_write_bytes_total
windows_logical_disk_write_seconds_total
windows_logical_disk_writes_total
windows_memory_transition_faults_total
windows_net_bytes_received_total
windows_net_bytes_sent_total
windows_net_bytes_total
windows_net_current_bandwidth_bytes
windows_net_packets_outbound_discarded_total
windows_net_packets_outbound_errors
windows_net_packets_outbound_errors_total
windows_net_packets_received_discarded_total
windows_net_packets_received_errors
windows_net_packets_received_errors_total
windows_net_packets_received_total
windows_net_packets_sent_total
windows_os_paging_free_bytes
windows_os_paging_limit_bytes
windows_os_physical_memory_free_bytes
windows_os_processes
windows_os_users
windows_os_virtual_memory_bytes
windows_os_virtual_memory_free_bytes
windows_process_cpu_time_total
windows_process_io_bytes_total
windows_process_io_operations_total
windows_process_threads
windows_process_working_set_bytes
windows_service_info
windows_service_start_mode
windows_service_state
windows_service_status
windows_system_context_switches_total
windows_system_processor_queue_length
windows_system_system_up_time
windows_system_threads
wmi_cpu_core_frequency_mhz
wmi_cpu_time_total
wmi_cs_physical_memory_bytes
wmi_logical_disk_free_bytes
wmi_logical_disk_read_bytes_total
wmi_logical_disk_reads_total
wmi_logical_disk_requests_queued
wmi_logical_disk_size_bytes
wmi_logical_disk_split_ios_total
wmi_logical_disk_write_bytes_total
wmi_logical_disk_writes_total
wmi_net_bytes_received_total
wmi_net_bytes_sent_total
wmi_net_bytes_total
wmi_net_current_bandwidth
wmi_net_packets_outbound_discarded
wmi_net_packets_outbound_errors
wmi_net_packets_received_discarded
wmi_net_packets_received_errors
wmi_net_packets_received_total
wmi_net_packets_sent_total
wmi_os_paging_free_bytes
wmi_os_paging_limit_bytes
wmi_os_physical_memory_free_bytes
wmi_os_processes
wmi_os_users
wmi_os_virtual_memory_bytes
wmi_os_virtual_memory_free_bytes
wmi_system_context_switches_total
wmi_system_processor_queue_length
wmi_system_system_up_time
wmi_system_threads

Prerequisites

Windows Prometheus Bundle

The Sysdig Windows Prometheus Bundle is a package that installs and configures a Prometheus Agent and the Windows Exporter, so that you can send metrics from Windows machines to your Sysdig Monitor account.

Getting Started

To begin monitoring your Windows machines, follow these steps:

  1. Download the binary installer from the latest release of this project
  2. Run the installer on your Windows machine
  3. Configure the Sysdig region and your Sysdig API token in the wizard
  4. Select the collectors that you want to enable to produce metrics
  5. Finish the installation
  6. Go to your Sysdig Monitor account and start using the Microsoft Windows dashboards and alerts

Automated installation

You can automate the installation of the Sysdig Windows Prometheus Bundle across multiple machines using the command line or PowerShell.

Use the following command as an example:

msiexec /i windows_exporter-1.0.0-x64.msi ENABLED_COLLECTORS=cpu,os SYSDIG_URL="https://api.sysdigcloud.com/prometheus/remote/write" SYSDIG_TOKEN="yyyyyyy-zzzz-zzzz-zzzz-xxxxxxxx" /qn

This command installs the Sysdig Windows Prometheus Bundle silently (/qn) with the specified settings, which makes it easy to deploy across your infrastructure.

By default, the Prometheus config file is installed in the path C:\Program Files\windows_exporter\prometheus.yml, which can be manually edited to include additional Prometheus jobs.
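When the same installation has to be repeated on many machines, it can help to generate the msiexec arguments from a small set of inputs. The following is a minimal Python sketch, not part of the bundle: the function name build_install_command and the token placeholder are hypothetical, and it only uses the options shown in the example command above.

```python
def build_install_command(msi_path, collectors, sysdig_url, sysdig_token, extra=None):
    """Return the msiexec argument list for a silent install.

    extra is an optional dict of additional installer options,
    for example {"PROMETHEUS_PORT": "9090"}.
    """
    args = [
        "msiexec", "/i", msi_path,
        "ENABLED_COLLECTORS=" + ",".join(collectors),
        f"SYSDIG_URL={sysdig_url}",
        f"SYSDIG_TOKEN={sysdig_token}",
    ]
    for name, value in (extra or {}).items():
        args.append(f"{name}={value}")
    args.append("/qn")  # silent install, as in the example command
    return args


# Placeholder values; replace the token with your own Sysdig API token.
command = build_install_command(
    "windows_exporter-1.0.0-x64.msi",
    ["cpu", "os"],
    "https://api.sysdigcloud.com/prometheus/remote/write",
    "<your-api-token>",
)
print(" ".join(command))
```

The sketch only builds and prints the command. How you run it on each machine (for example from a deployment tool or a remote PowerShell session) depends on your environment.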

Options and parameters

From the command line you can use these options:

  • ENABLED_COLLECTORS: Comma separated list of collectors
  • SYSDIG_URL: The Prometheus endpoint of your Sysdig Monitor region in the form https://api.sysdigcloud.com/prometheus/remote/write. The endpoint depends on your region; consult the list of available Sysdig regions.
  • SYSDIG_TOKEN: Your Sysdig API token.
  • COMPUTER_NAME (optional): Overrides the label instance in metrics generated by the Windows Exporter with a custom value. The default value is the computer name stored in the COMPUTERNAME Windows environment variable.
  • PROMETHEUS_PORT (optional): The Prometheus port. The default value is ‘9090’.
  • PROMETHEUS_LOG_ENABLED (optional): Enables Prometheus logging, which writes a log file for the Prometheus agent to the windows_exporter folder. The default value is ‘0’.
  • PROMETHEUS_LOG_LEVEL (optional): The Prometheus log level. It applies only when logging is enabled. The default value is ‘info’.
  • WINDOWS_EXPORTER_LISTEN_ADDR (optional): The Windows Exporter IP address. The default value is ‘0.0.0.0’.
  • WINDOWS_EXPORTER_LISTEN_PORT (optional): The Windows Exporter port. The default value is ‘9182’.
  • WINDOWS_EXPORTER_EXTRA_FLAGS (optional): Windows Exporter additional CLI flags. The default value is an empty string.
  • WINDOWS_EXPORTER_FIREWALL_REMOTE_ADDR (optional): Comma separated remote IP addresses for the Windows Firewall exception (allow list). The default value is an empty string (any remote address).
  • TEXTFILE_DIR (only if textfile collector is enabled): The local folder where the textfile collector will look for files.

Automated uninstallation

Use the following command to uninstall:

msiexec /x windows_exporter-1.0.0-x64.msi /qn

Installation

Installing an exporter is not required for this integration.

Monitoring and Troubleshooting Windows

This document describes important metrics and queries that you can use to monitor and troubleshoot Windows.

Windows Host Monitoring

CPU

Because CPU usage is critical, you should know how CPU time is distributed across modes. With 100 * avg by (mode) (rate(windows_cpu_time_total[5m])) you can identify which mode consumes the processor the most. One tip for this visualization is to watch the idle mode: time not spent idle is actual CPU usage, so a falling idle share means the other modes are consuming more of the processor.

For environments with large machines and many cores, you can use the 100 * sum by (core) (rate(windows_cpu_time_total{mode != 'idle'}[5m])) query to check for potential peaks on each core and verify that the cores share the load evenly.
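To make the arithmetic behind the per-mode query concrete, the following Python sketch reproduces it on hand-written samples. It is a simplification under stated assumptions: the function name and the sample values are hypothetical, and it ignores the counter resets and extrapolation that the PromQL rate() function handles.

```python
from collections import defaultdict


def cpu_percent_by_mode(start, end, interval_seconds):
    """Approximate 100 * avg by (mode) (rate(windows_cpu_time_total[...])).

    start and end map (core, mode) to the counter value, in seconds, at the
    beginning and at the end of the window.
    """
    rates = defaultdict(list)
    for (core, mode), end_value in end.items():
        # Per-second increase of the counter for this core and mode.
        per_second = (end_value - start[(core, mode)]) / interval_seconds
        rates[mode].append(per_second)
    # Average the per-core rates of each mode and express them as a percentage.
    return {mode: 100 * sum(values) / len(values) for mode, values in rates.items()}


# Hypothetical samples for two cores over a 300-second window.
start = {("0,0", "idle"): 1000.0, ("0,0", "user"): 200.0,
         ("0,1", "idle"): 1100.0, ("0,1", "user"): 100.0}
end = {("0,0", "idle"): 1150.0, ("0,0", "user"): 350.0,
       ("0,1", "idle"): 1325.0, ("0,1", "user"): 175.0}

print(cpu_percent_by_mode(start, end, 300))
```

With these samples the idle share is 62.5% and the user share is 37.5%, so the CPU is in use a little over a third of the time.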

Memory

Use the following queries to determine memory consumption on your Windows host:

  • 100 * (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes

  • windows_os_physical_memory_free_bytes

Additionally, you can use the following alert expression, which fires when memory utilization is greater than the defined threshold (95% in this example):

100 * (windows_cs_physical_memory_bytes - windows_os_physical_memory_free_bytes) / windows_cs_physical_memory_bytes  > 95
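As a worked example of the expression above, this short Python sketch applies the same formula to a single host. The function names and the host figures are hypothetical and only illustrate the arithmetic.

```python
def memory_used_percent(total_bytes, free_bytes):
    """Mirror 100 * (total - free) / total from the memory query."""
    return 100 * (total_bytes - free_bytes) / total_bytes


def memory_alert_fires(total_bytes, free_bytes, threshold=95):
    """Return True when utilization is strictly greater than the threshold."""
    return memory_used_percent(total_bytes, free_bytes) > threshold


# Hypothetical host: 16 GiB of physical memory with 512 MiB free.
total = 16 * 2**30
free = 512 * 2**20
print(memory_used_percent(total, free), memory_alert_fires(total, free))
```

For this host the utilization is 96.875%, which is above the 95% threshold, so the alert condition is met.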

Disk

You can monitor disk capacity with windows_logical_disk_free_bytes and windows_logical_disk_size_bytes.

The following query detects disks that are reaching their maximum capacity, returning those with more than 95% of their space in use:

100 * (windows_logical_disk_size_bytes - windows_logical_disk_free_bytes) / windows_logical_disk_size_bytes  > 95

Another factor to consider when you measure disk usage is IOPS. To monitor the write operations, use this query:

rate(windows_logical_disk_writes_total[5m])

Network

You can monitor the network error rate for inbound and outbound packets with the following queries:

100 * rate(windows_net_packets_received_errors[5m]) / (rate(windows_net_packets_received_errors[5m]) + rate(windows_net_packets_received_total[5m])>0)  > 75

100 * rate(windows_net_packets_outbound_errors[5m]) / (rate(windows_net_packets_outbound_errors[5m]) + rate(windows_net_packets_sent_total[5m])>0)  > 75
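The structure of these two queries can be sketched in Python as follows. The function names are hypothetical. Returning None stands in for the "> 0" filter, which drops series whose denominator is zero instead of dividing by it.

```python
def error_rate_percent(errors_per_second, packets_per_second):
    """Percentage of errors over errors plus packets, or None without traffic."""
    denominator = errors_per_second + packets_per_second
    if denominator <= 0:
        # No traffic at all: the query would return no sample for this series.
        return None
    return 100 * errors_per_second / denominator


def error_alert_fires(errors_per_second, packets_per_second, threshold=75):
    """Return True when the error percentage is strictly above the threshold."""
    percent = error_rate_percent(errors_per_second, packets_per_second)
    return percent is not None and percent > threshold


# Hypothetical rates: 4 errors per second against 1 packet per second.
print(error_rate_percent(4, 1), error_alert_fires(4, 1))
```

Because the comparison is strict, a series at exactly 75% does not trigger the condition.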

Windows Process Monitoring

You can monitor the processes on your machine and track the resources that each one consumes: use the metric windows_process_cpu_time_total for CPU and the metric windows_process_working_set_bytes for memory.

You can track input and output operations per process with the metric windows_process_io_operations_total. This metric helps you identify processes that may be overloading your system.

Windows Service Monitoring

You can monitor the status and health of the services in your environment.

Use this query to count the services that are running, aggregated by status and instance:

count by (status,instance)((windows_service_status > 0) * on (name) group_left(state) (windows_service_state{state=~"running"} > 0))

To identify the behavior that is critical for your infrastructure, you need to understand the properties and states of your services.

For state, focus on stopped and running. For start mode, the values are auto, manual, and disabled. For status, the values to watch are ok and error.

With those properties defined, you can count the services in error status with the following query:

count(windows_service_status{status=~"error"} > 0)

You can also list the services that are disabled with the following query:

sum by(name,instance) (windows_service_start_mode{start_mode=~"disabled"} > 0)

Agent Configuration

This integration has no default agent job.