Sysdig Documentation

Logging and Troubleshooting

Logging

After the Agent begins scraping Prometheus metrics, there may be a delay of up to a few minutes before the metrics become visible in Sysdig Monitor. To help you quickly confirm that your configuration is correct, Agent version 0.80.0 and later write the following line to the Agent log the first time after startup that the Agent finds and successfully scrapes at least one Prometheus exporter:

2018-05-04 21:42:10.048, 8820, Information, 05-04 21:42:10.048324 Starting export of Prometheus metrics

Because this is an INFO-level log message, it appears in Agents using the default logging settings. For more detail, increase the Agent log level to DEBUG, which produces a message like the following that shows the name of one of the first metrics detected. You can then look for this metric in Sysdig Monitor shortly afterward.

2018-05-04 21:50:46.068, 11212, Debug, 05-04 21:50:46.068141 First prometheus metrics since agent start: pid 9583: 5 metrics including: randomSummary.95percentile
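The two log lines above can be searched for programmatically. The following is a minimal sketch, not part of the Agent: the function name find_prometheus_markers is hypothetical, the marker strings are copied from the sample lines above, and the location of the Agent log file is not given in this section, so the function takes any iterable of lines rather than a path.

```python
import re

# Marker strings copied from the sample log lines in this section.
INFO_MARKER = "Starting export of Prometheus metrics"
DEBUG_MARKER = "First prometheus metrics since agent start"


def find_prometheus_markers(lines):
    """Return (started, first_metric) for an iterable of Agent log lines.

    started      -- True if the INFO-level "Starting export" line was seen
    first_metric -- metric name taken from the DEBUG-level line, or None
    """
    started = False
    first_metric = None
    for line in lines:
        if INFO_MARKER in line:
            started = True
        elif DEBUG_MARKER in line and first_metric is None:
            # The sample DEBUG line ends with "including: <metric name>".
            match = re.search(r"including:\s*(\S+)", line)
            if match:
                first_metric = match.group(1)
    return started, first_metric
```

To use it, open your Agent log file and pass the file object to the function. If started is False a few minutes after the Agent starts, the configuration is worth revisiting.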

Troubleshooting

See the previous section for the log messages expected during successful scraping. If you have enabled Prometheus and do not see the Starting export message shown there, revisit your configuration.

It is also recommended to leave the log_errors configuration option at its default setting of log_errors: true, so that any problems scraping eligible processes are reported in the Agent log.

For example, here is an error message for a failed scrape of a TCP port that was listening but not accepting HTTP requests:

2017-10-13 22:00:12.076, 4984, Error, sdchecks[4987] Exception on running check prometheus.5000: Exception('Timeout when hitting http://localhost:5000/metrics',)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, Traceback (most recent call last):
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/sdchecks.py", line 246, in run
2017-10-13 22:00:12.076, 4984, Error, sdchecks, self.check_instance.check(self.instance_conf)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 44, in check
2017-10-13 22:00:12.076, 4984, Error, sdchecks, metrics = self.get_prometheus_metrics(query_url, timeout, "prometheus")
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 105, in get_prometheus_metrics
2017-10-13 22:00:12.077, 4984, Error, sdchecks, raise Exception("Timeout when hitting %s" % url)
2017-10-13 22:00:12.077, 4984, Error, sdchecks, Exception: Timeout when hitting http://localhost:5000/metrics
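To reproduce this kind of failure outside the Agent, you can probe the endpoint yourself with a short timeout. The following is a rough sketch using only the Python standard library. It is not the Agent's own check code, and the function name probe_metrics_endpoint, the result strings, and the default timeout are all assumptions made for illustration.

```python
import http.client
import socket
import urllib.error
import urllib.request


def probe_metrics_endpoint(url, timeout=1.0):
    """Classify how `url` answers an HTTP GET within `timeout` seconds."""
    # Bypass any proxy from the environment so a localhost URL is hit directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return "ok"
    except urllib.error.HTTPError:
        return "http_error"  # server spoke HTTP but returned an error status
    except (socket.timeout, TimeoutError):
        return "timeout"  # connected, but no response arrived in time
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"  # timed out while connecting
        return "unreachable"  # e.g. connection refused
    except http.client.HTTPException:
        return "invalid_http"  # something answered, but not with valid HTTP
    except OSError:
        return "unreachable"
```

For the port in the example above, a call such as probe_metrics_endpoint("http://localhost:5000/metrics") would be expected to return "timeout" when the port accepts TCP connections but never answers the HTTP request.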

The next example is a failed scrape of a port that responded to HTTP requests on the /metrics endpoint but did not return valid Prometheus-format data. The invalid endpoint responds as follows:

# curl http://localhost:5002/metrics
This ain't no Prometheus metrics!

The corresponding error messages in the Agent log indicate that no further scraping will be attempted after the initial failure:

2017-10-13 22:03:05.081, 5216, Information, sdchecks[5219] Skip retries for Prometheus error: could not convert string to float: ain't
2017-10-13 22:03:05.082, 5216, Error, sdchecks[5219] Exception on running check prometheus.5002: could not convert string to float: ain't
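Before pointing the Agent at an endpoint, you can run a quick sanity check on the response body. The sketch below is a deliberately simplified stand-in for a real Prometheus text-format parser and is not the Agent's implementation. The function name find_invalid_samples is hypothetical, and the code assumes that label values contain no spaces. It treats the second whitespace-separated token on each sample line as the value, which is consistent with the "ain't" reported in the log above.

```python
def find_invalid_samples(body):
    """Return (line_number, value_token) for each sample line whose value
    does not parse as a float.

    Simplified: assumes label values contain no whitespace.
    """
    problems = []
    for number, line in enumerate(body.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE/comment lines carry no sample
        fields = line.split()
        if len(fields) < 2:
            problems.append((number, None))  # no value token at all
            continue
        try:
            float(fields[1])  # also accepts "NaN", "+Inf", and "-Inf"
        except ValueError:
            problems.append((number, fields[1]))
    return problems
```

An empty result means every sample line had a numeric value. Any entries returned point to the lines most likely to make the scrape fail.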