(Legacy) Logging and Troubleshooting
Logging
After the Agent begins scraping Prometheus metrics, there may be a delay of up to a few minutes before the metrics become visible in Sysdig Monitor. To help quickly confirm your configuration is correct, starting with Agent version 0.80.0, the following log line will appear in the Agent log the first time since starting that it has found and is successfully scraping at least one Prometheus exporter:
2018-05-04 21:42:10.048, 8820, Information, 05-04 21:42:10.048324 Starting export of Prometheus metrics
As this is an INFO level log message, it will appear in Agents using the default logging settings. To reveal even more detail,increase the Agent log level to DEBUG , which produces a message like the following that reveals the name of a specific metric first detected. You can then look for this metric to be visible in Sysdig Monitor shortly after.
2018-05-04 21:50:46.068, 11212, Debug, 05-04 21:50:46.068141 First prometheus metrics since agent start: pid 9583: 5 metrics including: randomSummary.95percentile
Troubleshooting
See the previous section for information on expected log messages during
successful scraping. If you have enabled Prometheus and are not seeing
the Starting export
message shown there, revisit your
configuration.
It is also suggested to leave the configuration option in its default
setting of log_errors: true
, which will reveal any issues
scraping eligible processes in the Agent log.
For example, here is an error message for a failed scrape of a TCP port that was listening but not accepting HTTP requests:
2017-10-13 22:00:12.076, 4984, Error, sdchecks[4987] Exception on running check prometheus.5000: Exception('Timeout when hitting http://localhost:5000/metrics',)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, Traceback (most recent call last):
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/sdchecks.py", line 246, in run
2017-10-13 22:00:12.076, 4984, Error, sdchecks, self.check_instance.check(self.instance_conf)
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 44, in check
2017-10-13 22:00:12.076, 4984, Error, sdchecks, metrics = self.get_prometheus_metrics(query_url, timeout, "prometheus")
2017-10-13 22:00:12.076, 4984, Error, sdchecks, File "/opt/draios/lib/python/checks.d/prometheus.py", line 105, in get_prometheus_metrics
2017-10-13 22:00:12.077, 4984, Error, sdchecks, raise Exception("Timeout when hitting %s" % url)
2017-10-13 22:00:12.077, 4984, Error, sdchecks, Exception: Timeout when hitting http://localhost:5000/metrics
Here is an example error message for a failed scrape of a port that was
responding to HTTP requests on the /metrics
endpoint but not
responding with valid Prometheus-format data. The invalid endpoint is
responding as follows:
# curl http://localhost:5002/metrics
This ain't no Prometheus metrics!
And the corresponding error message in the Agent log, indicating no further scraping will be attempted after the initial failure:
2017-10-13 22:03:05.081, 5216, Information, sdchecks[5219] Skip retries for Prometheus error: could not convert string to float: ain't
2017-10-13 22:03:05.082, 5216, Error, sdchecks[5219] Exception on running check prometheus.5002: could not convert string to float: ain't
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.