Implement probe_failure_info Metric
#1334
Conversation
I've finally gotten around to finishing up this PR 🎉 Apologies in advance to the reviewers for its size. The main reason it got this big is that the new metric tries to capture error messages for every failure path in the code base. Additionally, all tests have been augmented with checks that they fail for the right reason. This might also make it easier to spot future regressions entirely unrelated to this PR.
What is the use case this metric is trying to solve, and why can't that use case be solved by logs? If we are missing logs for some failures, I would be happy to accept a PR that improves failure logging.
Hi, thanks for looking into this! Our use case is the following: we use one blackbox exporter instance to monitor targets run by many different customers. When a probe fails, we send out an alert via Prometheus + Alertmanager to the specific entity responsible for that particular service. Currently, the available metrics offer only scarce and incomplete information about why a probe failed, so as far as I can tell there is no easy way to include such information in a Prometheus-generated alert.

Especially when failures are not consistent, the recipient of the alert needs to create a ticket, and someone with access to the logs (which cover all monitoring targets and shouldn't be available to end users responsible for only a single target) needs to search for the log lines correlated with that particular alert. This is further complicated by the fact that failure reasons for probes are only visible with debug logging enabled, which generates so many logs that a longer retention isn't practical.

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations. If there was a metric containing some basic failure reasons, all of this could be automated away trivially.
We made some changes to the way the logger works in PR #1461, and even updated the log levels of some log lines. Please try running a build from main. If you notice probe error logs that are logged at debug or info level but should be logged at warn or error level, please send a PR to fix the log level or add the missing logs.

We also allow retrieving logs, and added support for fetching logs by target in PR #1063, so users can attach direct links to the logs in alerts. You can use that endpoint or another source to enrich the alert with more data.
Thank you for these suggestions! Using the logs endpoint seems promising. However, this brings up the following two issues:
@slrtbtfs #1257 is merged into main and will be included in the next release.
Thanks a lot, I'm closing this PR then. |
Fixes #1077.

This makes information about probe failures that is already logged additionally available via the `probe_failure_info` metric. The underlying mechanisms are mainly implemented in `prober/prober.go` and `prober/handler.go`; the rest of the diff is mostly refactoring.

The main change is adding a new type, `ProbeResult`, which replaces the boolean return type of the probe functions. For failed probes this type includes diagnostic information, which is then published in the `probe_failure_info` metric. If an error message contains information that is very large or may change frequently, leading to high cardinality, it is not included in this metric and is instead only logged.
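For illustration, here is a minimal sketch of how such a result type and the metric registration could look. Apart from the `probe_failure_info` metric name, everything below (the `ProbeResult` fields, the helper constructors, and the export function) is an assumption for illustration, not the PR's actual code:

```go
package prober

import "github.com/prometheus/client_golang/prometheus"

// ProbeResult replaces the plain bool returned by probe functions.
// Field and helper names here are illustrative, not the PR's exact API.
type ProbeResult struct {
	Success       bool
	FailureReason string // short, low-cardinality description of the failure
}

func Succeeded() ProbeResult { return ProbeResult{Success: true} }

func FailedWith(reason string) ProbeResult {
	return ProbeResult{Success: false, FailureReason: reason}
}

// exportFailureInfo sketches how the reason could be published on the
// per-probe registry as the probe_failure_info metric.
func exportFailureInfo(registry *prometheus.Registry, result ProbeResult) {
	failureInfo := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "probe_failure_info",
			Help: "Reason a probe failed, as a label; set to 1 when the probe fails.",
		},
		[]string{"reason"},
	)
	registry.MustRegister(failureInfo)
	if !result.Success {
		failureInfo.WithLabelValues(result.FailureReason).Set(1)
	}
}
```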
Since this new type is used in almost every probe function and their tests, the diff got quite big; I hope that doesn't make the review too cumbersome. As a side effect, most tests now check not only whether a probe succeeded, but also whether a failed probe failed with the correct error message. Having the error message in the return type also allowed removing some more complex testing code that used to extract this information from the prober logs.
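As a rough illustration of that test pattern, a failure test might look like the sketch below. The probe function, its signature, and the expected reason string are placeholders for this example, not taken from the PR:

```go
package prober

import (
	"context"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// probeSomething is a stub standing in for a real probe function; real probes
// would additionally take a module config and a logger.
func probeSomething(ctx context.Context, target string, registry *prometheus.Registry) ProbeResult {
	return FailedWith("connection refused")
}

func TestProbeFailureReason(t *testing.T) {
	registry := prometheus.NewRegistry()
	result := probeSomething(context.Background(), "unreachable.example.invalid", registry)

	if result.Success {
		t.Fatal("expected probe to fail")
	}
	// Assert the probe failed for the expected reason rather than an unrelated one.
	if want := "connection refused"; result.FailureReason != want {
		t.Fatalf("unexpected failure reason: got %q, want %q", result.FailureReason, want)
	}
}
```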
Before merging, this should still get some improved tests and polishing; however, I'd like to get some early feedback on whether you agree with the implementation approach in general. There is still room for improvement, as some error messages provide only limited detail (e.g. the generic `HTTP request failed` one). IMO this makes most sense as a follow-up once the basic implementation of this metric is merged.