Conversation


@slrtbtfs slrtbtfs commented Dec 12, 2024

Fixes #1077.

This makes information about probe failures that is already logged additionally available via the probe_failure_info metric.

The underlying mechanisms are mainly implemented in prober/prober.go and prober/handler.go, the rest of the diff is mostly refactoring.

The main change is adding a new type, ProbeResult, which replaces the boolean return type of probes. For failed probes this type includes diagnostic information, which is then published in the probe_failure_reason metric. If an error message contains information that is very large or may change frequently, leading to high cardinality, it is not included in this metric and is instead only logged.
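
To make the shape of this change concrete, here is a minimal, hypothetical sketch of what such a type and metric could look like. It is not taken from the actual diff; apart from ProbeResult and probe_failure_reason, which are named above, all field, function, and label names are illustrative.

```go
// Hypothetical sketch (not taken from the actual diff): a ProbeResult type
// carrying a success flag plus a short, bounded-cardinality failure reason,
// and a helper that publishes that reason via a labelled gauge.
package prober

import "github.com/prometheus/client_golang/prometheus"

// ProbeResult replaces the plain bool previously returned by probe functions.
type ProbeResult struct {
	Success bool
	// Reason holds a short description of why the probe failed. Large or
	// frequently changing details stay in the logs to keep cardinality low.
	Reason string
}

// Failure builds a failed result with the given reason.
func Failure(reason string) ProbeResult {
	return ProbeResult{Success: false, Reason: reason}
}

// publishFailureReason exposes the failure reason on the per-probe registry
// as a gauge with a "reason" label, set to 1 for the observed reason.
func publishFailureReason(registry *prometheus.Registry, result ProbeResult) {
	if result.Success {
		return
	}
	reasonGauge := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "probe_failure_reason",
		Help: "Reason why the probe failed.",
	}, []string{"reason"})
	registry.MustRegister(reasonGauge)
	reasonGauge.WithLabelValues(result.Reason).Set(1)
}
```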

Since this new type is used in almost every probe function and their tests, the diff got quite big. I hope that doesn't make the review too cumbersome.

As a side effect, most tests now not only check whether a probe succeeded, but also whether a failed probe failed with the correct error message. Having the error message in the return type also made it possible to remove some more complex testing code that used to extract this information from the prober logs.
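
As an illustration of this testing pattern, here is a hedged sketch building on the hypothetical ProbeResult type above; probeUnreachableTarget is a made-up stand-in for invoking one of the real probers, and the expected reason string is arbitrary.

```go
package prober

import "testing"

// probeUnreachableTarget is an illustrative stand-in for running one of the
// real probe functions against a target that is known to be unreachable.
func probeUnreachableTarget(t *testing.T) ProbeResult {
	t.Helper()
	return Failure("connection refused")
}

// With the failure reason in the return value, tests can assert on it
// directly instead of extracting it from the prober logs.
func TestProbeFailsWithExpectedReason(t *testing.T) {
	result := probeUnreachableTarget(t)
	if result.Success {
		t.Fatal("expected probe to fail")
	}
	if want := "connection refused"; result.Reason != want {
		t.Errorf("unexpected failure reason: got %q, want %q", result.Reason, want)
	}
}
```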

Before merging, this should still get some improved tests and polishing; however, I'd like to get some early feedback on whether you agree with the implementation approach in general.

There is still room for improvement, as some error messages provide only limited detail (e.g. the generic "HTTP request failed" one). IMO this makes most sense as a follow-up once the basic implementation of this metric is merged.

@github-actions github-actions bot added the stale label Feb 13, 2025
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch 7 times, most recently from 0d443ed to 274a2f2 Compare February 27, 2025 14:29
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch 2 times, most recently from 8b4dc6f to af44658 Compare April 11, 2025 14:01
@github-actions github-actions bot removed the stale label Apr 14, 2025
slrtbtfs added 16 commits April 28, 2025 15:09
slrtbtfs added 12 commits April 28, 2025 15:12
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch from 15e8828 to 692d921 Compare April 28, 2025 13:15
@slrtbtfs slrtbtfs marked this pull request as ready for review April 28, 2025 14:50
@slrtbtfs
Author

I've finally got around to finishing up this PR 🎉

Apologies in advance to the reviewers for the size of the diff.

The main reason it got that big is that the new metric tries to capture error messages for every failure path in the code base. Additionally, all tests have been augmented with checks that they fail for the right reason. This might also make it easier to spot future regressions entirely unrelated to this PR.

@github-actions github-actions bot added the stale label Jul 1, 2025
@electron0zero
Member

I see that the probe_failure_info metric is trying to cover something that should be covered by logs, and because of that I don't see a strong need for it. Metrics and logs are suited to different things, and this feels like an area where logs are a better fit than an info metric.

What is the use case this metric is trying to solve, and why can't that use case be solved by logs? If we are missing logs for some failures, I would be happy to accept a PR that improves failure logging.

@slrtbtfs
Author

Hi, thanks for looking into this!

The use case we have for this is the following:

We use one Blackbox instance to monitor targets run by many different customers.

When a probe fails, we send out an alert using Prometheus + Alertmanager to the entity responsible for that particular service.

Currently, the available metrics offer only scarce and incomplete information about why a probe failed, so AFAICT there is no easy way to include such information in a Prometheus-generated alert.

Especially in cases where failures are not consistent, the recipient of the alert needs to create a ticket, and someone with access to the logs (which cover all monitoring targets and shouldn't be available to end users responsible for only a single target) needs to search for the logs correlated with that particular alert.

This is further complicated by the fact that failure reasons for probes are only visible with debug logging enabled for probes, which generates so many logs that longer retention isn't practical.

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

If there were a metric containing some basic failure reasons, all of this could be trivially automated.

@electron0zero
Member

This is further complicated by the fact that failure reasons for probes are only visible with debug logging enabled for probes, which generates so many logs that longer retention isn't practical.

We made some changes to the way the logger works in PR #1461, and also updated the log levels of some log lines.

Please give it a try by running a build from master.

If you notice probe error logs that are logged at debug or info level but should be at warn or error level, please send a PR to fix the log level or add missing logs.
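
For illustration, here is a minimal sketch of the kind of change meant here, assuming a leveled, slog-style logger; the call site, function name, and message are hypothetical and not taken from the code base.

```go
package prober

import "log/slog"

// reportDialFailure is a hypothetical call site: a probe failure that used to
// be logged at debug level (and was therefore invisible in normal operation)
// is raised to error level.
func reportDialFailure(logger *slog.Logger, err error) {
	// Before (illustrative): logger.Debug("error dialing target", "err", err)
	logger.Error("error dialing target", "err", err)
}
```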

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

We allow getting logs, and even added support for fetching logs by target in PR #1063, so users can attach direct links to logs in alerts.

You can use that endpoint or another source to enrich the alert by adding more data to its annotations, for example links to view the logs, or the logs themselves attached directly to the alert payload.

@github-actions github-actions bot removed the stale label Nov 1, 2025
@slrtbtfs
Author

slrtbtfs commented Nov 5, 2025

Thank you for these suggestions!

Using the logs endpoint seems promising.

However, this brings up the following two issues:

  1. (Major) We frequently have multiple prober modules configured for the same target. In that case the endpoint doesn't allow ensuring that we receive the logs for the correct module. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257. If you could give that PR a review, that would be greatly appreciated.
  2. (Minor) There is a potential race condition in this approach. It is theoretically possible that a new probe has completed by the time the alert enricher fetches the logs. I know neither how much of a problem this is in practice, nor what a good solution would be. (Maybe Add metric for Probe ID #1491 would help with both of these issues.)

@electron0zero
Member

  1. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257.

@slrtbtfs #1257 has been merged into main and will be included in the next release.

@slrtbtfs
Author

  1. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257.

@slrtbtfs #1257 has been merged into main and will be included in the next release.

Thanks a lot, I'm closing this PR then.
(maybe #1077 can be closed then, too?)

@slrtbtfs slrtbtfs closed this Nov 26, 2025