Conversation


@slrtbtfs slrtbtfs commented Dec 12, 2024

Fixes #1077.

This makes information about probe failures that is already logged additionally available via the probe_failure_info metric.

The underlying mechanisms are mainly implemented in prober/prober.go and prober/handler.go, the rest of the diff is mostly refactoring.

The main change is adding a new type, ProbeResult, which replaces the boolean return type of probes. For failed probes this type includes diagnostic information, which is then published in the probe_failure_reason metric. If an error message contains information that is very large or may change frequently, leading to high cardinality, it is not included in this metric and is instead only logged.
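
To make the shape of this change concrete, here is a minimal, hypothetical sketch of what such a type and metric could look like. It is not taken from the actual diff; apart from ProbeResult and probe_failure_reason, which are named above, all field, function, and label names are illustrative.

```go
// Hypothetical sketch (not taken from the actual diff): a ProbeResult type
// carrying a success flag plus a short, bounded-cardinality failure reason,
// and a helper that publishes that reason via a labelled gauge.
package prober

import "github.com/prometheus/client_golang/prometheus"

// ProbeResult replaces the plain bool previously returned by probe functions.
type ProbeResult struct {
	Success bool
	// Reason holds a short description of why the probe failed. Large or
	// frequently changing details stay in the logs to keep cardinality low.
	Reason string
}

// Failure builds a failed result with the given reason.
func Failure(reason string) ProbeResult {
	return ProbeResult{Success: false, Reason: reason}
}

// publishFailureReason exposes the failure reason on the per-probe registry
// as a gauge with a "reason" label, set to 1 for the observed reason.
func publishFailureReason(registry *prometheus.Registry, result ProbeResult) {
	if result.Success {
		return
	}
	reasonGauge := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "probe_failure_reason",
		Help: "Reason why the probe failed.",
	}, []string{"reason"})
	registry.MustRegister(reasonGauge)
	reasonGauge.WithLabelValues(result.Reason).Set(1)
}
```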

Since this new type is used in almost every probe function and their tests, the diff got quite big. I hope that doesn't make the review too cumbersome.

As a side effect, most tests now not only check whether a probe succeeded, but also whether a failed probe failed with the correct error message. Having the error message in the return type also made it possible to remove some more complex testing code that used to extract this information from the prober logs.
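
As an illustration of this testing pattern, here is a hedged sketch building on the hypothetical ProbeResult type above; probeUnreachableTarget is a made-up stand-in for invoking one of the real probers, and the expected reason string is arbitrary.

```go
package prober

import "testing"

// probeUnreachableTarget is an illustrative stand-in for running one of the
// real probe functions against a target that is known to be unreachable.
func probeUnreachableTarget(t *testing.T) ProbeResult {
	t.Helper()
	return Failure("connection refused")
}

// With the failure reason in the return value, tests can assert on it
// directly instead of extracting it from the prober logs.
func TestProbeFailsWithExpectedReason(t *testing.T) {
	result := probeUnreachableTarget(t)
	if result.Success {
		t.Fatal("expected probe to fail")
	}
	if want := "connection refused"; result.Reason != want {
		t.Errorf("unexpected failure reason: got %q, want %q", result.Reason, want)
	}
}
```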

Before merging, this should still get some improved tests and polishing; however, I'd like to get some early feedback on whether you agree with the implementation approach in general.

There is still room for improvement, as some error messages provide only limited detail (e.g. the generic "HTTP request failed" one). IMO this makes most sense as a follow-up once the basic implementation of this metric is merged.

@github-actions github-actions bot added the stale label Feb 13, 2025
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch 7 times, most recently from 0d443ed to 274a2f2 Compare February 27, 2025 14:29
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch 2 times, most recently from 8b4dc6f to af44658 Compare April 11, 2025 14:01
@github-actions github-actions bot removed the stale label Apr 14, 2025
slrtbtfs added 16 commits April 28, 2025 15:09
slrtbtfs added 12 commits April 28, 2025 15:12
@slrtbtfs slrtbtfs force-pushed the probe_failure_info branch from 15e8828 to 692d921 Compare April 28, 2025 13:15
@slrtbtfs slrtbtfs marked this pull request as ready for review April 28, 2025 14:50
@slrtbtfs
Author

I've finally got around to finishing up this PR 🎉

Apologies in advance to the reviewers for the size of the diff.

The main reason it got that big is that the new metric tries to capture error messages for every failure path in the code base. Additionally, all tests have been augmented with checks that they fail for the right reason. This might also make it easier to spot future regressions entirely unrelated to this PR.

@github-actions github-actions bot added the stale label Jul 1, 2025
@electron0zero
Member

I see that the probe_failure_info metric is trying to cover something that should be covered by logs, and because of that I don't see a strong need for it. Metrics and logs are suited to different things, and this feels like an area where logs are a better fit than an info metric.

What is the use case this metric is trying to solve, and why can't that use case be solved by logs? If we are missing logs for some failures, I would be happy to accept a PR that improves failure logging.

@slrtbtfs
Author

Hi, thanks for looking into this!

The use case we have for this is the following:

We use one Blackbox instance to monitor targets run by many different customers.

When a probe fails, we send out an alert using Prometheus + Alertmanager to the entity responsible for that particular service.

Currently, the available metrics offer only scarce and incomplete information about why a probe failed, so AFAICT there is no easy way to include such information in a Prometheus-generated alert.

Especially in cases where failures are not consistent, the recipient of the alert needs to create a ticket, and someone with access to the logs (which cover all monitoring targets and shouldn't be available to end users responsible for only a single target) needs to search for the logs correlated with that particular alert.

This is further complicated by the fact that failure reasons for probes are only visible with debug logging enabled for probes, which generates so many logs that longer retention isn't practical.

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

If there were a metric containing some basic failure reasons, all of this could be trivially automated.

@electron0zero
Member

This is further complicated by the fact that failure reasons for probes are only visible with debug logging enabled for probes, which generates so many logs that longer retention isn't practical.

We made some changes to the way the logger works in PR #1461, and also updated the log levels of some log lines.

Please give it a try by running a build from master.

If you notice probe error logs that are logged at debug or info level but should be at warn or error level, please send a PR to fix the log level or add missing logs.
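
For illustration, here is a minimal sketch of the kind of change meant here, assuming a leveled, slog-style logger; the call site, function name, and message are hypothetical and not taken from the code base.

```go
package prober

import "log/slog"

// reportDialFailure is a hypothetical call site: a probe failure that used to
// be logged at debug level (and was therefore invisible in normal operation)
// is raised to error level.
func reportDialFailure(logger *slog.Logger, err error) {
	// Before (illustrative): logger.Debug("error dialing target", "err", err)
	logger.Error("error dialing target", "err", err)
}
```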

Overall, delivering the information contained in the logs to the recipient of a Prometheus alert is a complicated manual process, involving communication between multiple people across teams and organizations.

We allow getting logs, and even added support for fetching logs by target in PR #1063, so users can attach direct links to logs in alerts.

You can use that endpoint or another source to enrich the alert by adding more data to its annotations, for example links to view the logs, or the logs themselves attached directly to the alert payload.

@github-actions github-actions bot removed the stale label Nov 1, 2025
@slrtbtfs
Author

slrtbtfs commented Nov 5, 2025

Thank you for these suggestions!

Using the logs endpoint seems promising.

However, this brings up the following two issues:

  1. (Major) We frequently have multiple prober modules configured for the same target. In that case the endpoint doesn't allow ensuring that we receive the logs for the correct module. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257. If you could give that PR a review, that would be greatly appreciated.
  2. (Minor) There is a potential race condition in this approach. It is theoretically possible that a new probe has completed by the time the alert enricher fetches the logs. I know neither how much of a problem this is in practice, nor what a good solution would be. (Maybe Add metric for Probe ID #1491 would help with both of these issues.)

@electron0zero
Member

  1. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257.

@slrtbtfs #1257 has been merged into main and will be included in the next release.

@slrtbtfs
Author

  1. This issue would be solved by merging Fix retrieving probe logs by target name when probed by multiple modules #1257.

@slrtbtfs #1257 has been merged into main and will be included in the next release.

Thanks a lot, I'm closing this PR then.
(maybe #1077 can be closed then, too?)

@slrtbtfs slrtbtfs closed this Nov 26, 2025