*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

Al2Klimov · 2025-03-28T17:09:07Z

This PR is the first step in the process of making the shutdown order of our components less unpredictable. Specifically, the goal is to ensure that components that produce any kind of events (e.g. notifications) or check results are stopped in a determined order, and they don't cause any post-stop events to be triggered. The problem this PR specifically addresses is that e.g. stopping the CheckerComponent does prevent new checks from being scheduled, but it does not wait for already started (so can't be cancelled) checks to finish. This can lead to a situation where the e.g. IcingaDB component is stopped before such checks complete and causing all the resulted events to be lost, because IcingaDB is not able to process them anymore. How does this PR solve this?

It introduces a new CheckResultProducer interface that is implemented only by types that actually produce check results, which as of now are CheckerComponent (for local checks) and ApiListener (for remote checks). Each of such instance maintains a simple counter of the number of ongoing checks results it produces. However, they don't just increment the counter after e.g. Checkable::ExecuteCheck() is called, but only after the check execution produces a result and enters the Checkable::ProcessCheckResult() method, and decremented when leaving it. This way, we don't have to worry about the fact that check execution time can be very long and block the shutdown of the component indefinitely (credit goes to @Al2Klimov). Ok, but how does this prevent triggering post-factum events, you ask? Well, the Checkable::ProcessCheckResult requires now the CheckResultProducer be passed in as an argument, and is only allowed to actually process that result after locking the producer, otherwise the cr is discarded.

But what does locking the producer mean? It doesn't mean acquiring the ObjectLock on that object, but rather acquiring a shared lock¹ on it. However, here's the catch: the producer is only shared lockable if its Stop() method hasn't been already called. Meaning, whenever CheckerComponent::Stop() is called, any thread that tries to acquire a shared lock on it in Checkable::ProcessCheckResult() will ultimately fail and discard the cr without ever processing it. On the other hand, if the checker was stopped after some threads have successfully acquired a shared lock on it (remember that when a thread acquires a shared lock, the producer automatically increments its counter), it will wait for all of them to finish while preventing any new threads from acquiring the lock until its counter goes back to 0.

The only remaining problem that this PR doesn't solve is that ApiListener::Stop() implementation doesn't gracefully disconnect all active connections, which means that we might still lose some important check results received from remote endpoints. However, since this is a completely different issue, it will be addressed in a separate PR.

Edit (yhabteab) end:

This is the very first and hardest step of the #10179 (comment) plan.

Checkable#ProcessCheckResult() now takes a CheckResultProducer::Ptr:

23c2365 Checkable#ProcessCheckResult(): discard CR or delay its producers shutdown

It tries to shared-lock the CheckResultProducer:

Failure means the CheckResultProducer indicates it's already shutting down, so the not yet really started Checkable#ProcessCheckResult() fails and discards the CR
Success means Checkable#ProcessCheckResult() can continue and on shutdown CheckResultProducer will wait for that

This implies that every component producing checks needs to implement CheckResultProducer:

f86b8ba *#Stop(): wait for own Checkable#ProcessCheckResult()s to finish
84cb9d8 mkclass: inherit from Object like this: class T : virtual public Object

Finally, the CheckResultProducer::Ptr has "just" to be passed through from the CheckResultProducer itself to Checkable#ProcessCheckResult():

The new CheckResultProducer interface satisfies the std::SharedLockable requirements. ↩

julianbrost · 2025-03-31T07:45:20Z

TODO(@Al2Klimov): better PR description

Well, that's indeed necessary.

@julianbrost rather agreed with #10179 (comment), but said #10179 (comment) lacks some tracking.

Interesting that I've said something about a comment I haven't even seen until just now.

Al2Klimov · 2025-03-31T16:02:51Z

@julianbrost Very funny. Fact is, on Thursday you

agreed with me that components should stop/join their own checks along with themselves
said that will need some tracking

So I've done that. So far:

684ca7d Now Checkable#ProcessCheckResult() doesn't guess the CR origin, instead it demands a CheckResultProducer barely fitting into a std::shared_lock
a8d011f CheckerComponent implements that CheckResultProducer (TODO(@Al2Klimov): more classes)

Finally, CheckResultProducer::Ptr just has to be passed through.

yhabteab · 2025-05-15T16:56:32Z

@Al2Klimov please don't be mad at me :)! I took the initiative to expand the PR description slightly based on the analysis made together with @jschmidt-icinga, so everyone can follow here without having to browse through the changed files. Though, you can gladly merge all the commits suffixed with the take CheckResultProducer::Ptr message into one commit with a meaningful commit message. They're basically all the same, just for different classes and clutter the PR unnecessarily and make it appear larger than it is and cryptic.

jschmidt-icinga · 2025-05-16T07:09:32Z

I think in principle this solves the problem well, but I agree with @yhabteab, kindly squash it down to a single commit describing the functional change (the PR title is fine), as none of the individual changes make sense without all the others. That would also make it much easier to review on GH.

Al2Klimov · 2025-05-16T08:45:48Z

Are you sure? I've squashed the low-hanging fruits into 170e7cf – and see how big this one commit became!

jschmidt-icinga · 2025-05-16T10:30:17Z

Are you sure? I've squashed the low-hanging fruits into 170e7cf – and see how big this one commit became!

Yes, I'm sure. There's big commits and there's commits with trivial one-line changes in lots of files. You could split off adding the new Classes, maybe adding the final check, but at least all the function signature changes should be in a single commit.

What's more, commits should reflect individual consistent states of the code, i.e., each of them, when checked out, should at least still compile and the program should still work. Granted this is less important when on a branch that gets merged, but its still a good idea to maintain some discipline.

Al2Klimov · 2025-05-16T10:39:18Z

In this case, before I squash them:

Is 65b5a39 good or should the method take CheckResultProducer::Ptr like everyone else, for consistency?
Is 0d0fa18 good or should there be only one signature, i.e always take CheckResultProducer::Ptr, for consistency?

jschmidt-icinga · 2025-05-16T12:59:38Z

* Is 65b5a39 good or should the method take CheckResultProducer::Ptr like everyone else, for consistency?

Personally, I'd maybe also use the CheckResultProducer:Ptr here, at least from Checkable.hpp down. Less for consistency because then consumers of the Checkable.hpp header don't (necessarily) need to know about ApiListener. The elegance of your solution lies in that the classes processing the check results do not need to directly call back to the producer but get an anonymous handle, we might as well use the anonymous handle everywhere we can.

Then again, Checkable-check.cpp already knows about ApiListener, so it's up to you (from my pov).

* Is 0d0fa18 good or should there be only one signature, i.e always take CheckResultProducer::Ptr, for consistency?

Personally, I think it's reasonable to wrap them in a lambda and add the overload for RegisterCommand, like you did. Which avoids adding dead function parameters to functions that don't need them (which would not be clear from the header). It's less significant here, since these functions aren't called manually, but still, wrapping means less clutter means better readability.

Alternatively you could do something like this:

using ExternalCommandCallback =
	std::variant<void (*)(const CheckResultProducer::Ptr &, double, const std::vector<String> &),
		void (*)(double, const std::vector<String> &)>;

and call it in Execute() with std::visit:

	std::visit([&producer, &time, &realArguments](auto && cb){
		using T = std::decay_t<decltype(cb)>;
		if constexpr (std::is_same_v<T, void (*)(const CheckResultProducer::Ptr &, double, const std::vector<String> &)>) {
			cb(producer, time, realArguments);
		} else if constexpr (std::is_same_v<T, void (*)(double, const std::vector<String> &)>) {
			cb(time, realArguments);
		} else {
			static_assert(false, "Must be one or the other");
		}
	}, eci.Callback);

Then you don't need any overloads or std::functions, even making the whole thing a little more efficient (not that it matters here).

Edit: Fixed std::visit.

jschmidt-icinga

Your commits still don't compile atomically. I suggest the following commit structure:

The entire implementation of CheckResultProducerComponent (currently this is split between two commits for some reason)
Make classes inherit from this component (which I'd like to discuss in the first place) and call Start() and Stop() as appropriate
all function signature and call changes (including unit tests)
Add the final check in ProcessCheckResult()

That way each commit builds upon the previous one. Personally I'd squash it down even more ultimately, but that's as atomic as I'd make it for easy review.

But to start, see below for a first batch of suggestions. I'm happy to discuss all of these and learn more about your reasoning behind your choices.

lib/remote/crproducer.hpp

lib/icinga/checkable-check.cpp

tools/mkclass/classcompiler.cpp

lib/remote/crproducer.cpp

lib/remote/crproducer.hpp

lib/icinga/externalcommandprocessor.cpp

This allows .ti classes multiple inheritance from Object.

…s in tests

Namely: Checkable#ProcessCheckResult() ClusterCheckTask::ScriptFunc() ClusterZoneCheckTask::ScriptFunc() DummyCheckTask::ScriptFunc() ExceptionCheckTask::ScriptFunc() IcingaCheckTask::ScriptFunc() IfwApiCheckTask::ScriptFunc() NullCheckTask::ScriptFunc() PluginCheckTask::ScriptFunc() RandomCheckTask::ScriptFunc() SleepCheckTask::ScriptFunc() IdoCheckTask::ScriptFunc() IcingadbCheck::ScriptFunc() CheckCommand#Execute() Checkable#ExecuteCheck() ClusterEvents::ExecuteCheckFromQueue() ExternalCommandProcessor::Process*CheckResult() ExternalCommandCallback ExternalCommandProcessor::Execute() ExternalCommandProcessor::ExecuteFromFile() ExternalCommandProcessor::ProcessFile() LivestatusQuery#ExecuteCommandHelper() LivestatusQuery#Execute()

…tdown

jschmidt-icinga · 2025-05-21T14:00:33Z

I've had a discussion with @julianbrost and @yhabteab with the following results:

We'd like to avoid virtual inheritance by using the new class by composition as a member instead of inheritance. While nobody is strictly against it, we don't see the benefit of using it in this situation over just keeping and passing a class member.
The class should then derive from SharedObject instead of Object because it doesn't need any of Object's capabilities, just the ::Ptr.
It can be named "StoppableWaitGroup" to reflect the (apparent) similarity with Go WaitGroups, with the difference that a stop can be signaled. We are all satisfied with that name but still open to other suggestions.
Further on that note, the initialization done in Start() can be done in the constructor (which basically means just initializing m_InstanceIsActive to true instead of false) and Stop() should be named Wait().
A lock_shared() method should be implemented to properly implement the SharedLockable interface, even if it is not used at the moment. As per the spec, it should throw when no lock can be obtained. Internally it can use try_lock_shared() and just throw when that returns false.

Al2Klimov · 2025-05-21T14:57:23Z

avoid virtual inheritance by using the new class by composition as a member instead of inheritance. While nobody is strictly against it, we don't see the benefit of using it

Benefits of my version (ordered descending by importance):

It's cool af 😎
It perfectly restricts who can exclusive-lock (inheritors) and who is allowed to shared-lock (everyone) – I mean...

to reflect the (apparent) similarity with Go WaitGroups,

... especially when you talk about Go, remember how context works:

context.Context can only query some info and wait
only context.Context creator can cancel it

This restriction would vanish here if I got your idea correctly:

Getters for all affected component classes
A getter returns (a pointer to) the whole thing, so everyone can call everything

(Given #10397 (comment), strange that you wanna give up such a restriction.)

Let me know whether my concern is legit or I shall refactor anyway.

the initialization done in Start() can be done in the constructor (which basically means just initializing m_InstanceIsActive to true instead of false)

Before the component is activated? But why?

Al2Klimov requested review from julianbrost and yhabteab March 28, 2025 17:09

Al2Klimov self-assigned this Mar 28, 2025

cla-bot bot added the cla/signed label Mar 28, 2025

Al2Klimov force-pushed the activation-priority-10179 branch from ce0d035 to 5b1c4b5 Compare March 31, 2025 14:04

Al2Klimov changed the title ~~WIP 🚧~~ 🚧WIP: CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish Mar 31, 2025

Al2Klimov requested a review from Copilot March 31, 2025 14:06

This comment was marked as spam.

Sign in to view

Al2Klimov force-pushed the activation-priority-10179 branch 3 times, most recently from a0ad767 to a06b01a Compare April 3, 2025 09:09

Al2Klimov changed the title ~~🚧WIP: CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish~~ CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish Apr 3, 2025

Al2Klimov changed the title ~~CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish~~ *#Stop(): wait for own Checkable#ProcessCheckResult() to finish Apr 3, 2025

Al2Klimov marked this pull request as ready for review April 3, 2025 10:30

Al2Klimov changed the title *#Stop(): wait for own Checkable#ProcessCheckResult() to finish *#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 Apr 3, 2025

Al2Klimov force-pushed the activation-priority-10179 branch from ea36975 to a06b01a Compare April 7, 2025 09:53

Al2Klimov marked this pull request as draft April 7, 2025 10:56

Al2Klimov marked this pull request as ready for review April 7, 2025 12:44

Al2Klimov removed their assignment Apr 7, 2025

Al2Klimov force-pushed the activation-priority-10179 branch from 72a171d to 4873a34 Compare April 7, 2025 12:44

Al2Klimov mentioned this pull request Apr 17, 2025

Remove deprecated features: statusdata, compatlog, checkresultfilereader, commandpipe #6307

Open

yhabteab added the bug Something isn't working label May 15, 2025

yhabteab added this to the 2.15.0 milestone May 15, 2025

Al2Klimov force-pushed the activation-priority-10179 branch from 4873a34 to f0de2aa Compare May 16, 2025 08:43

yhabteab mentioned this pull request May 16, 2025

Allow to define both [de]activation_priority for a given type #10385

Closed

Al2Klimov force-pushed the activation-priority-10179 branch from f0de2aa to e56731f Compare May 16, 2025 15:36

jschmidt-icinga reviewed May 19, 2025

View reviewed changes

Al2Klimov force-pushed the activation-priority-10179 branch from e56731f to a4d52af Compare May 19, 2025 16:12

This comment was marked as resolved.

Sign in to view

Al2Klimov mentioned this pull request May 20, 2025

Remove unused ProcessingResult::NoCheckResult #10444

Merged

Al2Klimov added 6 commits May 20, 2025 12:59

mkclass: inherit from Object like this: class T : virtual public Object

fff9c58

This allows .ti classes multiple inheritance from Object.

Introduce CheckResultProducerComponent

61d98ad

Introduce UnitTestCRP to unconditionally allow processing CheckResult…

d672039

…s in tests

*#Stop(): wait for own Checkable#ProcessCheckResult()s to finish

d9481fd

Checkable#ProcessCheckResult(): discard CR or delay its producers shu…

45851bc

…tdown

Al2Klimov force-pushed the activation-priority-10179 branch from a4d52af to 45851bc Compare May 20, 2025 15:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

Al2Klimov commented Mar 28, 2025 •

edited by yhabteab

Loading

julianbrost commented Mar 31, 2025

This comment was marked as spam.

Al2Klimov commented Mar 31, 2025

yhabteab commented May 15, 2025 •

edited

Loading

jschmidt-icinga commented May 16, 2025

Al2Klimov commented May 16, 2025

jschmidt-icinga commented May 16, 2025 •

edited

Loading

Al2Klimov commented May 16, 2025

jschmidt-icinga commented May 16, 2025 •

edited

Loading

jschmidt-icinga left a comment

This comment was marked as resolved.

jschmidt-icinga commented May 21, 2025

Al2Klimov commented May 21, 2025

*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

Are you sure you want to change the base?

*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

Conversation

Al2Klimov commented Mar 28, 2025 • edited by yhabteab Loading

Edit (yhabteab) end:

Footnotes

julianbrost commented Mar 31, 2025

This comment was marked as spam.

Al2Klimov commented Mar 31, 2025

yhabteab commented May 15, 2025 • edited Loading

jschmidt-icinga commented May 16, 2025

Al2Klimov commented May 16, 2025

jschmidt-icinga commented May 16, 2025 • edited Loading

Al2Klimov commented May 16, 2025

jschmidt-icinga commented May 16, 2025 • edited Loading

jschmidt-icinga left a comment

Choose a reason for hiding this comment

This comment was marked as resolved.

jschmidt-icinga commented May 21, 2025

Al2Klimov commented May 21, 2025

Al2Klimov commented Mar 28, 2025 •

edited by yhabteab

Loading

yhabteab commented May 15, 2025 •

edited

Loading

jschmidt-icinga commented May 16, 2025 •

edited

Loading

jschmidt-icinga commented May 16, 2025 •

edited

Loading