Skip to content

*#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 #10397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

Al2Klimov
Copy link
Member

@Al2Klimov Al2Klimov commented Mar 28, 2025

This PR is the first step in the process of making the shutdown order of our components less unpredictable. Specifically, the goal is to ensure that components that produce any kind of events (e.g. notifications) or check results are stopped in a determined order, and they don't cause any post-stop events to be triggered. The problem this PR specifically addresses is that e.g. stopping the CheckerComponent does prevent new checks from being scheduled, but it does not wait for already started (so can't be cancelled) checks to finish. This can lead to a situation where the e.g. IcingaDB component is stopped before such checks complete and causing all the resulted events to be lost, because IcingaDB is not able to process them anymore. How does this PR solve this?

It introduces a new CheckResultProducer interface that is implemented only by types that actually produce check results, which as of now are CheckerComponent (for local checks) and ApiListener (for remote checks). Each of such instance maintains a simple counter of the number of ongoing checks results it produces. However, they don't just increment the counter after e.g. Checkable::ExecuteCheck() is called, but only after the check execution produces a result and enters the Checkable::ProcessCheckResult() method, and decremented when leaving it. This way, we don't have to worry about the fact that check execution time can be very long and block the shutdown of the component indefinitely (credit goes to @Al2Klimov). Ok, but how does this prevent triggering post-factum events, you ask? Well, the Checkable::ProcessCheckResult requires now the CheckResultProducer be passed in as an argument, and is only allowed to actually process that result after locking the producer, otherwise the cr is discarded.

But what does locking the producer mean? It doesn't mean acquiring the ObjectLock on that object, but rather acquiring a shared lock1 on it. However, here's the catch: the producer is only shared lockable if its Stop() method hasn't been already called. Meaning, whenever CheckerComponent::Stop() is called, any thread that tries to acquire a shared lock on it in Checkable::ProcessCheckResult() will ultimately fail and discard the cr without ever processing it. On the other hand, if the checker was stopped after some threads have successfully acquired a shared lock on it (remember that when a thread acquires a shared lock, the producer automatically increments its counter), it will wait for all of them to finish while preventing any new threads from acquiring the lock until its counter goes back to 0.

The only remaining problem that this PR doesn't solve is that ApiListener::Stop() implementation doesn't gracefully disconnect all active connections, which means that we might still lose some important check results received from remote endpoints. However, since this is a completely different issue, it will be addressed in a separate PR.

Edit (yhabteab) end:

This is the very first and hardest step of the #10179 (comment) plan.

Checkable#ProcessCheckResult() now takes a CheckResultProducer::Ptr:

  • 23c2365 Checkable#ProcessCheckResult(): discard CR or delay its producers shutdown

It tries to shared-lock the CheckResultProducer:

  • Failure means the CheckResultProducer indicates it's already shutting down, so the not yet really started Checkable#ProcessCheckResult() fails and discards the CR
  • Success means Checkable#ProcessCheckResult() can continue and on shutdown CheckResultProducer will wait for that

This implies that every component producing checks needs to implement CheckResultProducer:

  • f86b8ba *#Stop(): wait for own Checkable#ProcessCheckResult()s to finish
  • 84cb9d8 mkclass: inherit from Object like this: class T : virtual public Object

Finally, the CheckResultProducer::Ptr has "just" to be passed through from the CheckResultProducer itself to Checkable#ProcessCheckResult():

Footnotes

  1. The new CheckResultProducer interface satisfies the std::SharedLockable requirements.

@Al2Klimov Al2Klimov self-assigned this Mar 28, 2025
@cla-bot cla-bot bot added the cla/signed label Mar 28, 2025
@julianbrost
Copy link
Contributor

Well, that's indeed necessary.

@julianbrost rather agreed with #10179 (comment), but said #10179 (comment) lacks some tracking.

Interesting that I've said something about a comment I haven't even seen until just now.

@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from ce0d035 to 5b1c4b5 Compare March 31, 2025 14:04
@Al2Klimov Al2Klimov changed the title WIP 🚧 🚧WIP: CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish Mar 31, 2025
@Al2Klimov Al2Klimov requested a review from Copilot March 31, 2025 14:06
Copilot

This comment was marked as spam.

@Al2Klimov
Copy link
Member Author

@julianbrost Very funny. Fact is, on Thursday you

  1. agreed with me that components should stop/join their own checks along with themselves
  2. said that will need some tracking

So I've done that. So far:

  • 684ca7d Now Checkable#ProcessCheckResult() doesn't guess the CR origin, instead it demands a CheckResultProducer barely fitting into a std::shared_lock
  • a8d011f CheckerComponent implements that CheckResultProducer (TODO(@Al2Klimov): more classes)

Finally, CheckResultProducer::Ptr just has to be passed through.

@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch 3 times, most recently from a0ad767 to a06b01a Compare April 3, 2025 09:09
@Al2Klimov Al2Klimov changed the title 🚧WIP: CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish Apr 3, 2025
@Al2Klimov Al2Klimov changed the title CheckerComponent#Stop(): wait for own Checkable#ProcessCheckResult() to finish *#Stop(): wait for own Checkable#ProcessCheckResult() to finish Apr 3, 2025
@Al2Klimov Al2Klimov marked this pull request as ready for review April 3, 2025 10:30
@Al2Klimov Al2Klimov changed the title *#Stop(): wait for own Checkable#ProcessCheckResult() to finish *#Stop(): wait for own Checkable#ProcessCheckResult() to finish🏁 Apr 3, 2025
@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from ea36975 to a06b01a Compare April 7, 2025 09:53
@Al2Klimov Al2Klimov marked this pull request as draft April 7, 2025 10:56
@Al2Klimov Al2Klimov marked this pull request as ready for review April 7, 2025 12:44
@Al2Klimov Al2Klimov removed their assignment Apr 7, 2025
@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from 72a171d to 4873a34 Compare April 7, 2025 12:44
@yhabteab yhabteab added the bug Something isn't working label May 15, 2025
@yhabteab yhabteab added this to the 2.15.0 milestone May 15, 2025
@yhabteab
Copy link
Member

yhabteab commented May 15, 2025

@Al2Klimov please don't be mad at me :)! I took the initiative to expand the PR description slightly based on the analysis made together with @jschmidt-icinga, so everyone can follow here without having to browse through the changed files. Though, you can gladly merge all the commits suffixed with the take CheckResultProducer::Ptr message into one commit with a meaningful commit message. They're basically all the same, just for different classes and clutter the PR unnecessarily and make it appear larger than it is and cryptic.

@jschmidt-icinga
Copy link
Contributor

I think in principle this solves the problem well, but I agree with @yhabteab, kindly squash it down to a single commit describing the functional change (the PR title is fine), as none of the individual changes make sense without all the others. That would also make it much easier to review on GH.

@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from 4873a34 to f0de2aa Compare May 16, 2025 08:43
@Al2Klimov
Copy link
Member Author

Are you sure? I've squashed the low-hanging fruits into 170e7cf – and see how big this one commit became!

@jschmidt-icinga
Copy link
Contributor

jschmidt-icinga commented May 16, 2025

Are you sure? I've squashed the low-hanging fruits into 170e7cf – and see how big this one commit became!

Yes, I'm sure. There's big commits and there's commits with trivial one-line changes in lots of files. You could split off adding the new Classes, maybe adding the final check, but at least all the function signature changes should be in a single commit.

What's more, commits should reflect individual consistent states of the code, i.e., each of them, when checked out, should at least still compile and the program should still work. Granted this is less important when on a branch that gets merged, but its still a good idea to maintain some discipline.

@Al2Klimov
Copy link
Member Author

In this case, before I squash them:

  • Is 65b5a39 good or should the method take CheckResultProducer::Ptr like everyone else, for consistency?
  • Is 0d0fa18 good or should there be only one signature, i.e always take CheckResultProducer::Ptr, for consistency?

@jschmidt-icinga
Copy link
Contributor

jschmidt-icinga commented May 16, 2025

* Is 65b5a39 good or should the method take CheckResultProducer::Ptr like everyone else, for consistency?

Personally, I'd maybe also use the CheckResultProducer:Ptr here, at least from Checkable.hpp down. Less for consistency because then consumers of the Checkable.hpp header don't (necessarily) need to know about ApiListener. The elegance of your solution lies in that the classes processing the check results do not need to directly call back to the producer but get an anonymous handle, we might as well use the anonymous handle everywhere we can.

Then again, Checkable-check.cpp already knows about ApiListener, so it's up to you (from my pov).

* Is 0d0fa18 good or should there be only one signature, i.e always take CheckResultProducer::Ptr, for consistency?

Personally, I think it's reasonable to wrap them in a lambda and add the overload for RegisterCommand, like you did. Which avoids adding dead function parameters to functions that don't need them (which would not be clear from the header). It's less significant here, since these functions aren't called manually, but still, wrapping means less clutter means better readability.

Alternatively you could do something like this:

using ExternalCommandCallback =
	std::variant<void (*)(const CheckResultProducer::Ptr &, double, const std::vector<String> &),
		void (*)(double, const std::vector<String> &)>;

and call it in Execute() with std::visit:

	std::visit([&producer, &time, &realArguments](auto && cb){
		using T = std::decay_t<decltype(cb)>;
		if constexpr (std::is_same_v<T, void (*)(const CheckResultProducer::Ptr &, double, const std::vector<String> &)>) {
			cb(producer, time, realArguments);
		} else if constexpr (std::is_same_v<T, void (*)(double, const std::vector<String> &)>) {
			cb(time, realArguments);
		} else {
			static_assert(false, "Must be one or the other");
		}
	}, eci.Callback);

Then you don't need any overloads or std::functions, even making the whole thing a little more efficient (not that it matters here).

Edit: Fixed std::visit.

Copy link
Contributor

@jschmidt-icinga jschmidt-icinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your commits still don't compile atomically. I suggest the following commit structure:

  1. The entire implementation of CheckResultProducerComponent (currently this is split between two commits for some reason)
  2. Make classes inherit from this component (which I'd like to discuss in the first place) and call Start() and Stop() as appropriate
  3. all function signature and call changes (including unit tests)
  4. Add the final check in ProcessCheckResult()

That way each commit builds upon the previous one. Personally I'd squash it down even more ultimately, but that's as atomic as I'd make it for easy review.

But to start, see below for a first batch of suggestions. I'm happy to discuss all of these and learn more about your reasoning behind your choices.

@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from e56731f to a4d52af Compare May 19, 2025 16:12
@Al2Klimov

This comment was marked as resolved.

Al2Klimov added 6 commits May 20, 2025 12:59
This allows .ti classes multiple inheritance from Object.
Namely:

Checkable#ProcessCheckResult()
ClusterCheckTask::ScriptFunc()
ClusterZoneCheckTask::ScriptFunc()
DummyCheckTask::ScriptFunc()
ExceptionCheckTask::ScriptFunc()
IcingaCheckTask::ScriptFunc()
IfwApiCheckTask::ScriptFunc()
NullCheckTask::ScriptFunc()
PluginCheckTask::ScriptFunc()
RandomCheckTask::ScriptFunc()
SleepCheckTask::ScriptFunc()
IdoCheckTask::ScriptFunc()
IcingadbCheck::ScriptFunc()
CheckCommand#Execute()
Checkable#ExecuteCheck()
ClusterEvents::ExecuteCheckFromQueue()
ExternalCommandProcessor::Process*CheckResult()
ExternalCommandCallback
ExternalCommandProcessor::Execute()
ExternalCommandProcessor::ExecuteFromFile()
ExternalCommandProcessor::ProcessFile()
LivestatusQuery#ExecuteCommandHelper()
LivestatusQuery#Execute()
@Al2Klimov Al2Klimov force-pushed the activation-priority-10179 branch from a4d52af to 45851bc Compare May 20, 2025 15:18
@jschmidt-icinga
Copy link
Contributor

I've had a discussion with @julianbrost and @yhabteab with the following results:

  • We'd like to avoid virtual inheritance by using the new class by composition as a member instead of inheritance. While nobody is strictly against it, we don't see the benefit of using it in this situation over just keeping and passing a class member.
  • The class should then derive from SharedObject instead of Object because it doesn't need any of Object's capabilities, just the ::Ptr.
  • It can be named "StoppableWaitGroup" to reflect the (apparent) similarity with Go WaitGroups, with the difference that a stop can be signaled. We are all satisfied with that name but still open to other suggestions.
  • Further on that note, the initialization done in Start() can be done in the constructor (which basically means just initializing m_InstanceIsActive to true instead of false) and Stop() should be named Wait().
  • A lock_shared() method should be implemented to properly implement the SharedLockable interface, even if it is not used at the moment. As per the spec, it should throw when no lock can be obtained. Internally it can use try_lock_shared() and just throw when that returns false.

@Al2Klimov
Copy link
Member Author

avoid virtual inheritance by using the new class by composition as a member instead of inheritance. While nobody is strictly against it, we don't see the benefit of using it

Benefits of my version (ordered descending by importance):

  • It's cool af 😎
  • It perfectly restricts who can exclusive-lock (inheritors) and who is allowed to shared-lock (everyone) – I mean...

to reflect the (apparent) similarity with Go WaitGroups,

... especially when you talk about Go, remember how context works:

  • context.Context can only query some info and wait
  • only context.Context creator can cancel it

This restriction would vanish here if I got your idea correctly:

  • Getters for all affected component classes
  • A getter returns (a pointer to) the whole thing, so everyone can call everything

(Given #10397 (comment), strange that you wanna give up such a restriction.)

Let me know whether my concern is legit or I shall refactor anyway.

the initialization done in Start() can be done in the constructor (which basically means just initializing m_InstanceIsActive to true instead of false)

Before the component is activated? But why?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants