Downtimes on a removed object are never closed. #910


Open
w1ll-i-code opened this issue Jan 15, 2025 · 12 comments · May be fixed by #913

Comments

@w1ll-i-code

Describe the bug

If an object with a Downtime gets disabled (even just temporarily), the end of the associated Downtime is never written out to the IDO / Icinga DB.

To Reproduce

  1. Create a host in the Director and deploy it.
  2. Create a downtime on the host.
  3. Use the Director to roll back to an older version.
  4. Redeploy the new version.

Expected behavior

I would expect the Downtime to be terminated once the object is deactivated (The actual_end_time set to the current time). But since the downtime is dropped without this field ever being set, the object looks in the reports as if it were in a constant downtime, which does not correspond to the internal state of Icinga 2.

Screenshots

(screenshot attached)

@w1ll-i-code
Author

Here is my proposed solution: Whenever an object gets removed, all the currently active downtimes get closed as well.

@w1ll-i-code
Author

I am willing to implement the change myself, but I'd like to coordinate with you first to make sure my proposed solution is the right approach. Since the downtimes are dropped from the icinga2.state file afterwards, this seems like the most reasonable solution to me. I'd prefer it if the downtimes persisted through deploys, but that would be a more invasive change that I don't feel comfortable implementing myself.

@yhabteab
Member

I would expect the Downtime to be terminated once the object is deactivated (The actual_end_time set to the current time)

There is no such thing as deactivating a downtime when a new version of the configuration is deployed via Icinga Director. When the host the downtimes belong to does not exist in the newly deployed configuration, the downtimes become dangling objects that Icinga 2 cannot map to their respective host/service, and they will not even survive config validation. However, since they are created with the ignore_on_error flag, they will not stop Icinga 2 from loading the rest of the configuration, and once Icinga 2 is done loading/validating it, it will simply erase them from disk.

Here is my proposed solution: Whenever an object gets removed, all the currently active downtimes get closed as well.

If you don't mind wasting time on something that can't be fixed, then go ahead, but bear in mind that this is simply impossible to fix right now. Once the corresponding downtime host/service object is gone, the downtime object itself becomes pretty much useless and is not even a valid object anymore. If you don't want such strange history views, I suggest manually clearing the downtimes before removing the host/service object via Icinga Director.
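
For what it's worth, such a manual cleanup can be scripted against the /v1/actions/remove-downtime API action before triggering the Director deployment. A minimal Go sketch, assuming a local instance with a self-signed certificate, an API user root/icinga, and a hypothetical host named H1 (adjust the filter, credentials and TLS handling to your setup):

```go
package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// Remove the host downtimes of the (hypothetical) host "H1" before deleting
	// the host in Icinga Director, using the remove-downtime API action.
	payload := []byte(`{"type": "Host", "filter": "host.name == \"H1\""}`)

	req, err := http.NewRequest(http.MethodPost,
		"https://localhost:5665/v1/actions/remove-downtime", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("root", "icinga")           // assumed API user credentials
	req.Header.Set("Accept", "application/json") // the Icinga 2 API requires this header

	// Self-signed certificate in a lab setup; don't skip verification in production.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("remove-downtime returned", resp.Status)
}
```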

@w1ll-i-code
Author

If you don't mind wasting time on something that can't be fixed, then go ahead, but bear in mind that this is simply impossible to fix right now.

I already wasted that time and implemented my solution. It seems to work for MariaDB/MySQL, but I still need to test it with PostgreSQL and Icinga DB. I'll probably have to do a second take to make it completely correct.

it will simply erase them from disk.

I am well aware of that. That's the problem we are currently facing. It happens often, but randomly enough that cleaning it up manually for all objects that may be affected is not feasible. Mostly we notice it once the SLA uptime report is generated and a host is completely out of bounds because the downtime was not handled correctly. If we trigger an OnDowntimeRemoved before the downtime gets erased from disk, that solution already works for us.

@w1ll-i-code
Author

The logic I am thinking of is this:

  1. The configuration for the object gets removed; it is no longer active.
  2. The object still exists in the icinga2.state file together with the downtime.
  3. The config gets loaded and the object gets set to inactive.
  4. The inactive object gets synced to the IDO.
    1. Here I propose to also trigger the OnDowntimeRemoved hook for each downtime associated with the host.
  5. The host and downtime are now inactive and will no longer get synced to the icinga2.state file on disk. (Or just the host, I'm not sure, but the effect is the same.)

Let me know if there are any holes in my understanding here, but from what I can observe right now, this is what's happening. A rough illustration of step 4.1 is sketched below.
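
This is not actual Icinga 2 code (that part of the code base is C++), just a small, self-contained Go illustration of what step 4.1 is meant to do; every type and function name here is made up for the example:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical types for illustration only; they do not mirror Icinga 2's internals.
type Downtime struct {
	Name          string
	ActualEndTime time.Time // zero value means the downtime was never closed
}

type Host struct {
	Name      string
	Active    bool
	Downtimes []*Downtime
}

// onDowntimeRemoved stands in for the event that the IDO / Icinga DB feature
// would turn into a history entry closing the downtime.
func onDowntimeRemoved(d *Downtime) {
	fmt.Printf("downtime %q ended at %s\n", d.Name, d.ActualEndTime.Format(time.RFC3339))
}

// Deactivate is the proposed hook from step 4.1: before the host and its
// downtimes are dropped from icinga2.state, close every downtime that is
// still open so the end event reaches the history.
func (h *Host) Deactivate() {
	h.Active = false
	for _, d := range h.Downtimes {
		if d.ActualEndTime.IsZero() {
			d.ActualEndTime = time.Now()
			onDowntimeRemoved(d)
		}
	}
}

func main() {
	h := &Host{Name: "example-host", Active: true, Downtimes: []*Downtime{{Name: "maintenance"}}}
	h.Deactivate() // the config reload removed the host; its open downtime gets closed first
}
```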

@Al2Klimov
Member

If an object with a Downtime gets disabled (even just temporarily), the end of the associated Downtime is never written out to the IDO / Icinga DB.

I doubt this, as (IIRC) Icinga DB syncs the correct state every time.

@w1ll-i-code
Author

(screenshot attached)

Doing a quick test, it does not look like it closed it correctly. I added a downtime of a few minutes to both objects and deleted the object on the left before the downtime could run out. As you can see, the object on the right has the end of the downtime in its history, while the object on the left does not. The linked PR above resolves that issue by handling the ending of the downtime during the object's deactivation.

@w1ll-i-code
Author

w1ll-i-code commented Feb 10, 2025

@Al2Klimov Hi, are there any updates on this issue?

@yhabteab
Member

Hi @w1ll-i-code, we had an internal discussion about how we can fix this and came to the conclusion that it can only be fixed in Icinga DB (Go), as there is no way in Icinga 2 as of today to fix it, as noted in Icinga/icinga2#10311 (comment). We will try to fake the corresponding end event or do something else when removing the downtime configuration from the database, so I will move this to the Icinga DB repo and close Icinga/icinga2#10311 if you don't mind.

yhabteab transferred this issue from Icinga/icinga2 on Mar 27, 2025
@w1ll-i-code
Author

I'm sorry, I really did not understand what you were talking about in that comment. But as long as it gets fixed, it's fine by me.

@yhabteab
Member

I'm sorry, I really did not understand what you were talking about in that comment.

You didn't give any reaction to that comment, so I assumed that you had understood why it's not possible to fix this on the Icinga 2 side. But generally, if I haven't explained something well enough, just say so and I'll be happy to explain it in more detail.

Correct me if I'm wrong, but the problem you're experiencing looks like this:

You have a host object named H1 created via Icinga Director, which essentially uses the /v1/config/packages endpoint internally, meaning that every time you trigger an Icinga Director deployment, it dumps all the Icinga 2 config from the Director DB into Icinga 2 via that API endpoint. As you will see from the linked documentation, a request to this endpoint automatically triggers an Icinga 2 daemon reload unless otherwise specified via the reload: false parameter, which Icinga Director doesn't use. What does that mean for Icinga 2? Every time Icinga 2 is reloaded, it starts a new process with its own config, which might not be the same as the one used by the old process.

Now, let's go back to your issue: if your host H1 was in downtime before you deleted it from Icinga Director and triggered a deployment, Icinga Director will dump your config without H1 and its downtimes, because they have just been deleted. The new process that takes over after the reload has no knowledge that this host and its downtimes ever existed, so it can't trigger the events for these objects that would allow them to be properly deleted by IDO and Icinga DB.

Instead, once it has taken over and the old process is terminated, it dumps its freshly loaded configuration into Redis, which is then processed by the Icinga DB (Go) daemon. The Icinga DB (Go) daemon inspects the configuration received from Redis and the one in the database, and removes any objects that are no longer part of the objects read from Redis. This means it won't receive any events for H1 or its downtimes from Redis, so it removes them from the database. While performing these checks, we can hook in there and easily check whether the object about to be deleted from the database is a downtime configuration; if it is, we can fake the corresponding end/cancelled events and insert them into the history tables, which is what #913 does. This ensures that your history view in Icinga DB Web won't show any unclosed downtimes when you recreate the exact same host H1 again. However, this will only resolve the issue for Icinga DB Web; there's nothing we can do for IDO.
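
To make that concrete, here is a rough, purely illustrative Go sketch of such a hook. The types and the pruning function are invented for this example and do not reflect the actual Icinga DB code, schema, or the implementation in #913:

```go
// Hypothetical sketch of the config-delta hook; not the actual Icinga DB code or schema.
package main

import (
	"fmt"
	"time"
)

type DowntimeRow struct {
	ID      string
	HostID  string
	EndTime *time.Time // nil while the downtime is still considered active
}

type HistoryEvent struct {
	DowntimeID string
	EventType  string // e.g. "downtime_end" or "downtime_cancelled"
	EventTime  time.Time
}

// pruneDeletedDowntimes mimics the config sync: every downtime present in the
// database but missing from the Redis dump is removed, and a faked end event
// is written to the history first so the downtime doesn't stay open forever.
func pruneDeletedDowntimes(inDB []DowntimeRow, inRedis map[string]bool, history *[]HistoryEvent) []DowntimeRow {
	var kept []DowntimeRow
	for _, d := range inDB {
		if inRedis[d.ID] {
			kept = append(kept, d)
			continue
		}
		if d.EndTime == nil {
			*history = append(*history, HistoryEvent{
				DowntimeID: d.ID,
				EventType:  "downtime_cancelled",
				EventTime:  time.Now(),
			})
		}
		// the downtime config row itself is dropped, exactly as before
	}
	return kept
}

func main() {
	var history []HistoryEvent
	db := []DowntimeRow{{ID: "dt-1", HostID: "H1"}}
	redis := map[string]bool{} // H1 and its downtime are gone after the deploy

	db = pruneDeletedDowntimes(db, redis, &history)
	fmt.Println(len(db), "downtimes kept,", len(history), "history events faked")
}
```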

I hope it's now clear what I was talking about and what this is all about.

@w1ll-i-code
Author

Yeah, sorry about that. I was planning to dig deeper into it, and then it fell completely off my radar because I forgot to put it in my todos. I think I understand now, thanks.
