[Prometheus Metrics] Ngf config reload metrics are not available #3682

michasHL · 2025-08-01T17:48:22Z


Before opening an issue, I wanted to start with a discussion. Overnight we had one of our clusters where NGF is deployed get into a wonky state where the Gateway could not apply a config because a file was not found.

Error:
msg: Config apply failed, rolling back config; error: failed to parse config open /etc/nginx/includes/ClientSettingsPolicy_xyz-ngf-default-gateway-max-body.conf: no such file or directory in /etc/nginx/conf.d/http.conf:143

This happened around a deployment (using a Helm chart) where we stand up a completely new instance of our application side-by-side to run tests against it, including an HTTPRoute and a ClientSettingsPolicy (to set max body size). Once the test is complete, we delete the Helm chart and all the temporary resources like the route and the policy. It seems that under some circumstances, deleting the policy meant the policy file (/etc/nginx/includes/ClientSettingsPolicy_xyz-ngf-default-gateway-max-body.conf) was deleted before /etc/nginx/conf.d/http.conf could be updated, which produced the error above. After that, no further updates to NGINX could be applied: new pods coming up were not added, and old pods were still in the config.

I believe for this issue I should open a ticket but please correct me if I'm wrong.

Of course, the title mentions metrics and so to see if we could get automated alerts working, we're looking at the documentation on your website of what metrics should be available: https://docs.nginx.com/nginx-gateway-fabric/how-to/monitoring/prometheus/#nginx-gateway-fabric-metrics

The document lists:

nginx_reloads_total: Counts successful NGINX reloads.
nginx_reload_errors_total: Counts NGINX reload failures.
nginx_stale_config: Indicates if NGINX Gateway Fabric couldn’t update NGINX with the latest configuration, resulting in a stale version.
nginx_reloads_milliseconds: Time in milliseconds for NGINX reloads.
event_batch_processing_milliseconds: Time in milliseconds to process batches of Kubernetes events.

Unfortunately, when checking our Prometheus instance, we only see event_batch_processing_milliseconds coming through. I would like to understand why the nginx_reload_errors_total and nginx_stale_config metrics are not available. It would be great if we could use those metrics to see when we're running into the same problem.

bjee19 · 2025-08-04T00:45:05Z

Collaborator

Hey @michasHL, thanks for opening this discussion!

So sorry, the official document link you found is somehow an outdated version that should not be publicly accessible, and we're working on fixing this. In our recent 2.0 release, the Prometheus metrics were updated because we no longer use the NGINX Prometheus Exporter. This is the current document: https://docs.nginx.com/nginx-gateway-fabric/monitoring/prometheus/#available-metrics-in-nginx-gateway-fabric.

Please let me know if you have any questions on the metrics listed in the document I linked.

Regarding the issue you described, yea that seems weird and if you could open an issue/bug describing it that would be greatly appreciated.

> Once the test is complete, we delete the helm chart and all the temporary resources like the route and the policy.

By deleting the helm chart, do you mean deleting NGF or is this a helm chart for your application?

5 replies

michasHL Aug 4, 2025
Author

Thank you @bjee19 for your reply.

Yeah, I meant the helm chart of our application that includes the temporary resources of HttpRoute & a ClientSettingsPolicy to set maxBodySize.

I will open an issue and try to put some more details in it.

Regarding the metric discussion at hand, thanks for sending the official list of currently exported metrics. Unfortunately, I don't see a metric there that would catch config reload issues like the one we encountered that day.

Do you, or anyone on your team have any suggestion on how to alert on such cases? I'm currently thinking of turning a log into metric - which seems less than elegant.
As it stands, we had to rely on metrics from our load balancer, which indicated that one of our regions was no longer 100% up and running.
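A minimal sketch of the log-to-metric idea mentioned above, assuming the NGF log lines are available as plain text and that the substring "Config apply failed" (taken from the error quoted at the top of this thread) reliably marks a failed apply. The metric name `ngf_config_apply_failures_total` is hypothetical and is not something NGF exports; the output follows the Prometheus text exposition format so it could be served or written to a file by whatever collector is already in place.

```python
import re
from typing import Iterable

# Substring copied from the error message quoted earlier in this thread.
# Assumption: every failed config apply logs a line containing it.
FAILURE_PATTERN = re.compile(r"Config apply failed")


def count_apply_failures(lines: Iterable[str]) -> int:
    """Count log lines that report a failed config apply."""
    return sum(1 for line in lines if FAILURE_PATTERN.search(line))


def render_metric(count: int, name: str = "ngf_config_apply_failures_total") -> str:
    """Render the count as a counter in Prometheus text exposition format.

    The default metric name is hypothetical, chosen for illustration only.
    """
    return f"# TYPE {name} counter\n{name} {count}\n"
```

Most log pipelines can do the same matching natively, so a script like this is mainly useful for checking that the pattern catches the lines you expect before wiring up an alert.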

We're currently in the process of migrating our applications from your ingress controller to NGF. This incident gave us a little pause, though, since a deployment of any one service could potentially cause downtime for all of our services behind the same NGF/Gateway instance.

Happy to provide more information if needed.

bjee19 Aug 4, 2025
Collaborator

@michasHL

> Yeah, I meant the helm chart of our application that includes the temporary resources of HttpRoute & a ClientSettingsPolicy to set maxBodySize.
>
> I will open an issue and try to put some more details in it.

Ok that makes more sense, thanks!

> Do you, or anyone on your team have any suggestion on how to alert on such cases? I'm currently thinking of turning a log into metric - which seems less than elegant.

Unfortunately, I'm not too sure, and this seems like it may be a limitation on our end. Let me consult with the team and get back to you with a response (Tuesday, since the team is out Monday).

michasHL Aug 4, 2025
Author

I appreciate it, thank you.

sjberman Aug 5, 2025
Maintainer

If nginx fails to reload due to a config error, we will write a status message on the Gateway resource, specifically for the Programmed condition. Unfortunately, there are no Prometheus metrics for this anymore, since after our change in architecture we no longer use the same library as before.

bjee19 Aug 5, 2025
Collaborator

@michasHL as @sjberman pointed out, there are currently no Prometheus metrics for this, but we do write a status message on the Gateway resource, which could serve as a workaround instead of turning a log into a metric. We don't currently provide a solution for this case, but perhaps you can use the Programmed condition to write some automation to monitor it.
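A minimal sketch of the kind of automation described above, assuming the Gateway object has already been fetched as JSON (for example with `kubectl get gateway <name> -n <namespace> -o json`) and parsed into a dict. It relies only on the standard Gateway API layout of `status.conditions`, where each entry carries `type`, `status`, `reason` and `message`. The function names are hypothetical, and which `reason` and `message` NGF writes on a failed reload is not specified in this thread, so the code only checks whether the condition's status is "True".

```python
from typing import Any, Dict, Optional


def programmed_condition(gateway: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the Programmed condition from a Gateway object, if present."""
    for cond in gateway.get("status", {}).get("conditions", []):
        if cond.get("type") == "Programmed":
            return cond
    return None


def gateway_alert(gateway: Dict[str, Any]) -> Optional[str]:
    """Return an alert message when the Gateway is not Programmed, else None."""
    name = gateway.get("metadata", {}).get("name", "<unknown>")
    cond = programmed_condition(gateway)
    if cond is None:
        # A missing condition is treated as alert-worthy in this sketch.
        return f"Gateway {name}: no Programmed condition reported"
    if cond.get("status") != "True":
        return (
            f"Gateway {name}: Programmed={cond.get('status')} "
            f"reason={cond.get('reason')} message={cond.get('message')}"
        )
    return None
```

Run on a schedule (a CronJob, for instance), a non-None result could be forwarded to whatever alerting channel is already in use. Polling is a design choice made here for brevity; a watch on the Gateway resource would react faster.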

mpstefan · 2025-08-05T19:28:52Z

Maintainer

Reading the discussion above... I'm curious as to why you might want an alert when a config reload fails. Wouldn't the engineer pushing the change verify their change took effect and would thus always see when the config reload failed? Perhaps there's some other action or scenario I'm missing that you would want the alert for?

1 reply

michasHL Aug 5, 2025
Author

> Reading the discussion above... I'm curious as to why you might want an alert when a config reload fails. Wouldn't the engineer pushing the change verify their change took effect and would thus always see when the config reload failed? Perhaps there's some other action or scenario I'm missing that you would want the alert for?

Thanks for the question and for the traction here.

Without going into too much detail here is how we deploy things:

- We have an infrastructure pipeline that sets up our Kubernetes cluster and deploys certain Helm charts into that cluster (Prometheus, NGINX Gateway Fabric, etc.)
- We have applications that are deployed into said cluster using Helm charts (Svc, Deployment, HTTPRoute, ClientSettingsPolicy, etc.)

Applications in our Dev environment (where we encountered the issue) are shipped continuously from our main branch. There are usually no engineers monitoring deployments into Dev. Even if there were, I don't think that would have caught the issue, because most of our application developers would not monitor the Gateway resource, as it's shared across a few applications. Watching the application pipeline also would not have shown any of those issues, because the Helm chart does not wait or fail as long as the HTTPRoute and ClientSettingsPolicy are removed successfully.

The removal however brought the shared Gateway resource into a bad state.

I still believe that monitoring for cases like this would be beneficial so that detection does not depend on a human, especially since a problem with the config apply could potentially disrupt traffic to all of our applications.

Happy to provide more information if needed.
