-
Hey @michasHL, thanks for opening this discussion! Sorry about the confusion: the official documentation link you found points to an outdated version that should not be publicly accessible, and we're working on fixing that. In our recent 2.0 release, the Prometheus metrics changed because we no longer use the NGINX Prometheus Exporter. This is the current document: https://docs.nginx.com/nginx-gateway-fabric/monitoring/prometheus/#available-metrics-in-nginx-gateway-fabric. Please let me know if you have any questions about the metrics listed there. As for the issue you described, that does seem weird, and it would be greatly appreciated if you could open an issue/bug describing it.
By deleting the Helm chart, do you mean deleting NGF, or is this a Helm chart for your application?
-
Reading the discussion above, I'm curious why you would want an alert when a config reload fails. Wouldn't the engineer pushing the change verify that it took effect, and so always see when the config reload failed? Or is there some other action or scenario I'm missing that you would want the alert for?
-
Before opening an issue, I wanted to start with a discussion. Overnight, one of our clusters where NGF is deployed got into a wonky state in which the Gateway could not apply a config because a file was not found.
Error:
msg: Config apply failed, rolling back config; error: failed to parse config open /etc/nginx/includes/ClientSettingsPolicy_xyz-ngf-default-gateway-max-body.conf: no such file or directory in /etc/nginx/conf.d/http.conf:143
This happened around a deployment (using a Helm chart) in which we stand up a completely new instance of our application side-by-side to run tests against it, including an HTTPRoute and a ClientSettingsPolicy (to set the max body size). Once the test is complete, we delete the Helm chart and all the temporary resources, such as the route and the policy. It seems that under some circumstances, deleting the policy meant the policy file (/etc/nginx/includes/ClientSettingsPolicy_xyz-ngf-default-gateway-max-body.conf) was deleted before the actual /etc/nginx/conf.d/http.conf could be updated, which produced the error message above. After that happened, no further updates to nginx could be made, so new pods coming up were not added and old pods were still in the config.
I believe I should open a ticket for this issue, but please correct me if I'm wrong.
Of course, the title mentions metrics. To see whether we could get automated alerts working, we looked at the documentation on your website for the metrics that should be available: https://docs.nginx.com/nginx-gateway-fabric/how-to/monitoring/prometheus/#nginx-gateway-fabric-metrics
The document lists:
Unfortunately, checking our Prometheus instance, we only see event_batch_processing_millisecond coming through. I would like to understand why the reload failures and stale_config metrics are not available. It would be great if we could use those metrics to see when we're running into the same problem.
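For anyone who wants to double-check which metric names an endpoint really exposes, here is a minimal sketch. It assumes you have already saved the raw Prometheus text-format output of the metrics endpoint to a string. The helper names, the sample payload, and the `example_` prefix are all made up for illustration and are not real NGF metric names; the search terms are only the ones mentioned in this post.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text):
    """Return the set of metric names found on sample lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected_substrings):
    """Return the expected substrings that match no exposed metric name."""
    names = exposed_metric_names(exposition_text)
    return [s for s in expected_substrings
            if not any(s in name for name in names)]


# Hypothetical payload: the metric names below are invented for this sketch.
SAMPLE = """\
# HELP example_event_batch_processing_milliseconds hypothetical help text
# TYPE example_event_batch_processing_milliseconds histogram
example_event_batch_processing_milliseconds_bucket{le="500"} 3
example_event_batch_processing_milliseconds_sum 120
example_event_batch_processing_milliseconds_count 3
"""

if __name__ == "__main__":
    wanted = ["event_batch_processing", "reload", "stale_config"]
    print(missing_metrics(SAMPLE, wanted))
```

Matching on substrings rather than full names is deliberate here, since the exact metric names and prefixes depend on the NGF version and should be taken from the current documentation.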