
Prometheus metrics are redundant and slow #4644

@renbou


While digging into Caddy's source code, I noticed that every (!) route handler is wrapped in a metricsInstrumentedHandler, which updates Prometheus metrics during request execution. Metrics are a great feature and should definitely be enabled by default in Caddy, but the current implementation uses way too much CPU time and the metrics it provides are largely redundant.

Since Caddy tries each route handler in order until one of them answers the request, metric instrumentation runs for every handler, even if that handler didn't actually take part in resolving the request (which is also quite difficult to define in this context). As a result, handler-specific metrics are constantly updated with unrelated data, so pretty much all of them are meaningless and are only usable for tracking server-wide stats.
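To illustrate the effect described above, here is a minimal sketch using only the Go standard library. It is not Caddy's actual code: `Handler`, `instrument`, `chain`, `passThrough`, `respond` and `countsAfter` are hypothetical names, and a plain counter stands in for the Prometheus metrics. It assumes a chain in which each handler either answers or passes the request on to the next one.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Handler is a hypothetical middleware shape: it may answer the request
// itself or pass it on to the next handler in the chain.
type Handler func(w http.ResponseWriter, r *http.Request, next http.Handler)

// instrument is a stand-in for per-handler metrics wrapping. It bumps the
// handler's counter for every request that reaches it, whether or not the
// wrapped handler ends up producing the response.
func instrument(count *int64, h Handler) Handler {
	return func(w http.ResponseWriter, r *http.Request, next http.Handler) {
		atomic.AddInt64(count, 1)
		h(w, r, next)
	}
}

// chain composes handlers so that each one receives the rest of the chain
// as its next handler.
func chain(handlers ...Handler) http.Handler {
	var next http.Handler = http.NotFoundHandler()
	for i := len(handlers) - 1; i >= 0; i-- {
		h, n := handlers[i], next
		next = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			h(w, r, n)
		})
	}
	return next
}

// passThrough never answers the request; respond always does.
func passThrough(w http.ResponseWriter, r *http.Request, next http.Handler) {
	next.ServeHTTP(w, r)
}

func respond(w http.ResponseWriter, r *http.Request, next http.Handler) {
	fmt.Fprint(w, "ok")
}

// countsAfter sends n requests through a three-handler chain and returns
// the per-handler counters.
func countsAfter(n int) [3]int64 {
	var counts [3]int64
	srv := chain(
		instrument(&counts[0], passThrough),
		instrument(&counts[1], passThrough),
		instrument(&counts[2], respond),
	)
	for i := 0; i < n; i++ {
		srv.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	}
	return counts
}

func main() {
	// Only the last handler answers, yet all three counters come out equal.
	fmt.Println(countsAfter(100))
}
```

In this sketch every request costs three counter updates, and the counters cannot tell the handler that answered apart from the ones that merely passed the request along.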

As an example, here are the metrics for a simple Caddy server with 2 hosts: one has only a reverse_proxy handler, and the other has 2 respond handlers, 1 reverse_proxy handler and 1 file_server handler. The metrics were taken after running an HTTP load-testing tool against the endpoints.
prometheus-metrics.txt
As seen in the example, all of the handler-specific metrics are pretty much the same for all handlers, even though in reality only the respond handler was requested.

The handler metrics are even less useful if the web server hosts multiple domains, since requests from all domains get mixed together in the metrics.
Questionable metrics wouldn't be much of an issue if they were as cheap to maintain as server-wide metrics, but they aren't, since they get updated multiple times before the request is finally answered.

I ran pprof while generating load with h2load against one of the simple respond handlers, and it turned out that 73% of the time spent in caddyhttp.Server.primaryHandlerChain.ServeHTTP was in the metricsInstrumentedHandler (only 30% of the time was spent in the actual caddyhttp.StaticResponse.ServeHTTP). Here's the profile:
profile.zip
image

I really think that metrics such as these should be server-wide where they make more sense and are quicker. https://github.com/nginxinc/nginx-prometheus-exporter could be seen as an example of similar Prometheus metrics commonly used for nginx.
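A minimal sketch of what server-wide instrumentation could look like, again using only the Go standard library: the whole handler chain is wrapped once, so each request triggers exactly one metrics update regardless of how many handlers it passes through. `serverStats`, `wrap` and `serve` are hypothetical names, and plain atomic counters stand in for real Prometheus collectors.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// serverStats is a stand-in for server-wide metrics.
type serverStats struct {
	requests int64 // total requests served
	nanos    int64 // cumulative handling time in nanoseconds
}

// wrap instruments an entire handler chain exactly once per request.
func (s *serverStats) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		atomic.AddInt64(&s.requests, 1)
		atomic.AddInt64(&s.nanos, int64(time.Since(start)))
	})
}

// serve pushes n requests through a wrapped mux and returns the request count.
func serve(n int) int64 {
	stats := &serverStats{}
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	srv := stats.wrap(mux)
	for i := 0; i < n; i++ {
		srv.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	}
	return atomic.LoadInt64(&stats.requests)
}

func main() {
	fmt.Println(serve(100)) // one update per request
}
```

The trade-off is that per-handler breakdowns are lost, which matters little if, as argued above, they carry no usable signal in the first place.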

Labels: bug (Something isn't working), discussion (The right solution needs to be found), help wanted (Extra attention is needed), optimization (Performance or cost improvements)
