Prometheus metrics are redundant and slow

While digging into Caddy's source code, I noticed that every (!) route handler is wrapped in a `metricsInstrumentedHandler`, which updates Prometheus metrics during request execution. Metrics are a great feature and should definitely be enabled by default in Caddy, but the current implementation uses far too much CPU time and the metrics it provides are largely redundant.

Caddy tries each route handler in order until one of them answers the request, so the metric instrumentation runs for every handler in the chain, even for handlers that did not actually take part in resolving the request (which is admittedly hard to define in this context). Handler-specific metrics are therefore constantly updated with unrelated data. As a result, pretty much all of them are meaningless, and they are only usable for tracking server-wide stats.
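To make the effect concrete, here is a minimal, stdlib-only sketch of the pattern described above. It is not Caddy's actual code: the `Handler` type, `instrument`, `chain`, and the handler names are hypothetical stand-ins, and a plain map replaces the Prometheus collectors. It only shows that when every handler in a chain is wrapped individually, each per-handler counter is bumped on every request, no matter which handler answers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Handler is a middleware-style handler: it may answer the request itself
// or pass it on to next. (Hypothetical type for this sketch, not Caddy's API.)
type Handler func(w http.ResponseWriter, r *http.Request, next http.Handler)

// instrument wraps h so that a per-handler counter is incremented on every
// call, regardless of whether h ends up answering the request.
func instrument(name string, counts map[string]int, h Handler) Handler {
	return func(w http.ResponseWriter, r *http.Request, next http.Handler) {
		counts[name]++
		h(w, r, next)
	}
}

// chain builds one http.Handler that runs the handlers in order.
func chain(handlers ...Handler) http.Handler {
	var next http.Handler = http.NotFoundHandler()
	for i := len(handlers) - 1; i >= 0; i-- {
		h, n := handlers[i], next
		next = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			h(w, r, n)
		})
	}
	return next
}

// passThrough never answers; it only forwards to the next handler.
func passThrough(w http.ResponseWriter, r *http.Request, next http.Handler) {
	next.ServeHTTP(w, r)
}

// respond answers the request and ends the chain.
func respond(w http.ResponseWriter, r *http.Request, _ http.Handler) {
	fmt.Fprint(w, "hello")
}

// simulate sends n requests through the chain and returns the counters.
func simulate(n int) map[string]int {
	counts := map[string]int{}
	srv := chain(
		instrument("reverse_proxy", counts, passThrough),
		instrument("file_server", counts, passThrough),
		instrument("respond", counts, respond),
	)
	for i := 0; i < n; i++ {
		srv.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	}
	return counts
}

func main() {
	fmt.Println(simulate(100))
}
```

In this sketch all three counters end up at 100 even though only `respond` ever produced a response, so the per-handler numbers carry no more information than a single server-wide counter would.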

As an example, here are the metrics for a simple Caddy server with 2 hosts: one has only a `reverse_proxy` handler, and the other has 2 `respond` handlers, 1 `reverse_proxy` handler and 1 `file_server` handler. The metrics were taken after running an HTTP load-testing tool against the endpoints.
[prometheus-metrics.txt](https://github.com/caddyserver/caddy/files/8301618/prometheus-metrics.txt)
As seen in the example, all of the handler-specific metrics are pretty much the same for all handlers, even though in reality only the `respond` handler was requested.

The handler metrics are even less useful if the web server hosts multiple domains, since requests from all domains get mixed together in the metrics.
Questionable metrics would not be much of an issue if they were as cheap to maintain as server-wide metrics, but they are not: they get updated multiple times before the request is finally answered.

I ran `pprof` while generating load with `h2load` against one of the simple `respond` handlers, and it turned out that **73%** of the time spent in `caddyhttp.Server.primaryHandlerChain.ServeHTTP` was in the `metricsInstrumentedHandler` (only 30% of the time was spent by the actual `caddyhttp.StaticResponse.ServeHTTP`). Here's the profile:
[profile.zip](https://github.com/caddyserver/caddy/files/8301688/profile.zip)
![image](https://user-images.githubusercontent.com/30644072/158941116-2f2ba661-d144-4c2d-97c8-7aaed062436a.png)


I really think that metrics such as these should be server-wide, where they make more sense and are cheaper to collect. https://github.com/nginxinc/nginx-prometheus-exporter is an example of similar Prometheus metrics commonly used for nginx.
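As a rough illustration of what "server-wide" could mean, the sketch below wraps the whole handler chain exactly once and records a request count per status code plus the total time spent. This is only an assumption about one possible shape, not a proposed patch: `serverStats`, `statusRecorder`, `instrumentServer`, and `demo` are hypothetical names, and plain fields stand in for Prometheus collectors to keep the example stdlib-only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// serverStats holds server-wide counters. A real setup would likely use
// Prometheus collectors; plain fields keep this sketch dependency-free.
type serverStats struct {
	mu       sync.Mutex
	requests map[int]int   // request count keyed by response status code
	total    time.Duration // cumulative time spent serving requests
}

// statusRecorder captures the status code written by the wrapped chain.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// instrumentServer wraps the entire chain once, so metrics are updated
// a single time per request instead of once per handler.
func instrumentServer(stats *serverStats, chain http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		chain.ServeHTTP(rec, r)
		elapsed := time.Since(start)

		stats.mu.Lock()
		stats.requests[rec.status]++
		stats.total += elapsed
		stats.mu.Unlock()
	})
}

// demo serves two successful requests and one miss, then returns the counts.
func demo() map[int]int {
	stats := &serverStats{requests: map[int]int{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	})
	srv := instrumentServer(stats, mux)
	for _, path := range []string{"/ok", "/ok", "/missing"} {
		srv.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", path, nil))
	}
	return stats.requests
}

func main() {
	fmt.Println(demo())
}
```

With this shape the instrumentation cost is paid once per request, and labels such as the host could be added at the same single point if per-domain numbers are wanted.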


Prometheus metrics are redundant and slow #4644
