monitoring: eagerly register apm-server metric instruments#20993
monitoring: eagerly register apm-server metric instruments#20993carsonip wants to merge 26 commits into
Conversation
🤖 GitHub commentsJust comment with:
|
158e1b5 to
65f5c2d
Compare
|
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
|
6308fdc to
c3e3347
Compare
apm-server's HTTP middleware, gRPC interceptor, agentcfg fetcher,
event counter, tail-sampling processor, and trace groups counter all
record metrics via OpenTelemetry instruments. OTel instruments are
lazy: an Int64Counter only appears in stats output once .Add() has
been called on it, because metricdata.Sum's data points slice stays
empty until then.
The lazy behavior silently breaks downstream tooling that expects
/stats to enumerate every metric apm-server defines. metricbeat
field mappings, the stack monitoring templates, and the
elastic_agent integration package can't be regenerated from a
stats capture unless every error-path counter happens to have fired
during the capture window.
This change introduces internal/otelmetric, a tiny package whose
Int64Counter / Int64UpDownCounter / Int64Gauge constructors record
a zero value at instrument creation. Each registration site that
defines apm-server.* metrics (agentcfg ES fetcher, intake handler,
event counter, beater transactions-dropped, tail-sampling event
metrics, dynamic-service-groups counter, HTTP monitoring middleware,
gRPC metrics interceptor) is converted to use these helpers.
The HTTP middleware and gRPC interceptor additionally walk the
canonical request.AllResultIDs list at construction so every IDR* /
IDR* + legacy-prefix counter is registered up front; this is the
big-impact case (covers acm, root, server, otlp.{grpc,http}.* error
metrics -- 137 leaves under apm-server.* with zero traffic, vs ~40
with brief traffic on unpatched main).
Histograms aren't eagerly initialized: each Record() observation
increments the data-point count, so a synthetic Record(0) would
pollute the count for tests that assert specific observation counts.
The monitoringtest assertion helper is updated accordingly: in
fullMatch mode, gathered counters / gauges / up-down counters not
listed in expectedMetrics must equal 0 (the natural pattern with
zero-init), while histograms are skipped from that rule.
Several tests that switched to ExpectContainOtelMetrics during the
OTel migration are reverted to ExpectOtelMetrics, with their
expected maps extended to enumerate the legacy-prefix siblings now
visible alongside http.server.* / grpc.server.* (mux_config_agent,
mux_intake_backend, mux_intake_rum, otlp http+grpc, eventcounter).
Tests that check specific values from a complex subsystem keep
ExpectContainOtelMetrics (server_test.go, sampling/processor_test,
the concurrent-safety interceptor test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c3e3347 to
77b78c2
Compare
After elastic#20993 (eager monitoring counters) the /stats endpoint exposes the full enumerated set of apm-server.* metrics from the moment the server starts, including apm-server.sampling.tail.storage.disk_usage_threshold_pct, a Float64 gauge that's reported as 0.8 by default. Previously the tool aborted on non-integer numbers; map them to type "float" instead, matching the existing upstream typing in monitoring-beats*.json and the Beats / Integrations fields.yml files. Refresh the test fixtures: - testdata/stats.json: captured from a TBS-enabled apm-server built from the eager-monitoring-counters branch. Includes the float disk_usage_threshold_pct and the full eager-emitted counter set. - testdata/inputs/ea-beat-stats-fields.yml: re-snapshot from elastic/integrations main; the other four upstream snapshots were already current. - testdata/golden/*: regenerated from the new inputs and stats. Goldens now contain disk_usage_threshold_pct as type: float (typed files) or as an alias entry (beat-root-fields.yml), and the full set of otlp.{grpc,http}.{logs,metrics,traces} response error/valid families that previously only appeared after a matching response. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
carsonip
left a comment
There was a problem hiding this comment.
i'll review the code again tomorrow, but it looks right to me at a high level. At the very least it is fixing a bug.
ericywl
left a comment
There was a problem hiding this comment.
Looks good overall, just minor nits.
bb9238d to
42b5e38
Compare
Lift the four-times-duplicated "exactly one dimensionless point" check inside getScalarInt64 and getScalarFloat64 into a generic helper, scalarValue[T int64 | float64]. Each switch arm in the two getter functions collapses to a single line. The validation rule now has one home; a future change to it lands in one place instead of four. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified by removing the getScalarFloat64 branch in addAPMServerMetricsToMap: the test failed with the missing disk_usage_threshold_pct surfaced in the diff. Restoring the branch passes again, locking in the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52d0f79 to
0fd598b
Compare
Three-metric input covering the int64 sum and float64 gauge cases the function dispatches over. Benchmark is the baseline a follow-up commit replaces strings.Split with strings.SplitAfterSeq against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
strings.Split allocates a []string per call to addAPMServerMetricsToMap;
strings.SplitAfterSeq returns an iterator. Each yielded segment carries
the trailing "." except the last, so strings.CutSuffix(seg, ".")
distinguishes intermediate (descend into nested map) from leaf (assign
the value) without buffering the whole path or carrying lookahead state.
benchstat (10 runs, ./internal/beatcmd):
│ split │ splitafterseq │
│ sec/op │ sec/op vs base │
AddAPMServerMetricsToMap-16 1.175µ ± 1% 1.078µ ± 2% -8.22% (p=0.000 n=10)
AddAPMServerMetricsToMap-16 B 2.492Ki ± 0% 2.305Ki ± 0% -7.52% (p=0.000 n=10)
AddAPMServerMetricsToMap-16 a 18.00 ± 0% 15.00 ± 0% -16.67% (p=0.000 n=10)
The 3-allocation drop equals the metric count: one slice per metric
removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps a single Metrics() invocation in b.Loop() to measure the per-call cost of the interceptor's hot path (3-4 m.inc() calls plus one histogram record). Used as the baseline a follow-up commit replaces sync.Map with a plain map against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Counters and the request.duration histogram are pre-populated in
Metrics() before the interceptor is returned; nothing creates or
mutates them at request time. Replace sync.Map (designed for concurrent
mutation) with a plain map[counterKey]Counter and a single histogram
field. Read-only after construction = safe for concurrent gRPC calls.
The struct counterKey{prefix, id} avoids the per-call "prefix +
string(id)" string concat that the sync.Map path needed for its key,
killing the two allocations every m.inc() did.
Drift becomes loud: a missing (prefix, id) entry now logs an error
instead of silently creating a counter on first use.
benchstat (10 runs, ./internal/beater/interceptors):
│ syncmap │ plain map │
│ sec/op │ sec/op vs base │
Interceptor-16 738.1n ± 0% 405.2n ± 2% -45.11% (p=0.000 n=10)
Interceptor-16 B 304.0 ± 0% 0.0 ± 0% -100.00% (p=0.000 n=10)
Interceptor-16 a 9.000 ± 0% 0.000 ± 0% -100.00% (p=0.000 n=10)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the "Drift contract" prose at Interceptor() in favour of a one- sentence rule: any counter or histogram used by Interceptor must be eagerly created in Metrics(). The struct counterKey gets a one-liner explaining why it's a struct: avoids per-call "prefix+id" string concat and the alloc that comes with it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
708cf41 to
6e18089
Compare
| diskThresholdGauge, _ := storageMeter.Float64Gauge("apm-server.sampling.tail.storage.disk_usage_threshold_pct") | ||
| diskThresholdGauge.Record(context.Background(), 0.8) |
There was a problem hiding this comment.
[to reviewer] new test that catches missing float reporting
| // legacy "apm-server.otlp.grpc.<signal>." metric prefix. Single source of | ||
| // truth for both the per-call dispatch in Interceptor() and the eager | ||
| // registration loop in Metrics(); adding a new signal means one entry. | ||
| var otlpGRPCLegacyMetricsPrefixes = map[string]string{ |
There was a problem hiding this comment.
[to reviewer] Single source of truth for method name to metric prefix. A slight overhead in map lookup compared to switch case string comparison but negligible in the grand scheme of things
| type metricsInterceptor struct { | ||
| logger *logp.Logger | ||
| counters map[counterKey]metric.Int64Counter | ||
| requestDurationHist metric.Int64Histogram |
There was a problem hiding this comment.
[to reviewer] sorry for the major refactor here. sync.Map was added in #16182 to workaround a concurrent map access. Now that we change the paradigm to always create up front, we can cut the sync.Map overhead and benefit from metric appearing in /stats from registration time
with this change, see benchstat
│ /tmp/syncmap.txt │ /tmp/plainmap.txt │
│ sec/op │ sec/op vs base │
Interceptor-16 738.1n ± 0% 405.2n ± 2% -45.11% (p=0.000 n=10)
│ /tmp/syncmap.txt │ /tmp/plainmap.txt │
│ B/op │ B/op vs base │
Interceptor-16 304.0 ± 0% 0.0 ± 0% -100.00% (p=0.000 n=10)
│ /tmp/syncmap.txt │ /tmp/plainmap.txt │
│ allocs/op │ allocs/op vs base │
Interceptor-16 9.000 ± 0% 0.000 ± 0% -100.00% (p=0.000 n=10)
| // counterKey is the map key for metricsInterceptor.counters. A struct key | ||
| // is used instead of the concatenated "prefix+id" string so the lookup at | ||
| // each m.inc() call avoids allocating a fresh string. | ||
| type counterKey struct { |
There was a problem hiding this comment.
[to reviewer] this cuts the string prefix and name concat alloc per inc call
| // SplitAfterSeq yields each segment with its trailing "." preserved, | ||
| // except the last. CutSuffix tells us which we're looking at: hasDot | ||
| // true => intermediate (descend), false => leaf (assign). | ||
| for seg := range strings.SplitAfterSeq(suffix, ".") { |
There was a problem hiding this comment.
[to reviewer] strings.Split -> strings.SplitAfterSeq perf gain
│ split.txt │ splitafterseq.txt │
│ sec/op │ sec/op vs base │
AddAPMServerMetricsToMap-16 1.175µ ± 1% 1.078µ ± 2% -8.22% (p=0.000 n=10)
│ split.txt │ splitafterseq.txt │
│ B/op │ B/op vs base │
AddAPMServerMetricsToMap-16 2.492Ki ± 0% 2.305Ki ± 0% -7.52% (p=0.000 n=10)
│ split.txt │ splitafterseq.txt │
│ allocs/op │ allocs/op vs base │
AddAPMServerMetricsToMap-16 18.00 ± 0% 15.00 ± 0% -16.67% (p=0.000 n=10)
Wraps a single Apply(MonitoringMiddleware, HandlerIdle) call in b.Loop() to measure the per-request cost of the middleware's hot path. Used as the baseline a follow-up commit replaces sync.Map with an eager-only map against. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same refactor as elastic#20993 applied to the gRPC interceptor: every counter is created up front in MonitoringMiddleware() and the backing maps are read-only at request time, so sync.Map's concurrent-mutation guarantees aren't needed. Drift becomes runtime-loud: a missing entry is reported via logger.Error instead of silently creating a counter on first use. Two `map[request.ResultID]Counter` (one for "http.server.<id>", one for "<legacyMetricsPrefix><id>") avoid the per-call "prefix + string(id)" string concat, killing every per-request allocation on the middleware hot path. The MonitoringMiddleware constructor now takes a *logp.Logger so inc() can report drift; the only production caller (api/mux.go's apmMiddleware) already has a logger in scope. benchstat (10 runs, ./internal/beater/middleware): │ syncmap │ plain map │ │ sec/op │ sec/op vs base │ MonitoringMiddleware-16 906.6n ± 2% 432.2n ± 1% -52.32% (p=0.000 n=10) MonitoringMiddleware-16 B 320.0 ± 0% 0.0 ± 0% -100.00% (p=0.000 n=10) MonitoringMiddleware-16 a 11.00 ± 0% 0.00 ± 0% -100.00% (p=0.000 n=10) Existing TestMonitoringHandler (5 outcome subtests + Parallel race test) covers all paths and passes unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| logger: logger, | ||
| legacyMetricsPrefix: legacyMetricsPrefix, | ||
| counters: sync.Map{}, | ||
| histograms: sync.Map{}, | ||
| serverCounters: serverCounters, | ||
| legacyCounters: legacyCounters, | ||
| requestDurationHist: requestDurationHist, |
There was a problem hiding this comment.
[to reviewer] this gives us
│ syncmap │ plain map │
│ sec/op │ sec/op vs base │
MonitoringMiddleware-16 906.6n ± 2% 432.2n ± 1% -52.32% (p=0.000 n=10)
320.0 ± 0% 0.0 ± 0% -100.00% (p=0.000 n=10) [B/op]
11.00 ± 0% 0.00 ± 0% -100.00% (p=0.000 n=10) [allocs]
5b05eae to
fd48728
Compare
The "monitoring counter not eagerly registered" log line meant well but didn't tell the operator what to do with it. Prefix the message with "BUG:" and link the issue tracker so a stack-monitoring user seeing this in their logs knows it's an apm-server bug to file, not something they configured wrong. Same change applied to the matching inc() in interceptors/metrics.go and middleware/monitoring_middleware.go. Doc comments on Interceptor / Middleware / inc() updated to call out the contract violation as a bug rather than a generic "logs an error". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fd48728 to
f977af3
Compare
The previous change loosened ExpectOtelMetrics so that any unlisted gathered metric was implicitly treated as zero. That broke the function's own contract: "fullMatch means full" -- a strict set equality check should not silently allow extra metrics, even when those extras happen to be zero. Restore ExpectOtelMetrics to its original strict semantics: gathered ≡ expected via assert.ElementsMatch, with an "unexpected metric" failure on any unlisted gathered metric. Tests that exercise eagerly zero-initialized counters (HTTP middleware, gRPC interceptor) bootstrap their expected map via the new monitoringtest.EagerCountersZeros helper, which returns int64(0) for every (prefix, id) pair the test declares the implementation to eagerly initialize. The eager-init contract becomes part of the asserted state, visible at the test site. Add ExpectOtelMetricsNonZero for tests where listing every eager zero is impractical (mux api tests, otlp tests, eventcounter test) because the test exercises a much larger graph of eager-init counters than it cares about. Semantics: every listed metric is asserted exactly; every unlisted gathered metric must equal 0 (histograms exempt). The test signal -- "no surprise non-zero metric leaked" -- is preserved without forcing the test to enumerate the noise floor. ExpectContainOtelMetrics is unchanged (subset semantic). Internally, assertOtelMetrics now takes a matchMode enum (matchModeFull / matchModeNonZero / matchModeContain) instead of a fullMatch bool, so each public helper maps to one mode and the behavior is grep-able from either side. Test sites updated: - middleware monitoring_middleware_test.go: ExpectOtelMetrics + helper. - interceptors metrics_test.go: ExpectOtelMetrics + helper. - api/mux_*_test.go (4 files): ExpectOtelMetricsNonZero. - otlp http_test.go, grpc_test.go: ExpectOtelMetricsNonZero. - modelprocessor/eventcounter_test.go: ExpectOtelMetricsNonZero. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efixes The previous change loosened ExpectOtelMetrics so that any unlisted gathered metric was implicitly treated as zero. That broke the function's own contract -- "fullMatch means full" should not silently allow extra metrics, even when those extras happen to be zero. Restore the strict semantics. ExpectOtelMetrics gains an optional WithEagerPrefixes(prefixes ...string) variadic that declares prefixes under which an unlisted gathered counter is tolerated *provided its value is 0*; histograms and counters outside those prefixes still trigger "unexpected metric". This is the feature middleware and interceptor tests need: their eager-init footprint is bounded (http.server.* / "" for the middleware; grpc.server.* and the three otlp.grpc.<signal>.* prefixes for the interceptor). Tests with an unbounded eager-init footprint -- mux api tests, otlp tests, eventcounter test -- go back to ExpectContainOtelMetrics (subset semantics). That is what they used before the lazy-counter fix, and it remains the right contract: those tests instantiate the full mux / OTLP server / processor which eagerly registers many prefixes, and the test only cares about the counters its own code path increments. Drop the temporarily-introduced ExpectOtelMetricsNonZero -- nothing uses it now. scalarValue() is extracted from the original four type-switch arms in assertOtelMetrics so the new tolerance branch can read isHistogram without duplicating the check across three of the arms. ElementsMatch is replaced with Subset on the inverse loop because the tolerance branch lets foundMetrics legitimately contain entries not in expectedMetrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added WithEagerPrefixes to monitoringtest so ExpectOtelMetrics could tolerate unlisted-but-zero counters under declared prefixes. Simpler: tests that want a strict full match just build the full expected map themselves, including all eagerly-zero entries, before calling ExpectOtelMetrics. middleware/monitoring_middleware_test.go's TestMonitoringHandler and interceptors/metrics_test.go's TestMetrics each construct the prefix×AllResultIDs cross-product as int64(0) and override with the non-zero values the handler/interceptor actually fires. No helper options, no eager-prefix concept inside monitoringtest, no scalarValue extraction, no ElementsMatch→Subset migration. monitoringtest/opentelemetry.go is now byte-identical to main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…source
The previous shape had three places to keep in sync:
- the const block declaring each ResultID,
- MapResultIDToStatus listing HTTP-status-mapped IDs,
- AllResultIDs's hardcoded slice of "unconditional" IDs plus
MapResultIDToStatus keys.
A new ResultID could land in the const block, get used by an inc()
call site, and quietly miss AllResultIDs (so its counter never
appeared in /stats from process start). The "drift contract" comment
relied on humans reading it.
Replace with a single resultIDStatus map keyed by ResultID. Each
entry's value is its Status (zero-valued for unconditional IDs).
init() iterates resultIDStatus once to populate AllResultIDs and the
HTTP-status-mapped subset MapResultIDToStatus. Adding a new ResultID
is one new entry there; both derived structures pick it up
automatically. Forgetting the resultIDStatus entry on a fresh const
surfaces at runtime as the "BUG: monitoring counter missing from
eager registration" log fired by inc(), which the strict
ExpectOtelMetrics tests fail on.
Also drop IDEventReceivedCount and IDEventDroppedCount: dead since
the Jaeger removal in elastic#14791, no callers in the codebase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| // IDEventReceivedCount identifies amount of received events | ||
| IDEventReceivedCount ResultID = "event.received.count" | ||
| // IDEventDroppedCount identifies amount of dropped events | ||
| IDEventDroppedCount ResultID = "event.dropped.count" |
There was a problem hiding this comment.
[to reviewer] according to Claude: IDEventReceivedCount / IDEventDroppedCount removed — dead consts since the Jaeger removal in #14791, no callers
| // Status (every ID c.Result.ID can be set to). Populated by init(). | ||
| var MapResultIDToStatus = map[ResultID]Status{} | ||
|
|
||
| func init() { |
There was a problem hiding this comment.
[to reviewer] sorry for major refactor here again. This is to allow eager registration to be aware of all the result IDs and make the code drift-proof.
| } else if v, ok := getScalarFloat64(m.Data); ok { | ||
| value = v | ||
| } else { | ||
| continue |
There was a problem hiding this comment.
Would it be a good idea to have some indication that these metrics are being dropped?
carsonip
left a comment
There was a problem hiding this comment.
TODO
- double check if
addDocappenderLibbeatOutputMetricsandaddDocappenderLibbeatPipelineMetricsare working
addDocappenderLibbeatOutputMetrics and addDocappenderLibbeatPipelineMetrics now accumulate values into locals and report every field unconditionally, including write.bytes. The previous shape only reported a field when the matching docappender OTel metric had data points, so the libbeat.output.* and libbeat.pipeline.events.total fields were missing from /stats until the first ES publish. The mapping-regeneration tooling reads a single /stats snapshot, so absent fields silently drop from regenerated mappings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rly" This reverts commit 10a8d1a.
… zero
Records the contract that the docappender-derived libbeat.output.* and
libbeat.pipeline.events.total fields surface in /stats at zero before
any indexing happens. Five of the eight fields (events.{active, batches,
total}, write.bytes, pipeline.events.total) are unblocked by the
upstream Add(0) eager-init in elastic/go-docappender#317.
The test is skipped until apm-server's docappender go.mod entry is
bumped to a release that includes the eager-init change. It can be
validated locally before that release lands by adding a replace
directive in go.mod pointing at the docappender PR branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The guard suppressed libbeat.output.write.bytes whenever the docappender flushed-bytes counter was 0 (process start, or any /stats snapshot taken between flushes). Since writeBytes is initialised to 0 and assigned only from a real elasticsearch.flushed.bytes data point, emitting it unconditionally is faithful to the source: 0 means "0 bytes flushed", not "value unknown". Also extend TestLibbeatMetrics's first-snapshot assertion to include output.write.bytes=0, which now appears before any flush completes. This is independent of the docappender eager-init bump (elastic/go-docappender#317): with the bump in place the counter still reads 0 at startup, so the guard would still suppress write.bytes there too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The libbeat.output.* and libbeat.pipeline.* sub-trees in /stats are populated by per-scrape NewFunc callbacks that read docappender's OTel ScopeMetrics. Pre-this-change, each field emitted only when its source metric had at least one data point — so before docappender's first publish, six fields were missing entirely (and three more were suppressed by a now-removed writeBytes>0 guard). Downstream mapping-regeneration tooling reads a single /stats snapshot, so absent fields silently drop from the regenerated mappings. Replace the per-call accumulate-then-emit shape with two persistent state structs (libbeatOutputState, libbeatPipelineState) initialised to zero at process start. Each NewFunc updates only the fields docappender currently exports and then reports the full set: - Eager: every field surfaces in /stats from the first scrape, before any indexing has happened, because the structs zero-init. - No fabrication on missing scrapes: a rename or drop of a docappender instrument leaves the corresponding state field at its last-known value rather than snapping back to a fabricated 0 each call. - Drift detection: TestLibbeatMetrics exercises real traffic and asserts specific cumulative values for every field, so a rename fails that test loudly. The "type": "elasticsearch" string and the field emission are gated on having ever observed the docappender scope, so console-output runs keep an empty libbeat.output sub-tree as before. Add TestLibbeatMetricsEagerZero to lock the eager-zero contract: build a Beat with output.elasticsearch, create a docappender with no traffic, and assert every field appears at zero in the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💛 Build succeeded, but was flaky
Failed CI StepsHistory
|
Motivation/summary
apm-server's monitoring tree (the
/statsHTTP endpoint and the libbeat monitoring snapshot) is the source we use to generate three sets of upstream field mappings:beatmodule's_meta/fields.ymlandstats/_meta/fields.ymlmonitoring-beats.jsonandmonitoring-beats-mb.jsonbeat-stats-fields.ymlMappings can include fields that aren't always emitted at runtime — the problem is the tooling we use to derive those mappings. The stats-to-mapping helper from the closed #13638 (and any equivalent script) reads a single
/statssnapshot and infers field types from what's there. If a counter only appears in/statsafter some specific code path has fired, it's not in the snapshot and silently gets dropped from the regenerated mappings.After #15094 "Translate otel metrics to libbeat monitoring", that's exactly what happens: the OpenTelemetry instruments backing the new monitoring path are lazy. An
Int64Counteronly contributes a data point — and therefore only appears in/stats— once.Add()has been called on it. Counters whose code path hasn't fired yet (most error responses, OTLP variants we haven't seen traffic for, agent-config cache hits/misses, etc.) are absent from the snapshot. Pre-#15094, libbeat monitoring counters were registered up front and emitted as0by default, which gave/statsan exhaustive enumeration of every counter the server defines.This PR restores that property under the OpenTelemetry path so the mapping-generation tooling can regenerate field mappings from a single
/statscapture without needing to drive every error path.What this changes
Eager zero-init helper
Adds a small
internal/otelmetricpackage whose constructors (NewInt64Counter,NewInt64UpDownCounter) call.Add(0)at instrument creation, ensuring each instrument has at least one data point so the SDK exports it. Gauges and histograms are intentionally not in the helper: a syntheticRecord(0)would falsely declare a gauge's current value as 0 and would inflate a histogram's observation count.Per-package eager registration
Each registration site that defines
apm-server.*metrics is converted to use the helpers:internal/agentcfg/elasticsearch.go(6 counters underapm-server.agentcfg.elasticsearch.*plus the cache-entries gauge converted to anInt64UpDownCounterwith delta tracking, since the cumulative sum of its deltas equals the current cache size)internal/beater/api/intake/handler.go(3 counters underapm-server.processor.stream.*)internal/beater/beater.go(apm-server.sampling.transactions_dropped)internal/model/modelprocessor/eventcounter.go(apm-server.processor.<event-type>.transformations)x-pack/apm-server/sampling/processor.go(6 counters underapm-server.sampling.tail.events.*)x-pack/apm-server/sampling/groups.go(apm-server.sampling.tail.dynamic_service_groups)HTTP middleware and gRPC interceptor
MonitoringMiddlewareandmetricsInterceptorproduce counter names that depend on arequest.ResultIDchosen per-request. A newrequest.AllResultIDsslice is iterated at construction so every(prefix, ResultID)pair has its counter created up front. This is the largest single contributor — it coversapm-server.acm.*,apm-server.root.*,apm-server.server.*,apm-server.otlp.{grpc,http}.*error counters that previously only appeared after a matching error response.The interceptor's and middleware's per-request counter storage was previously a
sync.Map(added in #16182 to fix a concurrent-map race). Since every(prefix, ResultID)pair is now created in advance, the maps are never written after construction andsync.Map's concurrent-mutation guarantees aren't needed. Both refactor to a plain map populated at startup and read-only at request time. The map keys avoid the per-callprefix + string(id)string concat (interceptor uses a{prefix, ResultID}struct key; middleware uses twomap[ResultID]Countersince the prefix is fixed per-instance), so the hot path becomes allocation-free. A missing entry now logs"BUG: monitoring counter missing from eager registration"instead of silently creating a counter on first use — drift becomes runtime-loud.The OTLP gRPC dispatch (three method names → three legacy prefixes) is consolidated into a single
otlpGRPCLegacyMetricsPrefixesmap keyed by gRPC method. The same map drives both per-call dispatch inInterceptor()and the eager-registration loop inMetrics(); adding a new OTLP signal is one map entry.Single source of truth for the canonical ResultID set
internal/beater/request/result.gopreviously declared the same set of ResultIDs in three places: a const block, theMapResultIDToStatusmap, and aResultID-slice literal inAllResultIDs. A new ID could land in the const block and any inc() call site without ever being added toAllResultIDs, silently breaking the eager-registration property.The file now has a single
resultIDStatusmap keyed by ResultID;init()derives bothAllResultIDsandMapResultIDToStatusfrom it. Adding a ResultID is one new entry; forgetting the entry surfaces immediately because the new ID isn't inAllResultIDs, the eager loop skips it, and the first inc() logs the BUG line which the strict tests fail on.Two dead constants (
IDEventReceivedCount,IDEventDroppedCount— leftovers from the Jaeger removal in #14791) are removed in the same change since they had no remaining callers.Float64 translator gap (fixes #20996)
apm-server.sampling.tail.storage.disk_usage_threshold_pctis aFloat64Gaugeintroduced in #20464 (present in 9.2.0+) and has never appeared in/stats, even with TBS enabled and the storage manager actively recording values.addAPMServerMetricsToMapininternal/beatcmd/beat.goonly handledSum[int64]/Gauge[int64]; float64 instruments fell through theelse. The translator is extended with agetScalarFloat64path so float-typedapm-server.*instruments are reported viamonitoring.ReportFloat. The validation rule for "single dimensionless data point" is hoisted into a generic helperscalarValue[T int64 | float64]shared by both getters.TestMonitoringApmServeris extended with aFloat64Gaugeto lock the float branch in.unsupported_droppedguard removalapm-server.otlp.{grpc,http}.metrics.consumer.unsupported_droppedwas guarded byif stats.UnsupportedMetricsDropped > 0inside the observable-counter callback, so it never appeared until at least one unsupported metric was dropped. The guard is removed; the callback now always observes the (possibly zero) counter.libbeat.output / libbeat.pipeline lazy emission
The docappender → libbeat translation in
addDocappenderLibbeatOutputMetricsandaddDocappenderLibbeatPipelineMetrics(internal/beatcmd/beat.go) used to report eachlibbeat.output.events.*/libbeat.output.write.bytes/libbeat.pipeline.events.totalfield only inside the matching metric-name switch arm. Before docappender's first publish — i.e., at process start with ES output configured — none of the source metric names had data points, so eight fields were missing from/stats:Two earlier attempts to address this were rejected: an always-emit-zero adapter (10a8d1a, reverted in 702bc90) fabricated zeros even when docappender stopped reporting an instrument, masking renames; an upstream
Add(0)PR inelastic/go-docappender#317(now closed) put the eager init at the right architectural layer but couldn't seed the three status-attributed fields without changing docappender's test contract forevents.processed.This PR converts the two
NewFunccallbacks to use per-instance persistent state structs (libbeatOutputState,libbeatPipelineState) backed byatomic.Bool/atomic.Int64fields. The structs zero-initialise at process start, so every field surfaces in/statsfrom the first scrape, before any indexing has occurred. Each scrape updates only the fields docappender currently exports and reports the full set; a rename or drop of an upstream instrument leaves the field at its last-known value rather than snapping back to a fabricated zero. Theif writeBytes > 0guard is removed in the same change, sincewriteBytesis now a struct field rather than a per-call local."type": "elasticsearch"and the field emission are gated on having ever observed the docappender scope, so console-output runs keep an emptylibbeat.outputsub-tree as before.Drift in docappender's instrument names is caught by
TestLibbeatMetrics, which exercises real traffic and asserts specific cumulative values for every field — a rename fails that test loudly.TestLibbeatMetricsEagerZerolocks the eager-zero contract directly: build a Beat withoutput.elasticsearch, create a docappender with no traffic, assert every field appears at zero in the snapshot.Walker performance
addAPMServerMetricsToMapwalks each metric's dotted path. The originalstrings.Splitallocated a[]stringper call; replaced withstrings.SplitAfterSeqwhich yields each segment with its trailing.preserved, andstrings.CutSuffix(seg, ".")distinguishes intermediate (descend) from leaf (assign) without lookahead state. Benchstat over a 3-metric input:Interceptor and middleware performance
The sync.Map → plain map swap, combined with the struct/ID-keyed lookup, takes both hot paths to zero allocations:
Tests
Strict-match assertions stay strict.
monitoringtest/opentelemetry.gois byte-identical tomain.TestMonitoringHandler(middleware) andTestMetrics(interceptor) build the full expected map in-place —prefix × request.AllResultIDs → int64(0)plus the per-test non-zero overrides — before calling the existingExpectOtelMetrics. Tests with an unbounded eager-init footprint (mux api tests, otlp http/grpc tests, eventcounter test) stay onExpectContainOtelMetrics(subset semantics), which is what they used before #15094.TestMonitoringApmServerextended with aFloat64Gaugeso thegetScalarFloat64path is verified.Two benchmarks added so future regressions are caught:
BenchmarkAddAPMServerMetricsToMapininternal/beatcmd/beat_test.go.BenchmarkInterceptorininternal/beater/interceptors/metrics_test.go.BenchmarkMonitoringMiddlewareininternal/beater/middleware/monitoring_middleware_test.go.How to test these changes
Before this change,
apm-server.*shows only metrics that have actually been incremented (typically just the fewsampling.tail.storage.*int64 gauges if TBS is enabled, plus whatever traffic has triggered). After this change, the full enumerated set (acm,agentcfg,otlp,processor,root,sampling,server) is present at zero — anddisk_usage_threshold_pctshows up undersampling.tail.storagefor the first time.Unit tests:
go test -race ./internal/... ./x-pack/...Related issues
Unblocks the mapping-regeneration tooling in #13638 and the periodic upstream-mapping-sync work in #15533 and #13475.
Part of #20991.
Fixes #20996 as a side effect (float64 translator gap).