Skip to content

monitoring: eagerly register apm-server metric instruments#20993

Draft
carsonip wants to merge 26 commits into
elastic:mainfrom
carsonip:eager-monitoring-counters
Draft

monitoring: eagerly register apm-server metric instruments#20993
carsonip wants to merge 26 commits into
elastic:mainfrom
carsonip:eager-monitoring-counters

Conversation

@carsonip
Copy link
Copy Markdown
Member

@carsonip carsonip commented Apr 27, 2026

Motivation/summary

apm-server's monitoring tree (the /stats HTTP endpoint and the libbeat monitoring snapshot) is the source we use to generate three sets of upstream field mappings:

Mappings can include fields that aren't always emitted at runtime — the problem is the tooling we use to derive those mappings. The stats-to-mapping helper from the closed #13638 (and any equivalent script) reads a single /stats snapshot and infers field types from what's there. If a counter only appears in /stats after some specific code path has fired, it's not in the snapshot and silently gets dropped from the regenerated mappings.

After #15094 "Translate otel metrics to libbeat monitoring", that's exactly what happens: the OpenTelemetry instruments backing the new monitoring path are lazy. An Int64Counter only contributes a data point — and therefore only appears in /stats — once .Add() has been called on it. Counters whose code path hasn't fired yet (most error responses, OTLP variants we haven't seen traffic for, agent-config cache hits/misses, etc.) are absent from the snapshot. Pre-#15094, libbeat monitoring counters were registered up front and emitted as 0 by default, which gave /stats an exhaustive enumeration of every counter the server defines.

This PR restores that property under the OpenTelemetry path so the mapping-generation tooling can regenerate field mappings from a single /stats capture without needing to drive every error path.

What this changes

Eager zero-init helper

Adds a small internal/otelmetric package whose constructors (NewInt64Counter, NewInt64UpDownCounter) call .Add(0) at instrument creation, ensuring each instrument has at least one data point so the SDK exports it. Gauges and histograms are intentionally not in the helper: a synthetic Record(0) would falsely declare a gauge's current value as 0 and would inflate a histogram's observation count.

Per-package eager registration

Each registration site that defines apm-server.* metrics is converted to use the helpers:

  • internal/agentcfg/elasticsearch.go (6 counters under apm-server.agentcfg.elasticsearch.* plus the cache-entries gauge converted to an Int64UpDownCounter with delta tracking, since the cumulative sum of its deltas equals the current cache size)
  • internal/beater/api/intake/handler.go (3 counters under apm-server.processor.stream.*)
  • internal/beater/beater.go (apm-server.sampling.transactions_dropped)
  • internal/model/modelprocessor/eventcounter.go (apm-server.processor.<event-type>.transformations)
  • x-pack/apm-server/sampling/processor.go (6 counters under apm-server.sampling.tail.events.*)
  • x-pack/apm-server/sampling/groups.go (apm-server.sampling.tail.dynamic_service_groups)

HTTP middleware and gRPC interceptor

MonitoringMiddleware and metricsInterceptor produce counter names that depend on a request.ResultID chosen per-request. A new request.AllResultIDs slice is iterated at construction so every (prefix, ResultID) pair has its counter created up front. This is the largest single contributor — it covers apm-server.acm.*, apm-server.root.*, apm-server.server.*, apm-server.otlp.{grpc,http}.* error counters that previously only appeared after a matching error response.

The interceptor's and middleware's per-request counter storage was previously a sync.Map (added in #16182 to fix a concurrent-map race). Since every (prefix, ResultID) pair is now created in advance, the maps are never written after construction and sync.Map's concurrent-mutation guarantees aren't needed. Both refactor to a plain map populated at startup and read-only at request time. The map keys avoid the per-call prefix + string(id) string concat (interceptor uses a {prefix, ResultID} struct key; middleware uses two map[ResultID]Counter since the prefix is fixed per-instance), so the hot path becomes allocation-free. A missing entry now logs "BUG: monitoring counter missing from eager registration" instead of silently creating a counter on first use — drift becomes runtime-loud.

The OTLP gRPC dispatch (three method names → three legacy prefixes) is consolidated into a single otlpGRPCLegacyMetricsPrefixes map keyed by gRPC method. The same map drives both per-call dispatch in Interceptor() and the eager-registration loop in Metrics(); adding a new OTLP signal is one map entry.

Single source of truth for the canonical ResultID set

internal/beater/request/result.go previously declared the same set of ResultIDs in three places: a const block, the MapResultIDToStatus map, and a ResultID-slice literal in AllResultIDs. A new ID could land in the const block and any inc() call site without ever being added to AllResultIDs, silently breaking the eager-registration property.

The file now has a single resultIDStatus map keyed by ResultID; init() derives both AllResultIDs and MapResultIDToStatus from it. Adding a ResultID is one new entry; forgetting the entry surfaces immediately because the new ID isn't in AllResultIDs, the eager loop skips it, and the first inc() logs the BUG line which the strict tests fail on.

Two dead constants (IDEventReceivedCount, IDEventDroppedCount — leftovers from the Jaeger removal in #14791) are removed in the same change since they had no remaining callers.

Float64 translator gap (fixes #20996)

apm-server.sampling.tail.storage.disk_usage_threshold_pct is a Float64Gauge introduced in #20464 (present in 9.2.0+) and has never appeared in /stats, even with TBS enabled and the storage manager actively recording values. addAPMServerMetricsToMap in internal/beatcmd/beat.go only handled Sum[int64] / Gauge[int64]; float64 instruments fell through the else. The translator is extended with a getScalarFloat64 path so float-typed apm-server.* instruments are reported via monitoring.ReportFloat. The validation rule for "single dimensionless data point" is hoisted into a generic helper scalarValue[T int64 | float64] shared by both getters. TestMonitoringApmServer is extended with a Float64Gauge to lock the float branch in.

unsupported_dropped guard removal

apm-server.otlp.{grpc,http}.metrics.consumer.unsupported_dropped was guarded by if stats.UnsupportedMetricsDropped > 0 inside the observable-counter callback, so it never appeared until at least one unsupported metric was dropped. The guard is removed; the callback now always observes the (possibly zero) counter.

libbeat.output / libbeat.pipeline lazy emission

The docappender → libbeat translation in addDocappenderLibbeatOutputMetrics and addDocappenderLibbeatPipelineMetrics (internal/beatcmd/beat.go) used to report each libbeat.output.events.* / libbeat.output.write.bytes / libbeat.pipeline.events.total field only inside the matching metric-name switch arm. Before docappender's first publish — i.e., at process start with ES output configured — none of the source metric names had data points, so eight fields were missing from /stats:

libbeat.output.events.{acked,active,batches,failed,toomany,total}
libbeat.output.write.bytes
libbeat.pipeline.events.total

Two earlier attempts to address this were rejected: an always-emit-zero adapter (10a8d1a, reverted in 702bc90) fabricated zeros even when docappender stopped reporting an instrument, masking renames; an upstream Add(0) PR in elastic/go-docappender#317 (now closed) put the eager init at the right architectural layer but couldn't seed the three status-attributed fields without changing docappender's test contract for events.processed.

This PR converts the two NewFunc callbacks to use per-instance persistent state structs (libbeatOutputState, libbeatPipelineState) backed by atomic.Bool / atomic.Int64 fields. The structs zero-initialise at process start, so every field surfaces in /stats from the first scrape, before any indexing has occurred. Each scrape updates only the fields docappender currently exports and reports the full set; a rename or drop of an upstream instrument leaves the field at its last-known value rather than snapping back to a fabricated zero. The if writeBytes > 0 guard is removed in the same change, since writeBytes is now a struct field rather than a per-call local. "type": "elasticsearch" and the field emission are gated on having ever observed the docappender scope, so console-output runs keep an empty libbeat.output sub-tree as before.

Drift in docappender's instrument names is caught by TestLibbeatMetrics, which exercises real traffic and asserts specific cumulative values for every field — a rename fails that test loudly. TestLibbeatMetricsEagerZero locks the eager-zero contract directly: build a Beat with output.elasticsearch, create a docappender with no traffic, assert every field appears at zero in the snapshot.

Walker performance

addAPMServerMetricsToMap walks each metric's dotted path. The original strings.Split allocated a []string per call; replaced with strings.SplitAfterSeq which yields each segment with its trailing . preserved, and strings.CutSuffix(seg, ".") distinguishes intermediate (descend) from leaf (assign) without lookahead state. Benchstat over a 3-metric input:

                            │ split        │ splitafterseq                  │
                            │ sec/op       │ sec/op     vs base             │
AddAPMServerMetricsToMap-16   1.175µ ± 1%   1.078µ ± 2%  -8.22% (p=0.000)
                              2.492Ki ± 0%  2.305Ki ± 0% -7.52% (p=0.000) [B/op]
                              18.00 ± 0%    15.00 ± 0% -16.67% (p=0.000) [allocs]

Interceptor and middleware performance

The sync.Map → plain map swap, combined with the struct/ID-keyed lookup, takes both hot paths to zero allocations:

                            │ syncmap       │ plain map                          │
                            │ sec/op        │ sec/op       vs base               │
Interceptor-16                738.1n ± 0%    405.2n ± 2%   -45.11% (p=0.000)
                              304.0 ± 0%     0.0 ± 0%     -100.00% (p=0.000) [B/op]
                              9.000 ± 0%     0.000 ± 0%   -100.00% (p=0.000) [allocs]

MonitoringMiddleware-16       906.6n ± 2%    432.2n ± 1%   -52.32% (p=0.000)
                              320.0 ± 0%     0.0 ± 0%     -100.00% (p=0.000) [B/op]
                              11.00 ± 0%     0.00 ± 0%    -100.00% (p=0.000) [allocs]

Tests

Strict-match assertions stay strict. monitoringtest/opentelemetry.go is byte-identical to main. TestMonitoringHandler (middleware) and TestMetrics (interceptor) build the full expected map in-place — prefix × request.AllResultIDs → int64(0) plus the per-test non-zero overrides — before calling the existing ExpectOtelMetrics. Tests with an unbounded eager-init footprint (mux api tests, otlp http/grpc tests, eventcounter test) stay on ExpectContainOtelMetrics (subset semantics), which is what they used before #15094.

TestMonitoringApmServer extended with a Float64Gauge so the getScalarFloat64 path is verified.

Two benchmarks added so future regressions are caught:

  • BenchmarkAddAPMServerMetricsToMap in internal/beatcmd/beat_test.go.
  • BenchmarkInterceptor in internal/beater/interceptors/metrics_test.go.
  • BenchmarkMonitoringMiddleware in internal/beater/middleware/monitoring_middleware_test.go.

How to test these changes

# Build and run with no traffic.
go build -o /tmp/apm-server ./x-pack/apm-server
/tmp/apm-server -e -c apm-server.yml &
sleep 4
curl -s http://127.0.0.1:5066/stats | jq '."apm-server"'

Before this change, apm-server.* shows only metrics that have actually been incremented (typically just the few sampling.tail.storage.* int64 gauges if TBS is enabled, plus whatever traffic has triggered). After this change, the full enumerated set (acm, agentcfg, otlp, processor, root, sampling, server) is present at zero — and disk_usage_threshold_pct shows up under sampling.tail.storage for the first time.

Unit tests:

go test -race ./internal/... ./x-pack/...

Related issues

Unblocks the mapping-regeneration tooling in #13638 and the periodic upstream-mapping-sync work in #15533 and #13475.

Part of #20991.
Fixes #20996 as a side effect (float64 translator gap).

@github-actions
Copy link
Copy Markdown
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@carsonip carsonip force-pushed the eager-monitoring-counters branch from 158e1b5 to 65f5c2d Compare April 27, 2026 23:07
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 27, 2026

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.19 is the label to automatically backport to the 8.19 branch.
  • backport-9./d is the label to automatically backport to the 9./d branch. /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@carsonip carsonip force-pushed the eager-monitoring-counters branch 5 times, most recently from 6308fdc to c3e3347 Compare April 28, 2026 10:30
apm-server's HTTP middleware, gRPC interceptor, agentcfg fetcher,
event counter, tail-sampling processor, and trace groups counter all
record metrics via OpenTelemetry instruments. OTel instruments are
lazy: an Int64Counter only appears in stats output once .Add() has
been called on it, because metricdata.Sum's data points slice stays
empty until then.

The lazy behavior silently breaks downstream tooling that expects
/stats to enumerate every metric apm-server defines. metricbeat
field mappings, the stack monitoring templates, and the
elastic_agent integration package can't be regenerated from a
stats capture unless every error-path counter happens to have fired
during the capture window.

This change introduces internal/otelmetric, a tiny package whose
Int64Counter / Int64UpDownCounter / Int64Gauge constructors record
a zero value at instrument creation. Each registration site that
defines apm-server.* metrics (agentcfg ES fetcher, intake handler,
event counter, beater transactions-dropped, tail-sampling event
metrics, dynamic-service-groups counter, HTTP monitoring middleware,
gRPC metrics interceptor) is converted to use these helpers.

The HTTP middleware and gRPC interceptor additionally walk the
canonical request.AllResultIDs list at construction so every IDR* /
IDR* + legacy-prefix counter is registered up front; this is the
big-impact case (covers acm, root, server, otlp.{grpc,http}.* error
metrics -- 137 leaves under apm-server.* with zero traffic, vs ~40
with brief traffic on unpatched main).

Histograms aren't eagerly initialized: each Record() observation
increments the data-point count, so a synthetic Record(0) would
pollute the count for tests that assert specific observation counts.
The monitoringtest assertion helper is updated accordingly: in
fullMatch mode, gathered counters / gauges / up-down counters not
listed in expectedMetrics must equal 0 (the natural pattern with
zero-init), while histograms are skipped from that rule.

Several tests that switched to ExpectContainOtelMetrics during the
OTel migration are reverted to ExpectOtelMetrics, with their
expected maps extended to enumerate the legacy-prefix siblings now
visible alongside http.server.* / grpc.server.* (mux_config_agent,
mux_intake_backend, mux_intake_rum, otlp http+grpc, eventcounter).
Tests that check specific values from a complex subsystem keep
ExpectContainOtelMetrics (server_test.go, sampling/processor_test,
the concurrent-safety interceptor test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carsonip carsonip force-pushed the eager-monitoring-counters branch from c3e3347 to 77b78c2 Compare April 28, 2026 10:39
carsonip added a commit to carsonip/apm-server that referenced this pull request Apr 28, 2026
After elastic#20993 (eager monitoring counters) the /stats
endpoint exposes the full enumerated set of apm-server.* metrics from
the moment the server starts, including
apm-server.sampling.tail.storage.disk_usage_threshold_pct, a Float64
gauge that's reported as 0.8 by default. Previously the tool aborted on
non-integer numbers; map them to type "float" instead, matching the
existing upstream typing in monitoring-beats*.json and the Beats /
Integrations fields.yml files.

Refresh the test fixtures:

  - testdata/stats.json: captured from a TBS-enabled apm-server built
    from the eager-monitoring-counters branch. Includes the float
    disk_usage_threshold_pct and the full eager-emitted counter set.
  - testdata/inputs/ea-beat-stats-fields.yml: re-snapshot from
    elastic/integrations main; the other four upstream snapshots were
    already current.
  - testdata/golden/*: regenerated from the new inputs and stats.

Goldens now contain disk_usage_threshold_pct as type: float (typed
files) or as an alias entry (beat-root-fields.yml), and the full set of
otlp.{grpc,http}.{logs,metrics,traces} response error/valid families
that previously only appeared after a matching response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/beatcmd/beat.go
@carsonip carsonip marked this pull request as ready for review April 28, 2026 16:58
@carsonip carsonip requested a review from a team as a code owner April 28, 2026 16:58
Copy link
Copy Markdown
Member Author

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll review the code again tomorrow, but it looks right to me at a high level. At the very least it is fixing a bug.

@carsonip carsonip added the backport-active-all Automated backport with mergify to all the active branches label Apr 28, 2026
Copy link
Copy Markdown
Contributor

@ericywl ericywl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just minor nits.

Comment thread internal/beatcmd/beat.go Outdated
Comment thread internal/beater/interceptors/metrics.go Outdated
@carsonip carsonip force-pushed the eager-monitoring-counters branch 2 times, most recently from bb9238d to 42b5e38 Compare April 29, 2026 11:01
carsonip and others added 2 commits April 29, 2026 12:03
Lift the four-times-duplicated "exactly one dimensionless point"
check inside getScalarInt64 and getScalarFloat64 into a generic
helper, scalarValue[T int64 | float64]. Each switch arm in the two
getter functions collapses to a single line. The validation rule
now has one home; a future change to it lands in one place instead
of four.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified by removing the getScalarFloat64 branch in
addAPMServerMetricsToMap: the test failed with the missing
disk_usage_threshold_pct surfaced in the diff. Restoring the
branch passes again, locking in the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carsonip carsonip force-pushed the eager-monitoring-counters branch from 52d0f79 to 0fd598b Compare April 29, 2026 11:07
carsonip and others added 2 commits April 29, 2026 12:20
Three-metric input covering the int64 sum and float64 gauge cases
the function dispatches over. Benchmark is the baseline a follow-up
commit replaces strings.Split with strings.SplitAfterSeq against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
strings.Split allocates a []string per call to addAPMServerMetricsToMap;
strings.SplitAfterSeq returns an iterator. Each yielded segment carries
the trailing "." except the last, so strings.CutSuffix(seg, ".")
distinguishes intermediate (descend into nested map) from leaf (assign
the value) without buffering the whole path or carrying lookahead state.

benchstat (10 runs, ./internal/beatcmd):

                                │ split        │       splitafterseq                  │
                                │ sec/op       │   sec/op     vs base                 │
AddAPMServerMetricsToMap-16       1.175µ ± 1%   1.078µ ± 2%  -8.22%  (p=0.000 n=10)
AddAPMServerMetricsToMap-16   B   2.492Ki ± 0%  2.305Ki ± 0% -7.52%  (p=0.000 n=10)
AddAPMServerMetricsToMap-16   a   18.00 ± 0%    15.00 ± 0%  -16.67% (p=0.000 n=10)

The 3-allocation drop equals the metric count: one slice per metric
removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
carsonip and others added 3 commits April 29, 2026 14:18
Wraps a single Metrics() invocation in b.Loop() to measure the per-call
cost of the interceptor's hot path (3-4 m.inc() calls plus one
histogram record). Used as the baseline a follow-up commit replaces
sync.Map with a plain map against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Counters and the request.duration histogram are pre-populated in
Metrics() before the interceptor is returned; nothing creates or
mutates them at request time. Replace sync.Map (designed for concurrent
mutation) with a plain map[counterKey]Counter and a single histogram
field. Read-only after construction = safe for concurrent gRPC calls.

The struct counterKey{prefix, id} avoids the per-call "prefix +
string(id)" string concat that the sync.Map path needed for its key,
killing the two allocations every m.inc() did.

Drift becomes loud: a missing (prefix, id) entry now logs an error
instead of silently creating a counter on first use.

benchstat (10 runs, ./internal/beater/interceptors):

                 │  syncmap     │       plain map                       │
                 │  sec/op      │   sec/op     vs base                  │
Interceptor-16     738.1n ± 0%   405.2n ± 2%  -45.11% (p=0.000 n=10)
Interceptor-16  B  304.0 ± 0%    0.0 ± 0%   -100.00% (p=0.000 n=10)
Interceptor-16  a  9.000 ± 0%    0.000 ± 0% -100.00% (p=0.000 n=10)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the "Drift contract" prose at Interceptor() in favour of a one-
sentence rule: any counter or histogram used by Interceptor must be
eagerly created in Metrics(). The struct counterKey gets a one-liner
explaining why it's a struct: avoids per-call "prefix+id" string concat
and the alloc that comes with it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carsonip carsonip force-pushed the eager-monitoring-counters branch from 708cf41 to 6e18089 Compare April 29, 2026 13:22
Comment thread internal/beatcmd/beat.go Outdated
Comment on lines +296 to +297
diskThresholdGauge, _ := storageMeter.Float64Gauge("apm-server.sampling.tail.storage.disk_usage_threshold_pct")
diskThresholdGauge.Record(context.Background(), 0.8)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] new test that catches missing float reporting

Comment thread internal/beater/interceptors/metrics.go Outdated
// legacy "apm-server.otlp.grpc.<signal>." metric prefix. Single source of
// truth for both the per-call dispatch in Interceptor() and the eager
// registration loop in Metrics(); adding a new signal means one entry.
var otlpGRPCLegacyMetricsPrefixes = map[string]string{
Copy link
Copy Markdown
Member Author

@carsonip carsonip Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] Single source of truth for method name to metric prefix. A slight overhead in map lookup compared to switch case string comparison but negligible in the grand scheme of things

Comment on lines +57 to +60
type metricsInterceptor struct {
logger *logp.Logger
counters map[counterKey]metric.Int64Counter
requestDurationHist metric.Int64Histogram
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] sorry for the major refactor here. sync.Map was added in #16182 to workaround a concurrent map access. Now that we change the paradigm to always create up front, we can cut the sync.Map overhead and benefit from metric appearing in /stats from registration time

with this change, see benchstat

                 │ /tmp/syncmap.txt │          /tmp/plainmap.txt          │                                                                                                                                           
                 │      sec/op      │   sec/op     vs base                │
  Interceptor-16        738.1n ± 0%   405.2n ± 2%  -45.11% (p=0.000 n=10)                                                                                                                                             
                                                                     
                 │ /tmp/syncmap.txt │         /tmp/plainmap.txt         │                                                                                                                                             
                 │       B/op       │   B/op    vs base                 │
  Interceptor-16         304.0 ± 0%   0.0 ± 0%  -100.00% (p=0.000 n=10)                                                                                                                                               
                                                                                                                                                                                                                      
                 │ /tmp/syncmap.txt │          /tmp/plainmap.txt          │
                 │    allocs/op     │ allocs/op   vs base                 │                                                                                                                                           
  Interceptor-16         9.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)    

// counterKey is the map key for metricsInterceptor.counters. A struct key
// is used instead of the concatenated "prefix+id" string so the lookup at
// each m.inc() call avoids allocating a fresh string.
type counterKey struct {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] this cuts the string prefix and name concat alloc per inc call

Comment thread internal/beatcmd/beat.go
// SplitAfterSeq yields each segment with its trailing "." preserved,
// except the last. CutSuffix tells us which we're looking at: hasDot
// true => intermediate (descend), false => leaf (assign).
for seg := range strings.SplitAfterSeq(suffix, ".") {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] strings.Split -> strings.SplitAfterSeq perf gain

                              │   split.txt    │       splitafterseq.txt            │
                              │     sec/op     │   sec/op     vs base               │
  AddAPMServerMetricsToMap-16    1.175µ ± 1%    1.078µ ± 2%  -8.22%  (p=0.000 n=10)

                              │   split.txt    │       splitafterseq.txt            │
                              │      B/op      │     B/op      vs base              │
  AddAPMServerMetricsToMap-16    2.492Ki ± 0%   2.305Ki ± 0%  -7.52%  (p=0.000 n=10)

                              │   split.txt    │       splitafterseq.txt            │
                              │   allocs/op    │ allocs/op   vs base                │
  AddAPMServerMetricsToMap-16     18.00 ± 0%    15.00 ± 0%  -16.67%  (p=0.000 n=10)

carsonip and others added 2 commits April 29, 2026 14:40
Wraps a single Apply(MonitoringMiddleware, HandlerIdle) call in
b.Loop() to measure the per-request cost of the middleware's hot
path. Used as the baseline a follow-up commit replaces sync.Map with
an eager-only map against.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same refactor as elastic#20993 applied to the gRPC interceptor: every
counter is created up front in MonitoringMiddleware() and the
backing maps are read-only at request time, so sync.Map's
concurrent-mutation guarantees aren't needed. Drift becomes
runtime-loud: a missing entry is reported via logger.Error instead
of silently creating a counter on first use.

Two `map[request.ResultID]Counter` (one for "http.server.<id>",
one for "<legacyMetricsPrefix><id>") avoid the per-call
"prefix + string(id)" string concat, killing every per-request
allocation on the middleware hot path.

The MonitoringMiddleware constructor now takes a *logp.Logger so
inc() can report drift; the only production caller (api/mux.go's
apmMiddleware) already has a logger in scope.

benchstat (10 runs, ./internal/beater/middleware):

                          │ syncmap        │ plain map                          │
                          │ sec/op         │ sec/op       vs base               │
MonitoringMiddleware-16     906.6n ± 2%      432.2n ± 1%  -52.32% (p=0.000 n=10)
MonitoringMiddleware-16  B  320.0 ± 0%       0.0 ± 0%   -100.00% (p=0.000 n=10)
MonitoringMiddleware-16  a  11.00 ± 0%       0.00 ± 0%  -100.00% (p=0.000 n=10)

Existing TestMonitoringHandler (5 outcome subtests + Parallel race
test) covers all paths and passes unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +112 to +116
logger: logger,
legacyMetricsPrefix: legacyMetricsPrefix,
counters: sync.Map{},
histograms: sync.Map{},
serverCounters: serverCounters,
legacyCounters: legacyCounters,
requestDurationHist: requestDurationHist,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] this gives us

                            │ syncmap         │ plain map                          │                                                                                                                                  
                            │ sec/op          │ sec/op     vs base                 │                                                                                                                                  
  MonitoringMiddleware-16     906.6n ± 2%      432.2n ± 1%  -52.32% (p=0.000 n=10)                                                                                                                                    
                              320.0 ± 0%       0.0 ± 0%    -100.00% (p=0.000 n=10) [B/op]
                              11.00 ± 0%       0.00 ± 0%   -100.00% (p=0.000 n=10) [allocs]          

@carsonip carsonip force-pushed the eager-monitoring-counters branch 2 times, most recently from 5b05eae to fd48728 Compare April 29, 2026 13:51
The "monitoring counter not eagerly registered" log line meant well
but didn't tell the operator what to do with it. Prefix the message
with "BUG:" and link the issue tracker so a stack-monitoring user
seeing this in their logs knows it's an apm-server bug to file, not
something they configured wrong. Same change applied to the matching
inc() in interceptors/metrics.go and middleware/monitoring_middleware.go.

Doc comments on Interceptor / Middleware / inc() updated to call out
the contract violation as a bug rather than a generic "logs an error".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carsonip carsonip force-pushed the eager-monitoring-counters branch from fd48728 to f977af3 Compare April 29, 2026 13:53
carsonip and others added 5 commits April 29, 2026 14:55
The previous change loosened ExpectOtelMetrics so that any unlisted
gathered metric was implicitly treated as zero. That broke the
function's own contract: "fullMatch means full" -- a strict set
equality check should not silently allow extra metrics, even when
those extras happen to be zero.

Restore ExpectOtelMetrics to its original strict semantics: gathered
≡ expected via assert.ElementsMatch, with an "unexpected metric"
failure on any unlisted gathered metric. Tests that exercise eagerly
zero-initialized counters (HTTP middleware, gRPC interceptor)
bootstrap their expected map via the new monitoringtest.EagerCountersZeros
helper, which returns int64(0) for every (prefix, id) pair the test
declares the implementation to eagerly initialize. The eager-init
contract becomes part of the asserted state, visible at the test
site.

Add ExpectOtelMetricsNonZero for tests where listing every eager
zero is impractical (mux api tests, otlp tests, eventcounter test)
because the test exercises a much larger graph of eager-init
counters than it cares about. Semantics: every listed metric is
asserted exactly; every unlisted gathered metric must equal 0
(histograms exempt). The test signal -- "no surprise non-zero metric
leaked" -- is preserved without forcing the test to enumerate the
noise floor.

ExpectContainOtelMetrics is unchanged (subset semantic).

Internally, assertOtelMetrics now takes a matchMode enum
(matchModeFull / matchModeNonZero / matchModeContain) instead of a
fullMatch bool, so each public helper maps to one mode and the
behavior is grep-able from either side.

Test sites updated:
  - middleware monitoring_middleware_test.go: ExpectOtelMetrics + helper.
  - interceptors metrics_test.go: ExpectOtelMetrics + helper.
  - api/mux_*_test.go (4 files): ExpectOtelMetricsNonZero.
  - otlp http_test.go, grpc_test.go: ExpectOtelMetricsNonZero.
  - modelprocessor/eventcounter_test.go: ExpectOtelMetricsNonZero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…efixes

The previous change loosened ExpectOtelMetrics so that any unlisted
gathered metric was implicitly treated as zero. That broke the
function's own contract -- "fullMatch means full" should not silently
allow extra metrics, even when those extras happen to be zero.

Restore the strict semantics. ExpectOtelMetrics gains an optional
WithEagerPrefixes(prefixes ...string) variadic that declares prefixes
under which an unlisted gathered counter is tolerated *provided its
value is 0*; histograms and counters outside those prefixes still
trigger "unexpected metric". This is the feature middleware and
interceptor tests need: their eager-init footprint is bounded
(http.server.* / "" for the middleware; grpc.server.* and the three
otlp.grpc.<signal>.* prefixes for the interceptor).

Tests with an unbounded eager-init footprint -- mux api tests, otlp
tests, eventcounter test -- go back to ExpectContainOtelMetrics
(subset semantics). That is what they used before the lazy-counter
fix, and it remains the right contract: those tests instantiate the
full mux / OTLP server / processor which eagerly registers many
prefixes, and the test only cares about the counters its own code
path increments.

Drop the temporarily-introduced ExpectOtelMetricsNonZero -- nothing
uses it now.

scalarValue() is extracted from the original four type-switch arms
in assertOtelMetrics so the new tolerance branch can read isHistogram
without duplicating the check across three of the arms.
ElementsMatch is replaced with Subset on the inverse loop because
the tolerance branch lets foundMetrics legitimately contain entries
not in expectedMetrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added WithEagerPrefixes to monitoringtest so
ExpectOtelMetrics could tolerate unlisted-but-zero counters under
declared prefixes. Simpler: tests that want a strict full match just
build the full expected map themselves, including all eagerly-zero
entries, before calling ExpectOtelMetrics.

middleware/monitoring_middleware_test.go's TestMonitoringHandler and
interceptors/metrics_test.go's TestMetrics each construct the
prefix×AllResultIDs cross-product as int64(0) and override with the
non-zero values the handler/interceptor actually fires. No helper
options, no eager-prefix concept inside monitoringtest, no scalarValue
extraction, no ElementsMatch→Subset migration.

monitoringtest/opentelemetry.go is now byte-identical to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…source

The previous shape had three places to keep in sync:

  - the const block declaring each ResultID,
  - MapResultIDToStatus listing HTTP-status-mapped IDs,
  - AllResultIDs's hardcoded slice of "unconditional" IDs plus
    MapResultIDToStatus keys.

A new ResultID could land in the const block, get used by an inc()
call site, and quietly miss AllResultIDs (so its counter never
appeared in /stats from process start). The "drift contract" comment
relied on humans reading it.

Replace with a single resultIDStatus map keyed by ResultID. Each
entry's value is its Status (zero-valued for unconditional IDs).
init() iterates resultIDStatus once to populate AllResultIDs and the
HTTP-status-mapped subset MapResultIDToStatus. Adding a new ResultID
is one new entry there; both derived structures pick it up
automatically. Forgetting the resultIDStatus entry on a fresh const
surfaces at runtime as the "BUG: monitoring counter missing from
eager registration" log fired by inc(), which the strict
ExpectOtelMetrics tests fail on.

Also drop IDEventReceivedCount and IDEventDroppedCount: dead since
the Jaeger removal in elastic#14791, no callers in the codebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines -39 to -42
// IDEventReceivedCount identifies amount of received events
IDEventReceivedCount ResultID = "event.received.count"
// IDEventDroppedCount identifies amount of dropped events
IDEventDroppedCount ResultID = "event.dropped.count"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] according to Claude: IDEventReceivedCount / IDEventDroppedCount removed — dead consts since the Jaeger removal in #14791, no callers

// Status (every ID c.Result.ID can be set to). Populated by init().
var MapResultIDToStatus = map[ResultID]Status{}

func init() {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] sorry for major refactor here again. This is to allow eager registration to be aware of all the result IDs and make the code drift-proof.

Comment thread internal/beatcmd/beat.go
} else if v, ok := getScalarFloat64(m.Data); ok {
value = v
} else {
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a good idea to have some indication that these metrics are being dropped?

Copy link
Copy Markdown
Member Author

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO

  • double check if addDocappenderLibbeatOutputMetrics and addDocappenderLibbeatPipelineMetrics are working

carsonip and others added 7 commits April 30, 2026 12:24
addDocappenderLibbeatOutputMetrics and addDocappenderLibbeatPipelineMetrics
now accumulate values into locals and report every field unconditionally,
including write.bytes. The previous shape only reported a field when the
matching docappender OTel metric had data points, so the libbeat.output.*
and libbeat.pipeline.events.total fields were missing from /stats until
the first ES publish. The mapping-regeneration tooling reads a single
/stats snapshot, so absent fields silently drop from regenerated mappings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… zero

Records the contract that the docappender-derived libbeat.output.* and
libbeat.pipeline.events.total fields surface in /stats at zero before
any indexing happens. Five of the eight fields (events.{active, batches,
total}, write.bytes, pipeline.events.total) are unblocked by the
upstream Add(0) eager-init in elastic/go-docappender#317.

The test is skipped until apm-server's docappender go.mod entry is
bumped to a release that includes the eager-init change. It can be
validated locally before that release lands by adding a replace
directive in go.mod pointing at the docappender PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The guard suppressed libbeat.output.write.bytes whenever the docappender
flushed-bytes counter was 0 (process start, or any /stats snapshot taken
between flushes). Since writeBytes is initialised to 0 and assigned only
from a real elasticsearch.flushed.bytes data point, emitting it
unconditionally is faithful to the source: 0 means "0 bytes flushed",
not "value unknown".

Also extend TestLibbeatMetrics's first-snapshot assertion to include
output.write.bytes=0, which now appears before any flush completes.

This is independent of the docappender eager-init bump
(elastic/go-docappender#317): with the bump in place the counter still
reads 0 at startup, so the guard would still suppress write.bytes there
too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The libbeat.output.* and libbeat.pipeline.* sub-trees in /stats are
populated by per-scrape NewFunc callbacks that read docappender's OTel
ScopeMetrics. Pre-this-change, each field emitted only when its source
metric had at least one data point — so before docappender's first
publish, six fields were missing entirely (and three more were
suppressed by a now-removed writeBytes>0 guard). Downstream
mapping-regeneration tooling reads a single /stats snapshot, so absent
fields silently drop from the regenerated mappings.

Replace the per-call accumulate-then-emit shape with two persistent
state structs (libbeatOutputState, libbeatPipelineState) initialised
to zero at process start. Each NewFunc updates only the fields
docappender currently exports and then reports the full set:

- Eager: every field surfaces in /stats from the first scrape, before
  any indexing has happened, because the structs zero-init.
- No fabrication on missing scrapes: a rename or drop of a docappender
  instrument leaves the corresponding state field at its last-known
  value rather than snapping back to a fabricated 0 each call.
- Drift detection: TestLibbeatMetrics exercises real traffic and
  asserts specific cumulative values for every field, so a rename
  fails that test loudly.

The "type": "elasticsearch" string and the field emission are gated on
having ever observed the docappender scope, so console-output runs
keep an empty libbeat.output sub-tree as before.

Add TestLibbeatMetricsEagerZero to lock the eager-zero contract: build
a Beat with output.elasticsearch, create a docappender with no
traffic, and assert every field appears at zero in the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@carsonip carsonip marked this pull request as draft April 30, 2026 16:03
@elasticmachine
Copy link
Copy Markdown
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

History

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-active-all Automated backport with mergify to all the active branches

Projects

None yet

Development

Successfully merging this pull request may close these issues.

monitoring: Tail-based sampling disk_usage_threshold_pct stat silently dropped from /stats

3 participants