apm: Document sampling.tail.discard_on_write_failure config #1453

isaacaflores2 · 2025-05-22T00:08:43Z

⚠️ HOLD FOR 9.1 ⚠️
Document sampling.tail.discard_on_write_failure config.

I sourced the config explanation from here. Please let me know if the description is incorrect or unclear in any way.

Updated pages can be found in the docs preview here:

Checklist

Wait for PR apm: Document sampling.tail.ttl config #1269 to be merged and incorporate changes.

Related issues

Part of elastic/apm-server#15330

isaacaflores2 · 2025-05-22T00:12:59Z

reference/apm/cloud/apm-settings.md

@@ -53,6 +53,10 @@ If a setting is not supported by {{ech}}, you will get an error message when you
 Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
 ::::

+### Version 9.1+ [ec_version_9_1]


This config also applies to 8.19+, but I left it out based on @carsonip's comment in another PR. Let me know if I should add 8.19+.

carsonip

LGTM, with a nit on the config description. Please hold off on merging until the 9.1 release.

carsonip · 2025-05-22T09:32:52Z

solutions/observability/apm/tail-based-sampling.md

@@ -85,6 +85,18 @@ Policies map trace events to a sample rate. Each policy must specify a sample ra
 | APM Server binary | `sampling.tail.policies` |
 | Fleet-managed | `Policies` |

+### Discard On Write Failure [sampling-tail-discard-on-write-failure-ref]
+
+Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.


Suggested change

-Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.
+Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed regardless of the configured sample rate in policies, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.

nit on description. Trying to make the implication clear. Feel free to change.

Thanks. I added a note to specify we bypass sampling
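
To make the two branches of that description concrete, here is a minimal Python sketch of the decision the setting controls. It is an illustration only, not APM Server code: `handle_trace_event`, `StorageWriteError`, and the `write_to_storage` callback are hypothetical names, and the only behavior modeled is what the description in this thread states (index on write failure when `false`, drop when `true`).

```python
from typing import Callable


class StorageWriteError(Exception):
    """Hypothetical error raised when an event cannot be written to local storage."""


def handle_trace_event(
    event: dict,
    write_to_storage: Callable[[dict], None],
    discard_on_write_failure: bool = False,  # the documented default is `false`
) -> str:
    """Return what happens to the event: "stored", "indexed", or "discarded"."""
    try:
        write_to_storage(event)
    except StorageWriteError:
        if discard_on_write_failure:
            # true: drop the event; this is data loss and can leave traces broken
            return "discarded"
        # false: bypass sampling and index the event, which raises indexing load
        return "indexed"
    # Write succeeded: the event waits in storage for a sampling decision
    return "stored"
```

Under these assumptions, a write that fails because the storage limit is reached yields `"indexed"` with the default and `"discarded"` when the flag is `True`; a successful write is unaffected by the flag.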

isaacaflores2 · 2025-05-22T20:48:37Z

solutions/observability/apm/transaction-sampling.md

@@ -146,7 +146,7 @@ Due to [OpenTelemetry tail-based sampling limitations](/solutions/observability/

 Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.

-In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, sampling will be bypassed.
+In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.


I found one other place where the storage limit and sampling bypass were mentioned. Updated to describe the new behavior.

nice catch!

florent-leborgne

Thanks for the addition! I left some minor-ish styling suggestions to align the wording with our writing guidelines.

florent-leborgne · 2025-05-26T07:20:23Z

reference/apm/cloud/apm-settings.md

@@ -53,6 +53,10 @@ If a setting is not supported by {{ech}}, you will get an error message when you
 Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
 ::::

+### Version 9.1+ [ec_version_9_1]


For other version sections, we specify that "These are all of the supported settings for this version:". If providing the full list is out of scope for this PR, is it possible to at least outline the changes? I assume `apm-server.sampling.tail.discard_on_write_failure` is a newly supported setting, but are there more changes, if you know?

Suggested change

-### Version 9.1+ [ec_version_9_1]
+### Version 9.1+ [ec_version_9_1]
+
+This {{stack}} version adds support for the following settings:

9.1 will have 2 more configs than 9.0: one mentioned here, another in #1269. I agree that explicitly mentioning these are new configs on top of 9.0 would be useful.

On a side note, as a heads-up before we spend too much time polishing this doc: I'm also thinking of removing this doc altogether, since it isn't providing much value after being moved from cloud to apm: elastic/apm-server#13602

florent-leborgne · 2025-05-26T07:22:22Z

solutions/observability/apm/configure-apm-server.md

@@ -77,6 +77,11 @@ If a setting is not supported on {{ecloud}}, you will get an error message when
 Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
 ::::

+### Version 9.1+ [ec_version_9_1]


Same as my previous comment.

Suggested change

-### Version 9.1+ [ec_version_9_1]
+### Version 9.1+ [ec_version_9_1]
+
+This {{stack}} version adds support for the following settings:

florent-leborgne · 2025-05-26T07:26:57Z

reference/apm/cloud/apm-settings.md

+### Version 9.1+ [ec_version_9_1]
+
+`apm-server.sampling.tail.discard_on_write_failure`
+:   Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`. 


Suggested change

-: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
+: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Re-styling to present tense as per writing guidelines

florent-leborgne · 2025-05-26T07:27:31Z

solutions/observability/apm/configure-apm-server.md

+### Version 9.1+ [ec_version_9_1]
+
+`apm-server.sampling.tail.discard_on_write_failure`
+:   Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.


Suggested change

-: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
+: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Align with previous suggestion

florent-leborgne · 2025-05-26T07:28:13Z

solutions/observability/apm/tail-based-sampling.md

@@ -85,6 +85,18 @@ Policies map trace events to a sample rate. Each policy must specify a sample ra
 | APM Server binary | `sampling.tail.policies` |
 | Fleet-managed | `Policies` |

+### Discard On Write Failure [sampling-tail-discard-on-write-failure-ref]
+
+Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.


Suggested change

-Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
+Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Align with previous suggestion

florent-leborgne · 2025-05-26T07:29:13Z

solutions/observability/apm/transaction-sampling.md

@@ -146,7 +146,7 @@ Due to [OpenTelemetry tail-based sampling limitations](/solutions/observability/

 Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.

-In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, sampling will be bypassed.
+In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.


Suggested change

-In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.
+In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events are indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.

apm: Document sampling.tail.discard_on_write_failure config

2ebbf56

isaacaflores2 requested review from a team as code owners May 22, 2025 00:08

github-actions bot deployed to docs-preview May 22, 2025 00:09

isaacaflores2 commented May 22, 2025

carsonip approved these changes May 22, 2025

carsonip requested a review from colleenmcginnis May 22, 2025 09:38

isaacaflores2 added 2 commits May 22, 2025 13:42

apm: specify sampling bypass when discard_on_write_failure is false

76fadaa

apm: add discard_on_write_failure note to transaction-sampling.md

ad5d6c0

github-actions bot deployed to docs-preview May 22, 2025 20:44

isaacaflores2 commented May 22, 2025

carsonip approved these changes May 23, 2025

This was referenced May 23, 2025

[APM] Support sampling discard_on_write_failure configuration in apm integration policy elastic/kibana#221441

Open

TBS: Document discard_on_write_failure + expose it to the APM Integration elastic/apm-server#15330

Open

florent-leborgne reviewed May 26, 2025

carsonip May 27, 2025
