Skip to content

apm: Document sampling.tail.discard_on_write_failure config #1453

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

isaacaflores2
Copy link

@isaacaflores2 isaacaflores2 commented May 22, 2025

@@ -53,6 +53,10 @@ If a setting is not supported by {{ech}}, you will get an error message when you
Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
::::

### Version 9.1+ [ec_version_9_1]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config also applies to 8.19+ but I left it out based on @carsonip comment in another PR. Let me know if I should add 8.19+.

Copy link
Member

@carsonip carsonip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, a nit on config description. Please hold off from merging until 9.1 release

@@ -85,6 +85,18 @@ Policies map trace events to a sample rate. Each policy must specify a sample ra
| APM Server binary | `sampling.tail.policies` |
| Fleet-managed | `Policies` |

### Discard On Write Failure [sampling-tail-discard-on-write-failure-ref]

Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.
Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will be indexed regardless of the configured sample rate in policies, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces.

nit on description. Trying to make the implication clear. Feel free to change.

Copy link
Author

@isaacaflores2 isaacaflores2 May 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I added a note to specify we bypass sampling

@carsonip carsonip requested a review from colleenmcginnis May 22, 2025 09:38
@@ -146,7 +146,7 @@ Due to [OpenTelemetry tail-based sampling limitations](/solutions/observability/

Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.

In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, sampling will be bypassed.
In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one other place where the storage limit and sampling bypass was mentioned. Updated to describe the new behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Copy link
Contributor

@florent-leborgne florent-leborgne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition! I left some minor-ish styling suggestions to align the wording with our writing guidelines.

@@ -53,6 +53,10 @@ If a setting is not supported by {{ech}}, you will get an error message when you
Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
::::

### Version 9.1+ [ec_version_9_1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other version sections, we specify that These are all of the supported settings for this version:. If providing the full list may be out of scope of this PR, is it possible to at least outline the changes? I assume apm-server.sampling.tail.discard_on_write_failure is a newly supported setting, but are there more changes, if you know?

Suggested change
### Version 9.1+ [ec_version_9_1]
### Version 9.1+ [ec_version_9_1]
This {{stack}} version adds support for the following settings:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9.1 will have 2 more configs than 9.0. One mentioned here, another in #1269 . I agree that explicitly mentioning these are new configs on top of 9.0 would be useful.

On a side note as a heads-up, before we spend too much time polishing this doc, I'm also thinking removing this doc altogether since it isn't providing much value after being moved from cloud to apm: elastic/apm-server#13602

@@ -77,6 +77,11 @@ If a setting is not supported on {{ecloud}}, you will get an error message when
Some settings that could break your cluster if set incorrectly are blocklisted. The following settings are generally safe in cloud environments. For detailed information about APM settings, check the [APM documentation](/solutions/observability/apm/configure-apm-server.md).
::::

### Version 9.1+ [ec_version_9_1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as my previous comment.

Suggested change
### Version 9.1+ [ec_version_9_1]
### Version 9.1+ [ec_version_9_1]
This {{stack}} version adds support for the following settings:

### Version 9.1+ [ec_version_9_1]

`apm-server.sampling.tail.discard_on_write_failure`
: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Re-styling to present tense as per writing guidelines

### Version 9.1+ [ec_version_9_1]

`apm-server.sampling.tail.discard_on_write_failure`
: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
: Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Align with previous suggestion

@@ -85,6 +85,18 @@ Policies map trace events to a sample rate. Each policy must specify a sample ra
| APM Server binary | `sampling.tail.policies` |
| Fleet-managed | `Policies` |

### Discard On Write Failure [sampling-tail-discard-on-write-failure-ref]

Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Defines the indexing behavior when trace events fail to be written to storage (e.g. when the storage limit is reached). When set to `false`, traces will bypass sampling and always be indexed, significantly increasing the indexing load. When set to `true`, traces will be discarded, there will be data loss potentially resulting in broken traces. The default is `false`.
Defines the indexing behavior when trace events fail to be written to storage (for example, when the storage limit is reached). When set to `false`, traces bypass sampling and are always indexed, which significantly increases the indexing load. When set to `true`, traces are discarded, causing data loss which can result in broken traces. The default is `false`.

Align with previous suggestion

@@ -146,7 +146,7 @@ Due to [OpenTelemetry tail-based sampling limitations](/solutions/observability/

Tail-based sampling (TBS), by definition, requires storing events locally temporarily, such that they can be retrieved and forwarded when a sampling decision is made.

In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, sampling will be bypassed.
In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events will be indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.
In an APM Server implementation, the events are stored temporarily on disk instead of in memory for better scalability. Therefore, it requires local disk storage proportional to the APM event ingestion rate and additional memory to facilitate disk reads and writes. If the [storage limit](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-storage_limit-ref) is insufficient, trace events are indexed or discarded based on the [discard on write failure](/solutions/observability/apm/tail-based-sampling.md#sampling-tail-discard-on-write-failure-ref) configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants