Add documentation for failure stores. #1368
Conversation
TBD on recipes. Most links are not complete and need updating from "???".
> If you have a large number of existing data streams you may want an easier way to control if failures should be redirected. Instead of enabling the failure store using the [put data stream options](./failure-store.md) API, you can instead configure a set of patterns in the [cluster settings](./failure-store.md) which will enable the failure store feature by default.
>
> Configure a list of patterns using the `data_streams.failure_store.enabled` dynamic cluster setting. If a data stream matches a pattern in this setting and does not have the failure store explicitly disabled in its options, then the failure store will default to being enabled for that matching data stream.
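For concreteness, enabling this default would look something like the following, using the cluster settings API (the patterns here are made up for illustration, and the list-valued body shape is an assumption):

```console
PUT _cluster/settings
{
  "persistent": {
    // hypothetical patterns; the setting takes a list of data stream patterns
    "data_streams.failure_store.enabled": ["logs-*", "metrics-*"]
  }
}
```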
The documentation here should mention whether this setting applies retroactively to pre-existing data streams, or only on data stream creation.
I was wondering that as well when reading this.

> then the failure store will default to being enabled for that matching data stream

Makes it sound like it's not applying to existing data streams, just acting as a default.

> If you have a large number of existing data streams [...] you can instead configure a set of patterns in the cluster settings

Makes it sound like the setting is an alternative to enabling the failure store one by one via the DS options API if you have a large number of existing data streams.
Yeah, I tried to phrase this in a way that made it not seem like setting the property was the same as toggling the feature on permanently. Matching data streams only enable their failure stores so long as they are not explicitly disabled in the options, and only for as long as they match the setting.
I've put up an edit that should simplify the explanation a bit. I also added an example of the explicit disabling of the failure store overriding the cluster setting.
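Roughly, that override looks like this (a sketch based on the put data stream options API named above; the endpoint path, body shape, and data stream name are assumptions):

```console
PUT _data_stream/logs-my-app/_options
{
  "failure_store": {
    // explicit disable takes precedence over the cluster-setting default
    "enabled": false
  }
}
```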
> ### Add and remove from failure store [manage-failure-store-indices]
>
> Failure stores support adding and removing indices from them using the [modify data stream](./failure-store.md) API.
What happens if you add a failure store backing index that has incompatible mappings? Are we doing validation when adding a backing index, or would it fail at runtime? Not suggesting one way is better than the other, but maybe we should describe what happens here to set expectations.
There's no special handling for mappings when adding an index to the failure store. You could add a completely unrelated index to the failure store and we allow it. Indices that are added to a data stream can never be treated as a write index, so we're less worried about their mappings than when doing a rollover operation. Even if the failure store is empty and we add a random index, the failure store is still marked for lazy rollover and will create a write index on redirection.
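For reference, a sketch of adding an index to a failure store with the modify data stream API (the `failure_store` flag on the action is an assumption, as are the index and data stream names):

```console
POST _data_stream/_modify
{
  "actions": [
    {
      "add_backing_index": {
        "data_stream": "my-datastream",
        "index": "restored-failures-000001",
        // assumed flag to target the failure store instead of the backing indices
        "failure_store": true
      }
    }
  ]
}
```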
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
This is looking good so far! I have one drive-by comment.
> # Failure store [failure-store]
>
> Failure stores are a secondary set of indices inside a data stream dedicated to storing failed documents. Failed documents are any documents that cause ingest pipeline exceptions or have a structure that conflicts with a data stream's mappings. These failures normally cause the indexing operation to fail, returning the error message in the response.
Nit: Maybe replace "These failures normally cause" with something like "Without the failure store, these failures would cause". I don't think it's obvious from your phrasing what "normally" means — like, maybe you're explaining the behaviour with the failure store, since this is a page about failure store? — and it's better to be explicit. (I am aware of this because I've confused people before by saying things like this!)
Hi @jbaiera, I noticed a small issue with this example:
> Here we have a bulk operation that sends two documents. Both are writing to the `id` field which is mapped as a `long` field type. The first document will be accepted, but the second document would cause a failure because the value `invalid_text` cannot be parsed as a `long`. This second document will be redirected to the failure store:
>
> ```console
> POST my-datastream/_bulk
> ```
Should be `POST my-datastream-new/_bulk` to match the data stream template at the top of the page.
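For anyone following along, a completed version of that request, reconstructed from the surrounding description (the document bodies are assumptions), might look like:

```console
// hypothetical documents matching the description above
POST my-datastream-new/_bulk
{ "create": {} }
{ "@timestamp": "2025-01-01T00:00:00Z", "id": 42 }
{ "create": {} }
{ "@timestamp": "2025-01-01T00:00:01Z", "id": "invalid_text" }
```

The first document indexes normally; the second fails to parse `id` as a `long` and, with the failure store enabled, is redirected instead of rejected.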
Suggested change:

> A failure store is a secondary set of indices inside a data stream, dedicated to storing failed documents. A failed document is any document that, without the failure store enabled, would cause an ingest pipeline exception or that has a structure that conflicts with a data stream's mappings. In the absence of the failure store, a failed document would cause the indexing operation to fail, with an error message returned in the operation response.
I took a stab at rephrasing to incorporate Pete's comment above, and also to introduce the terms "failure store" and "failed document" as singular rather than plural, which is our usual convention, but it's just a suggestion. I think the original works nicely too.
> ### Set up for new data streams [set-up-failure-store-new]
>
> You can specify on a data stream's template if it should enable the failure store when it is first created. The `data_stream_options` field in a [template](../templates.md) contains the settings required to enable a data stream's failure store.
Suggested change:

> You can specify in a data stream's [index template](../templates.md) if it should enable the failure store when it is first created.
I'd put the link on the first instance of "template" and then we can remove the second sentence, since I think it's pretty similar to what you have above the example. Also, maybe "index template" is clearer, though I know it can also be a component template.
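A minimal sketch of such a template, assuming `data_stream_options` sits in the template body alongside settings and mappings (the template name and pattern are hypothetical):

```console
PUT _index_template/my-datastream-template
{
  "index_patterns": ["my-datastream*"],
  "data_stream": {},
  "template": {
    // assumed placement of data_stream_options in the template body
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    }
  }
}
```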
> ### Set up for existing data streams [set-up-failure-store-existing]
>
> Enabling the failure store via [index templates](../templates.md) can only affect data streams that are newly created. Existing data streams that use a template will not apply any changes to the template's `data_stream_options` after they have been created.
Suggested change:

> Enabling the failure store via [index templates](../templates.md) can only affect data streams that are newly created. Existing data streams that use a template are not affected by changes to the template's `data_stream_options` field.
Just a tweak. :-)
> ## Set up a data stream failure store [set-up-failure-store]
>
> Each data stream has its own failure store that can be enabled to accept failures. By default, this failure store is disabled and any ingestion problems are raised in the response to write operations.
Suggested change:

> Each data stream has its own failure store that can be enabled to accept failed documents. By default, this failure store is disabled and any ingestion problems are raised in the response to write operations.
Just for consistency with some of the other text.
> 1. The failure store option will now be enabled.
>
> The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until removed by manual deletion or by retention.
Suggested change:

> The failure store redirection can be disabled using this API as well. When the failure store is deactivated, only failed document redirection is halted. Any existing failure data in the data stream will remain until removed by manual deletion or by an expired data retention setting.
Just a tweak, but please ignore if I've got it wrong.
> Once a failure store is enabled for a data stream it will begin redirecting documents that fail due to common ingestion problems instead of returning errors in write operations. Clients are notified in a non-intrusive way when a document is redirected to the failure store.
>
> Each data stream's failure store is made up of a list of indices that are dedicated to storing failed documents. These failure indices function much like a data stream's normal backing indices: There is a write index that accepts failed documents, they can be rolled over, and are automatically cleaned up over time subject to a lifecycle policy. Failure indices are lazily created the first time they are needed to store a failed document.
Suggested change:

> Each data stream's failure store is made up of a list of indices that are dedicated to storing failed documents. These failure indices function much like a data stream's normal backing indices: There is a write index that accepts failed documents, the indices can be rolled over, and they're automatically cleaned up over time subject to a lifecycle policy. Failure indices are lazily created the first time they are needed to store a failed document.
Small tweak.
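On the topic of failure indices behaving like backing indices: assuming the `::failures` selector applies to them, querying the failure store directly might look like this (the data stream name is hypothetical):

```console
// assumed ::failures selector to search the failure indices rather than the backing indices
GET my-datastream::failures/_search
{
  "query": {
    "match_all": {}
  }
}
```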
> ::::
>
> ::::{step} Create new rule
> Navigate to Management / Alerts and Insights / Rules. Create a new rule. Choose the Elasticsearch query option.
Suggested change:

> Navigate to Management / Alerts and Insights / Rules. Create a new rule. Choose the {{es}} query option.
> 1. The response code is 200 OK, and the response body does not report any errors encountered.
Suggested change:

> 1. The response code is `200 OK`, and the response body does not report any errors encountered.
> 1. The failure is returned to the client as normal when the failure store is not enabled.
> 2. The response is annotated with a flag indicating the failure store could have accepted the document, but it was not enabled.
> 3. Status of 400 Bad Request due to the mapping problem.
Suggested change:

> 3. The response status is `400 Bad Request` due to the mapping problem.
(just for consistency with the previous example)
> 2. The document could not be redirected because the failure store was not able to accept writes at this time due to an unforeseeable issue.
> 3. The complete exception tree is present on the response.
> 4. The response is annotated with a flag indicating the failure store would have accepted the document, but it was not able to.
> 5. Status of 400 Bad Request due to the original mapping problem.
Suggested change:

> 5. The response status is `400 Bad Request` due to the original mapping problem.
> 1. The document belongs to a failure store index on the data stream.
> 2. The failure document timestamp is when the failure occurred in {{es}}.
> 3. The document that was sent is captured inside the failure document. Failure documents capture the id of the document at time of failure, along with which data stream the document was being written to, and the contents of the document. The `document.source` fields are unmapped to ensure failures are always captured.
Suggested change:

> 3. The document that was sent is captured inside the failure document. Failure documents capture the ID of the document at time of failure, along with which data stream the document was being written to, and the contents of the document. The `document.source` fields are unmapped to ensure failures are always captured.
small nit
> : (`object`) The document at time of failure. If the document failed in an ingest pipeline, then the document will be the unprocessed version of the document as it arrived in the original indexing request. If the document failed due to a mapping issue, then the document will be as it was after any ingest pipelines were applied to it.
>
> `document.id`
> : (`keyword`) The id of the original document at the time of failure.
Suggested change:

> : (`keyword`) The ID of the original document at the time of failure.
> : (`text`) A compressed stack trace from {{es}} for the failure.
>
> `error.type`
> : (`keyword`) The type classification of failure. Values are the same type returned within failed indexing API responses.
Suggested change:

> : (`keyword`) The type classification of the failure. Values are the same type returned within failed indexing API responses.
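Putting those field definitions together, a failure document might look roughly like the following sketch (the values and the exact field set are assumptions based on the definitions quoted above):

```json
{
  "@timestamp": "2025-01-01T00:00:00.000Z",
  "document": {
    "id": "1",
    "source": {
      "@timestamp": "2025-01-01T00:00:00.000Z",
      "id": "invalid_text"
    }
  },
  "error": {
    "type": "document_parsing_exception",
    "stack_trace": "..."
  }
}
```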
> We can see that the document failed on the second processor in the pipeline. The first processor would have added a `@timestamp` field. Since the pipeline failed, we find that it has no `@timestamp` field added because it did not save any changes from before the pipeline failed.
>
> The second place failures can occur is during indexing. After the documents have been processed by any applicable pipelines, they are parsed using the index mappings before being indexed into the shard. If a document is sent to the failure store due to a failure in this process, then it will be stored as it was after any ingestion had occurred. This is because the original document is overwritten by the ingest pipeline changes by this point. This has the benefit of being able to see what the document looked like during the mapping and indexing phase of the write operation.
Suggested change:

> The second time when failures can occur is during indexing. After the documents have been processed by any applicable pipelines, they are parsed using the index mappings before being indexed into the shard. If a document is sent to the failure store due to a failure in this process, then it will be stored as it was after any ingestion had occurred. This is because, by this point, the original document has already been overwritten by the ingest pipeline changes. This has the benefit of allowing you to see what the document looked like during the mapping and indexing phase of the write operation.
Small tweak.
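To make the pipeline failure case concrete, here is a sketch of a two-processor pipeline like the one described (the pipeline name and processor configuration are hypothetical): the first processor sets `@timestamp`, and the second fails when the `message` field is missing, discarding the first processor's change along with it.

```console
PUT _ingest/pipeline/my-failing-pipeline
{
  "processors": [
    {
      "set": {
        "field": "@timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      // fails for documents that have no `message` field
      "rename": {
        "field": "message",
        "target_field": "event.original"
      }
    }
  ]
}
```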
Adds a new section to the documentation to explain new failure store functionality.
Preview:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1368/manage-data/data-store/data-streams/failure-store
Work in progress: