
Redirect failed ingest node operations to a failure store when available #103481

Merged: 39 commits merged into elastic:main from data-stream-ingest-failure-redirect on Feb 5, 2024

Conversation

@jbaiera (Member) commented on Dec 14, 2023

This PR updates the ingest service to detect when a failed ingest document was bound for a data stream configured with a failure store. In that event, the service restores the document to its original state, annotates it with its failure information, and redirects it to the failure store of the data stream it was originally targeting.

Example run with a default pipeline and data stream:

PUT _ingest/pipeline/testpipeline
{
  "processors": [
    {
      "fail": {
        "message": "This test pipeline fails for all documents"
      }
    }
  ]
}

PUT _index_template/my_data_stream_template
{
  "index_patterns" : ["my_data_stream*"], 
  "data_stream": {
    "failure_store": true
  },
  "priority" : 1,
  "template": {
    "settings" : {
      "number_of_shards" : 1, 
      "index.default_pipeline": "testpipeline"
    }
  }
}

POST my_data_stream_1/_doc
{
  "key": "value",
  "@timestamp": "2023-12-14T12:00:00Z"
}
>>>
{
  "_index": ".fs-my_data_stream_1-2023.12.14-000001",
  "_id": "8es0aowBHIk1gE8HEwcs",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

POST .fs-my_data_stream_1-2023.12.14-000001/_search
>>>
{
  "took": 47,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".fs-my_data_stream_1-2023.12.14-000001",
        "_id": "8es0aowBHIk1gE8HEwcs",
        "_score": 1,
        "_source": {
          "@timestamp": "2023-12-14T21:20:46.566Z",
          "document": {
            "index": "my_data_stream_1",
            "source": {
              "@timestamp": "2023-12-14T12:00:00Z",
              "key": "value"
            }
          },
          "error": {
            "type": "fail_processor_exception",
            "message": "This test pipeline fails for all documents",
            "stack_trace": "org.elasticsearch.ingest.common.FailProcessorException: This test pipeline fails for all documents\n\tat org.elasticsearch.ingest.common@8.12.0-SNAPSHOT/org.elasticsearch.ingest.common.FailProcessor.execute(FailProcessor.java:41)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.CompoundProcessor.innerExecute(CompoundProcessor.java:165)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:141)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:129)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.IngestDocument.executePipeline(IngestDocument.java:867)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.IngestService.executePipeline(IngestService.java:1020)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.IngestService.executePipelines(IngestService.java:879)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.ingest.IngestService$1.doRun(IngestService.java:765)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)\n\tat org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
          }
        }
      }
    ]
  }
}
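
The failure store index name above (.fs-my_data_stream_1-2023.12.14-000001) is generated, so rather than guessing it, the parent data stream can be inspected first. A minimal sketch, assuming the failure store feature reports its indices in the GET data stream response:

GET _data_stream/my_data_stream_1

With the failure store enabled, the response should list the .fs-* failure index for the stream alongside its regular .ds-* backing indices.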

@elasticsearchmachine (Collaborator) commented: Hi @jbaiera, I've created a changelog YAML for you.

Comment on lines 799 to 823
// TODO: Should this be a harder backstop than an assert statement?
assert ia.isDataStreamRelated()
: "Attempting to write a document to a failure store but the targeted index is not a data stream";
// Resolve write index and get parent data stream to handle the case of dealing with an alias
String defaultWriteIndexName = ia.getWriteIndex().getName();
DataStream dataStream = metadata.getIndicesLookup().get(defaultWriteIndexName).getParentDataStream();
// TODO: Should this be a harder backstop than an assert statement?
assert dataStream.getFailureIndices().size() > 0
: "Attempting to write a document to a failure store but the target data stream does not have one enabled";
@jbaiera (Member, Author): These assertions: do we want them to be stronger checks that will trigger at runtime? I can imagine things going poorly if a failure document ends up somewhere it shouldn't.

Reply (Member): Hmm... I suppose this could happen if we weren't consistent with passing in the same Metadata everywhere and the index got removed from the metadata. That would definitely be a fun one, though. Since in that case, with asserts disabled, we'd just fail on the next line with an index-out-of-bounds exception, I think it'd be better to make this a real check so we can at least have a useful message. What do you think?
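
For illustration, a minimal sketch of what that hardened check could look like, based on the diff hunk quoted above (targetIndexName stands in for however the index abstraction was looked up, and the exception type and messages here are assumptions, not necessarily the merged code):

IndexAbstraction ia = metadata.getIndicesLookup().get(targetIndexName);
if (ia == null || ia.isDataStreamRelated() == false) {
    // Real runtime check instead of an assert, so a misrouted failure document
    // produces a useful message rather than a downstream NPE or bounds error.
    throw new IllegalStateException(
        "Attempting to write a document to a failure store but the targeted index ["
            + targetIndexName + "] is not a data stream"
    );
}
// Resolve the write index and its parent data stream to handle aliases
String defaultWriteIndexName = ia.getWriteIndex().getName();
DataStream dataStream = metadata.getIndicesLookup().get(defaultWriteIndexName).getParentDataStream();
if (dataStream == null || dataStream.getFailureIndices().isEmpty()) {
    throw new IllegalStateException(
        "Attempting to write a document to a failure store but the target data stream ["
            + defaultWriteIndexName + "] does not have one enabled"
    );
}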

@jbaiera jbaiera requested a review from dakrone January 9, 2024 18:40
@jbaiera jbaiera force-pushed the data-stream-ingest-failure-redirect branch from 9b5791b to cccef0a on January 10, 2024 19:48
@jbaiera jbaiera marked this pull request as ready for review January 10, 2024 19:49
@elasticsearchmachine added the Team:Data Management label (meta label for data/management team) on Jan 10, 2024
@elasticsearchmachine (Collaborator) commented: Pinging @elastic/es-data-management (Team:Data Management)

@dakrone (Member) left a comment: Thanks for working on this Jimmy, I left some comments.

@jbaiera jbaiera requested a review from dakrone January 29, 2024 19:41
@dakrone (Member) left a comment: LGTM, thanks for adding this Jimmy! I left two minor comments about potentially adding some unit tests, but otherwise looks good.

* @param targetIndexName the index that the document was targeting at the time of failure.
* @param e the failure encountered.
*/
public void markItemForFailureStore(int slot, String targetIndexName, Exception e) {
Contributor: If you go with a separate listener, it would be really cool if it captured the targetIndexName on construction rather than requiring it to be passed in here -- but I understand if that's not possible.

@jbaiera (Member, Author): This might be tricky, or I'm not fully understanding the suggestion. The target index name in this case is the index that we were targeting at the start of an ingest pipeline execution. Each recursive call to executePipelines can change this value, mostly via a reroute processor or a processor updating the index target, so we need to recapture the target index name on each top-level pipeline execution.

I'm going to take a crack at the listener refactor though. If it's a bit much I'll file a follow-up issue/PR for it.

@jbaiera (Member, Author): I attempted to unify all the composed logic in the service into one listener - it didn't go as cleanly as I had originally hoped. There are still a number of edge cases that would need to be solved, so I've shelved the refactor for now. Instead I'm going with a small change that lowers the repetition.
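
To make the shape of that discussion concrete, here is a minimal hypothetical sketch of the listener idea (the type and method names below are illustrative, not the actual IngestService API). The key point is that the target index is resolved lazily at failure time rather than captured at construction:

import java.util.function.Supplier;

// Hypothetical sketch, not the actual Elasticsearch types: a per-slot listener
// that defers target-index resolution instead of capturing it at construction.
final class FailureStoreRedirectListener {
    interface ItemFailureSink {
        void markItemForFailureStore(int slot, String targetIndexName, Exception e);
    }

    private final int slot;
    private final Supplier<String> currentTargetIndex;
    private final ItemFailureSink sink;

    FailureStoreRedirectListener(int slot, Supplier<String> currentTargetIndex, ItemFailureSink sink) {
        this.slot = slot;
        this.currentTargetIndex = currentTargetIndex;
        this.sink = sink;
    }

    void onFailure(Exception e) {
        // Resolve the target index lazily: a reroute processor (or a processor
        // that rewrites the index target) can change it on each top-level
        // pipeline execution, so a value captured at construction could be
        // stale by the time the failure fires.
        sink.markItemForFailureStore(slot, currentTargetIndex.get(), e);
    }
}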

@jbaiera jbaiera merged commit 9d3a645 into elastic:main Feb 5, 2024
@jbaiera jbaiera deleted the data-stream-ingest-failure-redirect branch February 5, 2024 19:37
@bvader commented May 19, 2025: Apologies, meant for docs PR

Labels: :Data Management/Data streams (data streams and their lifecycles), >feature, Team:Data Management (meta label for data/management team), v8.13.0