11 changes: 11 additions & 0 deletions explore-analyze/toc.yml
@@ -464,13 +464,24 @@ toc:
- file: workflows/use-cases/security.md
children:
- file: workflows/use-cases/security/automate-security-operations.md
children:
- file: workflows/use-cases/security/automate-security-operations/alert-triage-with-case.md
- file: workflows/use-cases/security/automate-security-operations/ai-driven-alert-triage.md
- file: workflows/use-cases/security/automate-security-operations/enrich-alert-with-threat-intel.md
- file: workflows/use-cases/security/manage-detection-rules.md
children:
- file: workflows/use-cases/security/manage-detection-rules/run-rules-on-demand.md
- file: workflows/use-cases/observability.md
children:
- file: workflows/use-cases/observability/root-cause-analysis.md
- file: workflows/use-cases/ai-augmented-workflows.md
children:
- file: workflows/use-cases/ai-augmented-workflows/classify-and-route-alerts.md
- file: workflows/authoring-techniques.md
children:
- file: workflows/authoring-techniques/use-yaml-editor.md
- file: workflows/authoring-techniques/pass-data-handle-errors.md
- file: workflows/authoring-techniques/compose-workflows.md
- file: workflows/authoring-techniques/human-in-the-loop.md
- file: workflows/authoring-techniques/monitor-workflows.md
- file: workflows/authoring-techniques/manage-workflows.md
1 change: 1 addition & 0 deletions explore-analyze/workflows/authoring-techniques.md
@@ -19,6 +19,7 @@ Techniques that apply across workflow types, regardless of which outcome you're

- [Use the YAML editor](/explore-analyze/workflows/authoring-techniques/use-yaml-editor.md): Author and run workflows in the YAML editor in {{kib}}.
- [Pass data and handle errors](/explore-analyze/workflows/authoring-techniques/pass-data-handle-errors.md): Move data between steps, use dynamic templating, and make workflows resilient with `on-failure`.
- [Compose workflows from reusable parts](/explore-analyze/workflows/authoring-techniques/compose-workflows.md): Decompose long workflows into reusable child workflows, design the input and output contract, and fan out with asynchronous composition.
- [Human-in-the-loop](/explore-analyze/workflows/authoring-techniques/human-in-the-loop.md): Pause a workflow for reviewer input and resume on their decision.
- [Monitor workflow execution](/explore-analyze/workflows/authoring-techniques/monitor-workflows.md): Track runs, review execution history, and troubleshoot failures.
- [Manage and organize workflows](/explore-analyze/workflows/authoring-techniques/manage-workflows.md): Find, edit, duplicate, enable, and disable workflows from the **Workflows** page.
156 changes: 156 additions & 0 deletions explore-analyze/workflows/authoring-techniques/compose-workflows.md
@@ -0,0 +1,156 @@
---
navigation_title: Compose workflows
applies_to:
stack: preview 9.4+
serverless: preview
description: Decompose long workflows into reusable child workflows. Design the input and output contract, test children in isolation, and fan out with asynchronous composition.
products:
- id: kibana
- id: cloud-serverless
- id: cloud-hosted
- id: cloud-enterprise
- id: cloud-kubernetes
- id: elastic-stack
---

# Compose workflows from reusable parts [workflows-compose-workflows]

Composition lets one workflow call another. Done well, it turns a sprawling multi-purpose workflow into a small, testable parent that delegates to focused children. This page covers the authoring decisions: when to extract a child workflow, how to design its input and output contract, how to test it in isolation, and how to fan out to background jobs.

:::{warning}
Composition steps (`workflow.execute`, `workflow.executeAsync`, `workflow.output`, `workflow.fail`) are in technical preview in 9.4. Use them for prototypes and reusable utility workflows. Hold off on critical paths until composition reaches GA.
:::

For the step parameter reference, refer to [Composition steps](/explore-analyze/workflows/steps/composition.md).

## When to extract a child workflow [workflows-compose-when]

Reach for composition when you notice any of the following:

| Signal | What to extract |
|---|---|
| You're copying the same five to ten steps across several workflows. | A shared child that owns the repeated sequence. |
| A workflow has grown long enough that you can't test it end to end. | Split it along a natural boundary (enrichment, notification, remediation). |
| Different teams own different parts of a process. | Give each team their own child workflow with a documented contract. |
| You need to fan out to N background jobs. | An asynchronous child invoked with `workflow.executeAsync`. |
| A step sequence is easier to reason about in isolation. | Extract it so you can read, test, and change it on its own. |

If a workflow is short, used once, and clear end to end, don't extract. Composition introduces a contract you have to maintain.

## Design the input and output contract [workflows-compose-contract]

A child workflow is a unit of code with a public interface. Treat it like one.

### Declare inputs and outputs at the top

Use the top-level `inputs` and `outputs` fields to spell out the contract. The engine validates inputs at invocation and outputs at `workflow.output` time, so callers can rely on shape without guarding against missing keys.

```yaml
name: shared--enrich-alerts
description: Enrich alerts with threat intel and geo data.

triggers:
- type: manual

inputs:
- name: alerts
type: object
required: true

outputs:
- name: enriched_alerts
type: object
- name: enrichment_stats
type: object

steps:
# ...enrichment logic...

- name: return_result
type: workflow.output
with:
enriched_alerts: "${{ steps.enrich.output }}"
enrichment_stats: "${{ steps.stats.output }}"
```

### Keep the shape small

A child with a dozen inputs and a deeply nested output is hard to call. When you notice the contract growing, ask whether the child is doing too much. Most shared children work best with one or two inputs (an object bag if you have several related fields) and two or three outputs.
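
As a sketch, several related fields can ride in one object-bag input so the contract stays at a single name; the field names inside the bag are illustrative:

```yaml
inputs:
  - name: alert_context   # one bag instead of separate alerts/index/severity inputs
    type: object
    required: true

# A caller then passes a single object (field names are illustrative):
#   inputs:
#     alert_context:
#       alerts: "{{ steps.find_alerts.output.hits.hits }}"
#       source_index: ".alerts-security.alerts-default"
#       severity_floor: "medium"
```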

### Name children so they're discoverable

A common convention: platform teams name shared workflows with a `shared--<verb>-<noun>` prefix (for example, `shared--enrich-alerts`, `shared--open-case`, `shared--notify-team`). Product teams then compose from those shared children into their own domain workflows. The prefix makes the shared library easy to scan in the **Workflows** list.

## Test children in isolation [workflows-compose-test]

Give every child a `manual` trigger so you can test it on its own. This is one of the main operational wins of composition: you can exercise a single piece of your automation without running the parent.

1. Open the child workflow in the [YAML editor](/explore-analyze/workflows/authoring-techniques/use-yaml-editor.md).
2. Use the **Run** action and provide the inputs that a parent would normally pass.
3. Verify the outputs match the declared schema. Mismatches are caught by the engine, so a failing schema validation points directly at a broken child.

Parents invoke children through `workflow.execute`, but triggers and invocation live independently. A child workflow is still a normal workflow: anything that can call it (another workflow, the UI, an API client) works the same way.
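
Wiring a parent to a tested child is then a single step. The following sketch assumes `workflow.execute` takes the same `workflow-id` and `inputs` shape as the `workflow.executeAsync` example later on this page, and that a prior `get_alerts` step supplies the data:

```yaml
- name: enrich
  type: workflow.execute
  with:
    workflow-id: "shared--enrich-alerts"
    inputs:
      alerts: "{{ steps.get_alerts.output }}"

# Later steps read the child's declared outputs, for example:
#   "{{ steps.enrich.output.enriched_alerts }}"
```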

## Choose synchronous or asynchronous composition [workflows-compose-sync-async]

| Use | When |
|---|---|
| [`workflow.execute`](/explore-analyze/workflows/steps/composition.md#workflow-execute) (synchronous) | The parent needs the child's result before it can continue. For example, enrich and then decide. |
| [`workflow.executeAsync`](/explore-analyze/workflows/steps/composition.md#workflow-executeasync) (asynchronous) | Fire and forget. Notifications, logging, and fan-out to background workers. |

Asynchronous composition is the primary fan-out primitive in workflows. Each call spawns an independent execution with its own execution log, retry policy, and observability. This is cleaner than any intra-execution concurrency construct and lets each background job fail or succeed without affecting its siblings.

### Fan out with `foreach` plus `workflow.executeAsync`

The canonical fan-out pattern is a `foreach` over a list of work items, each one invoking an asynchronous child:

```yaml
- name: fan_out_hosts
type: foreach
foreach: "${{ steps.find_hosts.output.hits.hits }}"
steps:
- name: spawn_handler
type: workflow.executeAsync
with:
workflow-id: "shared--handle-host"
inputs:
host_id: "{{ foreach.item._source.host.id }}"
correlation_id: "{{ execution.id }}"
```

The parent finishes quickly. The N child executions continue in the background. Pass `execution.id` (or a similar correlation token) as an input so you can link parent and child executions in your observability tooling.

## Guard against recursion [workflows-compose-recursion]

The execution engine enforces a maximum composition depth to prevent infinite recursion. A workflow cannot call itself directly, and a deep chain of parent-child-grandchild calls stops at the depth limit with a clear error.

If you need to guard against your own recursion (for example, a handler that could trigger itself again), read `execution.compositionDepth` inside the child and short-circuit when it exceeds what your design expects:

```yaml
- name: stop_if_nested
type: if
condition: "execution.compositionDepth > 2"
steps:
- name: abort
type: workflow.fail
with:
message: "Nested too deep. Intended max depth is 2."
reason: "max_depth_exceeded"
```

## Version and deprecate shared children [workflows-compose-versioning]

When a shared child changes shape (new required input, renamed output, breaking behavior), the safest path is to publish a new name rather than mutate the existing one.

- Keep `shared--enrich-alerts` stable for callers that depend on the current contract.
- Ship `shared--enrich-alerts-v2` with the new shape.
- Migrate callers one at a time and retire v1 when all are moved.

This discipline pays off quickly once more than one team depends on a shared workflow.
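
Migrating a caller is then a one-line change at the call site. This sketch assumes the `workflow.execute` shape from the composition steps reference and a hypothetical `get_alerts` step:

```yaml
- name: enrich
  type: workflow.execute
  with:
    workflow-id: "shared--enrich-alerts-v2"   # was: shared--enrich-alerts
    inputs:
      alerts: "{{ steps.get_alerts.output }}"
```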

## Related pages [workflows-compose-related]

- [Composition steps reference](/explore-analyze/workflows/steps/composition.md): Parameter shapes for `workflow.execute`, `workflow.executeAsync`, `workflow.output`, and `workflow.fail`.
- [Use the YAML editor](/explore-analyze/workflows/authoring-techniques/use-yaml-editor.md): How test runs work when you're iterating on a child workflow.
- [Pass data and handle errors](/explore-analyze/workflows/authoring-techniques/pass-data-handle-errors.md): `on-failure` interacts with `workflow.fail` in predictable ways.
- [Foreach step](/explore-analyze/workflows/steps/foreach.md): Pair with `workflow.executeAsync` for the fan-out pattern.
explore-analyze/workflows/authoring-techniques/pass-data-handle-errors.md
@@ -3,7 +3,7 @@
applies_to:
stack: preview 9.3, ga 9.4+
serverless: ga
description: Pass data between workflow steps with templating, reference inputs and constants, and handle step failures with retries and fallbacks.
description: Pass data between workflow steps with templating, reference inputs and constants, and handle step failures with retries, fallbacks, continue, and a cross-workflow error handler.
products:
- id: kibana
- id: cloud-serverless
@@ -51,15 +51,12 @@
user.id: "u-123"

- name: create_case_for_user
type: kibana.createCaseDefaultSpace
type: cases.createCase
with:
title: "Investigate user u-123"
description: "A case has been opened for user {{steps.find_user_by_id.output.hits.hits[0]._source.user.fullName}}."
description: "A case has been opened for user {{ steps.find_user_by_id.output.hits.hits[0]._source.user.fullName }}."
owner: "securitySolution"
tags: ["user-investigation"]
connector:
id: "none"
name: "none"
type: ".none"
```

In this example:
@@ -70,7 +67,15 @@

## Error handling [workflows-error-handling]

By default, if any step in a workflow fails, the entire workflow execution stops immediately. You can override this behavior using the `on-failure` block, which supports retry logic, fallback steps, and continuation options.
By default, if any step fails, the entire workflow execution stops immediately (the `abort` behavior). Override this with the `on-failure` block, which supports retry logic, fallback steps, and continuation. For failures that cross workflow boundaries, the [`workflows.failed` trigger](/explore-analyze/workflows/triggers/event-driven-triggers.md) lets a separate handler workflow react after another workflow has failed.

### Three layers of error handling [workflows-error-layers]

| Layer | What it controls | Use for |
|---|---|---|
| **Per-step** `on-failure` | What happens when one step fails. | Retry transient failures, continue past non-critical steps, or provide a fallback. |
| **Workflow-level** `settings.on-failure` | Default `on-failure` applied to every step. | A consistent global retry policy. |
| **Cross-workflow** [`workflows.failed` trigger](/explore-analyze/workflows/triggers/event-driven-triggers.md) | A separate handler workflow that runs after another workflow has failed. | Paging on-call, opening cases, central error reporting. |

### Configuration levels [workflows-on-failure-levels]

@@ -88,7 +93,7 @@
delay: "5s"
```

**Workflow-level** (configured under `settings`) - applies to all steps as the default error handling behavior:
**Workflow-level** (configured under `settings`) applies to all steps as the default error handling behavior:

```yaml
settings:
@@ -101,26 +106,35 @@
type: http
```

Precedence: per-step `on-failure` > workflow-level `settings.on-failure` > engine default (`abort`).

:::{note}
Step-level `on-failure` configuration always overrides workflow-level settings.
:::

### Retry [workflows-on-failure-retry]

Retries the failed step a configurable number of times, with an optional delay between attempts.
Retries the failed step a configurable number of times. The full shape accepts backoff strategy, maximum delay, jitter, and a KQL condition so you only retry on specific errors.

```yaml
on-failure:
retry:
max-attempts: 3 # Required, minimum 1 (for example, "1", "2", "5")
delay: "5s" # Optional, duration format (for example, "5s", "1m", "2h")
max-attempts: 5 # Total attempts, including the first. Required, minimum 1.
delay: "1s" # Base delay between attempts. Duration format, for example "5s", "1m".
strategy: exponential # "fixed" (default) or "exponential".
multiplier: 2 # Only used with strategy: exponential.
max-delay: "30s" # Ceiling on the delay between retries.
jitter: true # Add randomness to avoid thundering-herd retry storms.
condition: "steps.self.error.status : 429" # Optional KQL predicate over steps.self.error.
```

The workflow fails when all retries are exhausted.
`condition` is a KQL expression evaluated against `steps.self.error`. Use it to retry only on specific failure modes: for example, retry on HTTP 429s and 5xxs but not on 4xx client errors.

The workflow fails when all retries are exhausted, unless paired with `fallback` or `continue`.
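
For example, a condition covering 429s and 5xx errors could be written as a KQL disjunction; the exact fields available under `steps.self.error` are described in the reference:

```yaml
on-failure:
  retry:
    max-attempts: 3
    delay: "2s"
    strategy: exponential
    condition: "steps.self.error.status : 429 or steps.self.error.status >= 500"
```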

### Fallback [workflows-on-failure-fallback]

Executes alternative steps after the primary step fails and all retries are exhausted. In the following example, when the `delete_critical_document` step fails, the workflow executes two additional steps: one sends a Slack notification to devops-alerts using `{{workflow.name}}`, while the other logs the error details from the failed step using `{{steps.delete_critical_document.error}}`.
Runs alternative steps after the primary step fails and all retries are exhausted. In the following example, when the `delete_critical_document` step fails, the workflow runs two additional steps: one sends a Slack notification to devops-alerts using `{{workflow.name}}`, while the other logs the error details from the failed step using `{{steps.delete_critical_document.error}}`.

```yaml
on-failure:
  fallback:
    # Step types and connector wiring are illustrative.
    - name: notify_slack
      type: slack
      connector-id: devops-alerts
      with:
        message: "Workflow '{{workflow.name}}' failed while deleting a critical document."
    - name: log_error
      type: console
      with:
        message: "Error details: {{steps.delete_critical_document.error}}"
```
@@ -140,20 +154,24 @@

### Continue [workflows-on-failure-continue]

Continues workflow execution even if a step fails. The failure is recorded, but does not interrupt the workflow.
Continues workflow execution even if a step fails. The failure is recorded at `steps.<name>.error`, but the workflow moves on to the next step. Use this for non-critical steps whose failure shouldn't take down the whole workflow.

```yaml
on-failure:
continue: true
```

### Abort [workflows-on-failure-abort]

Stops the workflow. This is the default when no `on-failure` is configured, so you rarely need to write it explicitly. Use `abort` when a downstream step depends on this step's output and continuing makes no sense.

### Combining options [workflows-on-failure-combining]

You can combine multiple failure-handling options. They are processed in this order: retry → fallback → continue.

In the following example:
1. The step retries up to 2 times with a 1-second delay.
2. If all retries fail, the fallback steps execute.
2. If all retries fail, the fallback steps run.
3. The workflow continues regardless of the outcome.

```yaml
on-failure:
  retry:
    max-attempts: 2
    delay: "1s"
  fallback:
    # The step type here is illustrative.
    - name: log_failure
      type: console
      with:
        message: "Retries exhausted in workflow {{workflow.name}}."
  continue: true
```
@@ -179,9 +197,24 @@
### Restrictions [workflows-on-failure-restrictions]

- Flow-control steps (`if`, `foreach`) cannot have workflow-level `on-failure` configurations.
- Fallback steps execute only after all retries have been exhausted.
- Fallback steps run only after all retries have been exhausted.
- When combined, failure-handling options are processed in this order: retry → fallback → continue.

### Handle failures across workflows [workflows-cross-workflow-handler]

For production-critical workflows, the final layer is a separate handler workflow that fires when another workflow fails. The [`workflows.failed` trigger](/explore-analyze/workflows/triggers/event-driven-triggers.md) fires after a workflow execution reaches the `failed` terminal state, so you can build handlers that page on-call, open a case, or post to a dedicated index for workflow-failure observability. Refer to [Event-driven triggers](/explore-analyze/workflows/triggers/event-driven-triggers.md) for the trigger reference and examples.
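
As a minimal sketch, a handler might look like the following; the `event.*` field paths and the `console` step are illustrative assumptions, so check the trigger reference for the actual payload shape:

```yaml
name: handle-workflow-failures
description: Central handler that reacts when any workflow fails.

triggers:
  - type: workflows.failed

steps:
  - name: report
    type: console
    with:
      # Illustrative field paths; the real event payload is in the trigger reference.
      message: "Workflow {{ event.workflow.name }} failed: {{ event.execution.id }}"
```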

### Choose the right error-handling layer [workflows-error-layer-decision]

| Problem | Use |
|---|---|
| "This API is flaky and should retry automatically." | Per-step `on-failure: retry`. |
| "Every step in this workflow should get 2 retries by default." | Workflow-level `settings.on-failure: retry`. |
| "This step is nice-to-have, so don't fail the workflow if it dies." | Per-step `on-failure: continue`. |
| "Try the primary API, and if it fails, use the backup API." | Per-step `on-failure: fallback`. |
| "When a production workflow fails, page on-call and open a case." | A separate [`workflows.failed` handler workflow](/explore-analyze/workflows/triggers/event-driven-triggers.md). |
| "This workflow is critical and I want monitoring on its failure rate." | `workflows.failed` handler that writes to an index, plus your existing observability stack. |

## Dynamic values with templating [workflows-dynamic-values]

Workflows support dynamic values through template variables and template expressions.