[BUG] Flyte subworkflow nodes can be stuck even after referenced launchplans executions are complete #6873
Description
Flyte & Flytekit version
flyte: v1.16.3
flytekit: 1.16.12
Describe the bug
Summary
Flyte subworkflow nodes with reference launchplans can stay in the Running state long after the target execution has completed.
Behavior
The screenshots below show one such example, which consistently reproduces the issue.
- The parent workflow triggers a reference launch plan execution at node n7-another_sleepy_lp.
- The child reference launch plan execution finishes in under 3.5 minutes.
- The corresponding node n7 in the parent execution takes over 1 hour to complete.
Note that the problematic behavior is not seen for all workflows; it depends heavily on the DAG structure and on the run times of the individual nodes in the DAG.
Expected behavior
The parent execution's reference launch plan node n7 should be updated within about 30 seconds of the child's completion (the workflow evaluation re-enqueue frequency).
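To make the gap between expected and observed behavior concrete, here is a minimal sketch that computes the delay between the child's completion and the parent node's completion and compares it to the 30-second window. The timestamps are illustrative values shaped like the numbers in this report (child done at 3.5 minutes, parent node done at 1 hour, which is a lower bound since the report says "over 1 hour"). The helper names `propagation_lag` and `is_stuck` are hypothetical and are not part of Flyte or flytekit.

```python
from datetime import datetime, timedelta

# Assumption: the ~30 second re-enqueue frequency quoted in this report.
EXPECTED_SYNC_WINDOW = timedelta(seconds=30)


def propagation_lag(child_finished_at: datetime,
                    parent_node_finished_at: datetime) -> timedelta:
    """Time between the child execution finishing and the parent node completing."""
    return parent_node_finished_at - child_finished_at


def is_stuck(child_finished_at: datetime,
             parent_node_finished_at: datetime,
             window: timedelta = EXPECTED_SYNC_WINDOW) -> bool:
    """True when the parent node lags the child by more than the expected window."""
    return propagation_lag(child_finished_at, parent_node_finished_at) > window


# Illustrative timestamps only; they are not taken from real execution logs.
start = datetime(2025, 1, 1, 12, 0, 0)
child_done = start + timedelta(minutes=3, seconds=30)
parent_node_done = start + timedelta(hours=1)

lag = propagation_lag(child_done, parent_node_done)
print(f"lag={lag}, stuck={is_stuck(child_done, parent_node_done)}")
```

With these sample values the parent node trails the child by more than 56 minutes, which is over a hundred times the expected window.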
Additional context to reproduce
The workflows defined here can consistently reproduce the issue: https://github.com/sshardool/flyte/blob/4feb41fd0036dde78009bc9962ef269546e54f08/workflows/max-parallel/max_parallelism_workflows.py
Note that the structure of the DAGs is carefully crafted to trigger the issue, with nodes appearing in a very specific order.
Screenshots
Are you sure this issue hasn't been raised already?
- Yes
Have you read the Code of Conduct?
- Yes