[BUG] Flyte subworkflow nodes can be stuck even after referenced launchplans executions are complete #6873

@sshardool

Description

Flyte & Flytekit version

flyte: v1.16.3
flytekit: 1.16.12

Describe the bug

Summary
Flyte subworkflow nodes with reference launchplans can stay in the Running state long after the target execution has completed.

Behavior
The screenshots below show one such example, which consistently reproduces the issue.

  • The parent workflow triggers a reference launch plan execution at node n7 (another_sleepy_lp).
  • The child reference launch plan execution finishes in under 3.5 minutes.
  • The corresponding node n7 in the parent execution takes over 1 hour to complete.

Note that the problematic behavior is not seen for all workflows. It is highly dependent on the DAG structure and the run times of individual nodes in the DAG.

Expected behavior

The parent execution's reference launch plan node n7 should be updated within about 30 seconds of the child's completion (the workflow evaluation re-enqueue frequency).
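To put a number on the gap between observed and expected behavior, here is a rough calculation using only the figures quoted in this report. It assumes node n7 and the child execution start at roughly the same time, so the lag is simply the difference of the two durations. The helper names (`completion_lag`, `is_stuck`) are made up for illustration and are not part of Flyte or flytekit.

```python
from datetime import timedelta

# Figures taken from this report (bounds, not exact measurements).
CHILD_RUNTIME = timedelta(minutes=3, seconds=30)  # child finished in under 3.5 minutes
PARENT_NODE_RUNTIME = timedelta(hours=1)          # node n7 took over 1 hour
EXPECTED_MAX_LAG = timedelta(seconds=30)          # expected update delay after child completion


def completion_lag(child_runtime: timedelta, parent_node_runtime: timedelta) -> timedelta:
    """Time the parent node stayed Running after the child execution finished.

    Assumes both started at about the same time (an assumption, see above).
    """
    return parent_node_runtime - child_runtime


def is_stuck(lag: timedelta, expected_max_lag: timedelta = EXPECTED_MAX_LAG) -> bool:
    """True when the observed lag exceeds the expected update delay."""
    return lag > expected_max_lag


lag = completion_lag(CHILD_RUNTIME, PARENT_NODE_RUNTIME)
print(f"lag of at least {lag}, stuck: {is_stuck(lag)}")
```

With these bounds the node lags the child by at least 56.5 minutes, which is more than 100 times the expected 30-second delay.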

Additional context to reproduce

The workflows defined here can consistently reproduce the issue: https://github.com/sshardool/flyte/blob/4feb41fd0036dde78009bc9962ef269546e54f08/workflows/max-parallel/max_parallelism_workflows.py

Note that the structure of the DAGs is carefully crafted to trigger the issue, with nodes appearing in a very specific order.

Screenshots

Image Image Image

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Metadata

Labels: bug, untriaged

Status: Backlog