Improve State restart time on failure caused by server restart, by using activity heartbeat, when startToClose is large #551

@longquanzheng

Currently, when the iWF server restarts, an in-flight state API call fails, and the next attempt is not scheduled until the startToClose timeout has elapsed, plus the backoff retry interval.
If the startToClose timeout is very large (e.g. >10 mins), that wait is long. To avoid the unnecessary waiting, Temporal/Cadence has the concept of an "activity heartbeat", which tells the Temporal/Cadence server that the worker is still alive. If no heartbeat is received within the heartbeat timeout, Temporal/Cadence times out the current attempt right away and schedules the next one according to the backoff retry policy.

Note: this is also because Temporal/Cadence activity tasks/workers are "polling based". iWF tasks/workers are "pushing" based, so they don't have this issue.

Need to add a side thread (goroutine) in the activity code:

go func() {
      for {
            time.Sleep(10 * time.Minute)
            activity.RecordHeartbeat(ctx) // ctx is the activity context
      }
}()

^^ is simplified code. We also need to stop the goroutine when the activity finishes (using a Go channel and a timer), to avoid goroutine leaks.

Maybe make the 10 mins interval configurable.
