Improve State restart time on failure caused by server restart, by using activity heartbeat, when startToClose is large

Currently, when the iWF server restarts, the in-flight state API call fails, and the next attempt only starts after the startToClose timeout plus the backoff retry interval.
If the startToClose timeout is very large (e.g. >10 mins), this wait is long. To avoid the unnecessary wait, we can use the Temporal/Cadence "activity heartbeat", which tells the Temporal/Cadence server that the worker is still alive. If no heartbeat is received within the heartbeat timeout, Temporal/Cadence times out the current attempt and schedules the next activity attempt right away, subject to the backoff retry policy.

Note: this is also because Temporal/Cadence activity tasks/workers are "polling based". iWF tasks/workers are "push based", so they don't have this issue.

We need to add a side thread (goroutine) in the activity code:
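```go
// Simplified: heartbeat in the background while the activity runs.
go func() {
	for {
		time.Sleep(10 * time.Minute)
		activity.RecordHeartbeat(ctx)
	}
}()
```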
The code above is simplified. We also need to stop the goroutine when the activity finishes (using a Go channel and a timer) to avoid goroutine leaks.

We could also make the 10-minute interval configurable.

Improve State restart time on failure caused by server restart, by using activity heartbeat, when startToClose is large #551
