
Handle server startup time post-crashes with simple retry logic to find connection #32

@coltonbh

Description


The TeraChem server is unstable: the master (and only) process frequently crashes, leaving clients without a connection while Docker restarts the server. Occasionally workers on tcc pick up new tasks and try to submit them to a recently crashed server that hasn't restarted yet, resulting in failures for ostensibly good inputs.

I think the cleanest way to solve this is to add simple retry logic to the clients' `connect` function that spends maybe 10-30 seconds retrying the initial connection before raising an exception. That way, failed servers have a moment to restart and tasks continue to flow seamlessly without cascading failures. A sketch of the idea follows below.
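A minimal sketch of the retry loop, using a plain TCP socket as a stand-in for whatever call the client's `connect` method actually makes; the function name, parameters, and timing defaults here are illustrative, not the existing client API:

```python
import socket
import time


def connect_with_retry(host: str, port: int, timeout: float = 30.0, wait: float = 2.0) -> socket.socket:
    """Retry an initial TCP connection for up to `timeout` seconds before giving up.

    Standalone illustration only; the real change would live inside the
    client's existing `connect` method rather than in a free function.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Succeeds as soon as the server is back up and listening.
            return socket.create_connection((host, port), timeout=wait)
        except OSError:
            if time.monotonic() >= deadline:
                # Server never came back within the retry window; surface the failure.
                raise
            # Server is likely still restarting (e.g., docker restart); back off briefly.
            time.sleep(wait)
```

The same deadline-plus-sleep pattern could wrap whatever the client currently does on connect, so callers either get a working connection or a single exception after the retry window expires.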

This will also help circumvent race conditions at startup, where the worker needs the TeraChem image to start first before it can really accept tasks (we get this by coincidence right now because the worker image is larger than the TC image, so TC tends to start up first).

At a higher level, is using the server really worth all the additional overhead of its instabilities...?

Metadata



    Labels

    enhancement (New feature or request)
