
Handle server startup time post-crashes with simple retry logic to find connection #32

@coltonbh

Description


The TeraChem server is unstable: the master (and only) process frequently crashes, leaving clients without a connection while Docker restarts the server. Occasionally workers on tcc pick up new tasks and try to submit them to a recently crashed server that hasn't restarted yet, resulting in failures for ostensibly good inputs.

I think the cleanest way to solve this is to add simple retry logic to the clients' `connect` function that spends maybe 10-30 seconds retrying the initial connection before raising an exception. That way, failed servers have a moment to restart and tasks continue to flow seamlessly without cascading failures. A sketch of the idea follows below.
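A minimal sketch of the retry loop, using a plain TCP socket as a stand-in for whatever call the client's `connect` method actually makes; the function name, parameters, and timing defaults here are illustrative, not the existing client API:

```python
import socket
import time


def connect_with_retry(host: str, port: int, timeout: float = 30.0, wait: float = 2.0) -> socket.socket:
    """Retry an initial TCP connection for up to `timeout` seconds before giving up.

    Standalone illustration only; the real change would live inside the
    client's existing `connect` method rather than in a free function.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Succeeds as soon as the server is back up and listening.
            return socket.create_connection((host, port), timeout=wait)
        except OSError:
            if time.monotonic() >= deadline:
                # Server never came back within the retry window; surface the failure.
                raise
            # Server is likely still restarting (e.g., docker restart); back off briefly.
            time.sleep(wait)
```

The same deadline-plus-sleep pattern could wrap whatever the client currently does on connect, so callers either get a working connection or a single exception after the retry window expires.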

This will also help circumvent race conditions at startup, where the worker needs the TeraChem image to start first before it can really accept tasks (we get this by coincidence right now because the worker image is larger than the TC image, so TC tends to start up first).

At a higher level, is using the server really worth all the additional overhead of its instabilities...?

Metadata



    Labels

    enhancement (New feature or request)
