Description
When a crash occurs in the native message loop (network thread run by fdb_c.dll), the .NET process can become stuck because all pending Tasks on transactions will never complete.
An example of such a case is a very rare AccessViolationException (seen from time to time on our internal Build server) that crashes the network thread. The build process then hangs and must be terminated manually.
The callstack of the crash looks like this:
[Something.Something.FooBarTest.Test_FooBar] [Test Error Output]
Unhandled exception in remote appdomain: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at FoundationDB.Client.Native.FdbNative.NativeMethods.fdb_run_network()
at FoundationDB.Client.Native.FdbNative.RunNetwork()
at FoundationDB.Client.Fdb.EventLoop()
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
If the network thread is not running, no more callbacks will run, including the timeout callbacks. This means that any Task pending on a transaction will never complete, even if a timeout was specified on the transaction. The only way for it to stop would be if the transaction was created with a host-provided cancellation token (e.g. a per-request CancellationToken in ASP.NET). If it was created with CancellationToken.None, then the task is dead.
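To make the failure mode concrete, here is a minimal sketch that simulates it without the native library. `PendingCallbackDemo`, `StartOperation` and `CompletesWithin` are hypothetical names invented for this illustration and are not part of the binding; the `TaskCompletionSource` stands in for a future whose completion callback would normally be invoked by the network thread.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class PendingCallbackDemo
{
    // Hypothetical stand-in for an operation completed by a native callback.
    // Nothing ever calls TrySetResult here, which simulates a dead network thread.
    public static Task<int> StartOperation(CancellationToken ct)
    {
        var tcs = new TaskCompletionSource<int>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        if (ct.CanBeCanceled)
        {
            // The only remaining way out: the caller's token cancels the task.
            ct.Register(() => tcs.TrySetCanceled(ct));
        }
        return tcs.Task;
    }

    // Returns true if 'task' finished (in any state) before 'delay' elapsed.
    public static async Task<bool> CompletesWithin(Task task, TimeSpan delay)
    {
        var winner = await Task.WhenAny(task, Task.Delay(delay));
        return ReferenceEquals(winner, task);
    }
}
```

With `CancellationToken.None`, the task returned by `StartOperation` stays pending forever, so `CompletesWithin` reports false for any delay. With a token from a `CancellationTokenSource`, calling `Cancel()` moves the task to the canceled state even though the simulated callback never fires.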
Questions:
First, is it safe to restart the network thread? Would it pick up where it left off and remember all pending callbacks? If it already crashed with an AccessViolationException, restarting does not look very safe. It seems like we should abandon the process ASAP.
Second, if we can't restart the network thread and must stop the process, how can we ensure that all pending tasks will abort even if their callbacks never fire? One solution would be to trigger the internal CancellationTokenSource of each FdbDatabase instance, which means that the binding would need to keep a list of all these instances. Or should we have a single master CancellationTokenSource that is linked with the tokens of each FdbDatabase?
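The second option could look roughly like the sketch below. It is only an illustration of the linked-token idea: `NetworkThreadGuard` and `DatabaseScope` are hypothetical types, not the binding's actual classes, and how the binding would detect that the network thread died is left out.

```csharp
using System;
using System.Threading;

// Hypothetical process-wide switch, triggered when the network thread dies.
public static class NetworkThreadGuard
{
    private static readonly CancellationTokenSource s_master =
        new CancellationTokenSource();

    public static CancellationToken Token => s_master.Token;

    // Would be called by whatever code observes the network thread
    // exiting abnormally.
    public static void SignalNetworkThreadDied() => s_master.Cancel();
}

// Hypothetical stand-in for the per-database cancellation state;
// this is NOT the real FdbDatabase.
public sealed class DatabaseScope : IDisposable
{
    private readonly CancellationTokenSource _cts;

    public DatabaseScope(CancellationToken hostToken = default)
    {
        // Fires when either the master token or the host token fires.
        _cts = CancellationTokenSource.CreateLinkedTokenSource(
            NetworkThreadGuard.Token, hostToken);
    }

    public CancellationToken Token => _cts.Token;

    // Disposing the linked source also unregisters it from the master token.
    public void Dispose() => _cts.Dispose();
}
```

Compared with keeping an explicit list of every database instance, linking pushes the bookkeeping into `CancellationTokenSource` itself: each linked source registers with the master token on creation and unregisters on dispose, so a single `Cancel()` on the master reaches every live instance.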