
Fix #4382: Fix thread leak in WSTP by replacing LinkedTransferQueue with LinkedBlockingDeque #4388


Draft · wants to merge 5 commits into base: series/3.6.x

Conversation

pantShrey

Description:
This PR fixes Issue #4382 by addressing a thread leak in the Work Stealing Thread Pool (WSTP) caused by the FIFO behavior of LinkedTransferQueue (introduced in #4295). The change switches to LinkedBlockingDeque with LIFO ordering, using offerFirst and pollFirst to prioritize newer threads for reuse, allowing older threads to time out and exit under high load.

Changes:

  • Replace LinkedTransferQueue with LinkedBlockingDeque in WorkStealingThreadPool.scala for LIFO behavior.
  • Update WorkerThread.run to use offerFirst instead of tryTransfer.
  • Adjust WorkerThread initialization to use pollFirst for retrieving cached threads, ensuring LIFO reuse (see the sketch after this list).
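
A minimal sketch of the caching pattern this change moves to (names like Worker, park, and unpark are simplified for illustration, not the actual WorkerThread API):

import java.util.concurrent.{LinkedBlockingDeque, TimeUnit}

object LifoCacheSketch {
  final class Worker(val id: Int)

  // Unbounded deque used as a LIFO stack of parked worker threads.
  val cachedThreads = new LinkedBlockingDeque[Worker]

  // A worker finishing a blocking region parks itself at the front,
  // so the most recently parked worker is always reused first.
  def park(w: Worker, timeout: Long, unit: TimeUnit): Boolean =
    cachedThreads.offerFirst(w, timeout, unit)

  // Retrieving from the front gives LIFO reuse; workers stuck at the
  // back of the deque eventually hit their keep-alive and exit.
  def unpark(): Option[Worker] =
    Option(cachedThreads.pollFirst())
}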

@@ -732,7 +732,7 @@ private[effect] final class WorkerThread[P <: AnyRef](
         // by another thread in the future.
         val len = runtimeBlockingExpiration.length
         val unit = runtimeBlockingExpiration.unit
-        if (pool.cachedThreads.tryTransfer(this, len, unit)) {
+        if (pool.cachedThreads.offerFirst(this, len, unit)) {
Member

I think that this will always succeed immediately:

> Inserts the specified element at the front of this deque, waiting up to the specified wait time if necessary for space to become available.

Because:

> The capacity, if unspecified, is equal to Integer.MAX_VALUE

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/LinkedBlockingDeque.html
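
A quick standalone check of this point (not part of the PR): with the default capacity of Integer.MAX_VALUE, the timed offerFirst never has to wait for space, so it returns true immediately regardless of the timeout.

import java.util.concurrent.{LinkedBlockingDeque, TimeUnit}

object OfferFirstDemo extends App {
  val deque = new LinkedBlockingDeque[String] // capacity defaults to Integer.MAX_VALUE
  val start = System.nanoTime()
  // Even with a 60 second timeout, this returns immediately.
  val offered = deque.offerFirst("worker", 60, TimeUnit.SECONDS)
  val elapsedMs = (System.nanoTime() - start) / 1000000
  println(s"offered=$offered after ${elapsedMs}ms") // offered=true after ~0ms
}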

-  private[unsafe] val cachedThreads: LinkedTransferQueue[WorkerThread[P]] =
-    new LinkedTransferQueue
+  private[unsafe] val cachedThreads: LinkedBlockingDeque[WorkerThread[P]] =
+    new LinkedBlockingDeque
Member

To replicate the old behavior, I think we need to specify a capacity of 0 in the constructor (essentially, a synchronous deque). I'm not entirely certain if that's supported.

Author
@pantShrey Apr 26, 2025

> To replicate the old behavior, I think we need to specify a capacity of 0 in the constructor (essentially, a synchronous deque). I'm not entirely certain if that's supported.

A capacity of 0 is not allowed, because:

> IllegalArgumentException - if capacity is less than 1

The minimum we can go is 1. Would that work?
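
For reference, a standalone check of the capacity-1 behavior (not part of the PR): the constructor rejects 0, and with capacity 1 the first offerFirst still succeeds even when no consumer is waiting, so it is not a true synchronous handoff the way tryTransfer is.

import java.util.concurrent.{LinkedBlockingDeque, TimeUnit}

object CapacityDemo extends App {
  // new LinkedBlockingDeque[String](0) // throws IllegalArgumentException
  val deque = new LinkedBlockingDeque[String](1)
  // Succeeds immediately with no consumer present, unlike tryTransfer.
  println(deque.offerFirst("a", 10, TimeUnit.MILLISECONDS)) // true
  // The deque is now full, so this times out and returns false.
  println(deque.offerFirst("b", 10, TimeUnit.MILLISECONDS)) // false
}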

@armanbilge
Member

> allowing older threads to time out and exit under high load.

This is not quite correct. Under high load (i.e., lots of blocking tasks), we want the threads to persist so we can reuse them as much as possible.

It's under lower load, when there are more cached threads than blocking tasks, that we would like the older threads to time out and exit.

@armanbilge
Member

I have a new idea for how to fix this:

  1. We should have a pool-level SynchronousQueue[TransferState].
  2. When a thread transitions to blocking, it should offer its state to the queue. If this fails, it can start a new worker thread to replace itself.
  3. When a thread transitions to cached, it can poll up to the timeout for a new state.
  4. Meanwhile, we can use a ConcurrentHashMap to keep track of blocker threads.

Although officially unspecified, it turns out that the non-fair implementation of SynchronousQueue in the JDK uses a LIFO stack. While we probably don't want to rely on this in the long term, I propose that it's good enough to fix the bug for now.
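
A rough sketch of this design (type and method names here are hypothetical, not the actual cats-effect internals):

import java.util.concurrent.{ConcurrentHashMap, SynchronousQueue, TimeUnit}

object HandOffSketch {
  final class TransferState

  // Non-fair (default) mode: the JDK implementation happens to use a
  // LIFO stack, which is what gives newer threads priority for reuse.
  val transferStateQueue = new SynchronousQueue[TransferState]
  val blockerThreads = new ConcurrentHashMap[Thread, java.lang.Boolean]

  // (2) Transitioning to blocking: hand the state to a waiting cached
  // thread, or start a replacement worker if none is waiting.
  def onBlocking(state: TransferState): Unit = {
    blockerThreads.put(Thread.currentThread(), java.lang.Boolean.TRUE)
    if (!transferStateQueue.offer(state))
      startReplacementWorker(state) // hypothetical helper
  }

  // (3) Transitioning to cached: poll up to the keep-alive timeout for
  // a new state; on timeout the thread simply exits.
  def onCached(timeout: Long, unit: TimeUnit): Option[TransferState] =
    Option(transferStateQueue.poll(timeout, unit))

  private def startReplacementWorker(state: TransferState): Unit = () // elided
}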

…] and ConcurrentHashMap to keep track of blocker threads.
@@ -714,7 +713,7 @@ private[effect] final class WorkerThread[P <: AnyRef](
       if (blocking) {
         // The worker thread was blocked before. It is no longer part of the
         // core pool and needs to be cached.
-
+        val stateToTransfer = transferState
Author

I had to do this to avoid an NPE in pool.stateTransferQueue.offer(st), but I think it is causing this error:

[error] x blocking work does not starve poll
[error] None is not Some (IOPlatformSpecification.scala:702)
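
For what it's worth, a self-contained illustration of the capture (a hypothetical simplification, not the actual WorkerThread code): SynchronousQueue.offer(null) throws NullPointerException, so a field that another thread may null out has to be read into a local exactly once before offering.

import java.util.concurrent.SynchronousQueue

object CaptureSketch {
  final class TransferState

  @volatile var transferState: TransferState = new TransferState
  val transferStateQueue = new SynchronousQueue[TransferState]

  def handOff(): Unit = {
    val stateToTransfer = transferState // single read of the racy field
    if (stateToTransfer ne null)
      transferStateQueue.offer(stateToTransfer) // stable local, no NPE
  }
}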

Comment on lines 935 to 943
-          pool.replaceWorker(idx, cached)
+          pool.replaceWorker(idx, this)
Member

I think this change doesn't make sense. The old code used to take a thread out of the cache and promote it to the idx-th worker thread, to replace this thread which is about to block. The new code tries to replace this thread with itself?

Author

Oh, OK, I see it now; will try to fix this.

Author

I am thinking of adding a thread reference field to TransferState to pass on the cached WorkerThread (rough sketch below).
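
Something like this, perhaps (a hypothetical sketch, not the actual cats-effect types):

// Placeholder for the real WorkerThread[P] type.
final class WorkerThread

final class TransferState {
  // Set by the cached thread before it polls the queue; read by the
  // blocking thread so it can promote the right worker via something
  // like pool.replaceWorker(idx, transferState.thread), not itself.
  @volatile var thread: WorkerThread = null
}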

Member
@armanbilge armanbilge left a comment

thanks, this is looking good!

Comment on lines -964 to +973
-            transferState,
+            new WorkerThread.TransferState,
Member

Why this change?


if (pool.transferStateQueue.offer(transferState)) {
// If successful, a waiting thread will pick it up
// Register this thread in the blockerThreads map
Member

I'm confused by this comment. Doesn't the registration happen above?

Comment on lines -752 to +757
-    var t: WorkerThread[P] = null
-    while ({
-      t = cachedThreads.poll()
-      t ne null
-    }) {
+    val it = blockerThreads.keySet().iterator()
+    while (it.hasNext()) {
+      val t = it.next()
Member

If I remember correctly, I think one of the goals here is to avoid any allocations, in case the runtime was shutting down in a fatal condition (e.g. out-of-memory). Unfortunately, creating the iterator is an allocation. But, I don't know how to iterate the elements of a ConcurrentHashMap without an iterator 🤔
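
One possible direction (a sketch under the assumption that pre-allocating the Consumer helps; ConcurrentHashMap's internal traversal may still allocate a traverser, so this only avoids the iterator and closure allocations):

import java.util.concurrent.ConcurrentHashMap
import java.util.function.Consumer

final class ShutdownSketch(blockerThreads: ConcurrentHashMap[Thread, java.lang.Boolean]) {
  // Allocated once at construction, so shutdown itself allocates no closure.
  private[this] val interruptAll: Consumer[Thread] =
    new Consumer[Thread] {
      def accept(t: Thread): Unit = t.interrupt()
    }

  // Long.MaxValue as the parallelism threshold forces sequential traversal.
  def shutdown(): Unit =
    blockerThreads.forEachKey(Long.MaxValue, interruptAll)
}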
