
dataloader potentially leaking fibers #5407

@Amnesthesia

Description

Describe the bug

We've had a very strange recurring error where we get FiberErrors in production (but not in other environments, likely because of the volume of traffic). These errors have the message "Can't set guard page: Out of Memory". Our instance isn't actually out of memory, but it seems the Puma worker is unable to spawn any new fibers.

I've been trying to figure this one out for quite a long time now with no success, but I've noticed that after a request finishes, dead fibers are left dangling, and these build up over time and never get garbage collected.
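For reference, the quickest way to see this is from a Rails console attached to the same process; it is the same check the monitoring thread in the initializer below automates:

# Force a GC pass first so only genuinely retained fibers remain,
# then count fibers that have terminated but were never collected.
GC.start
ObjectSpace.each_object(Fiber).count { |f| !f.alive? }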

Large fragments and queries are more affected than small ones, because the number of fibers seems to grow by at least whatever fiber_limit is set to on each request. Setting fiber_limit to a lower value slows the growth, since fewer fibers are spawned, but it doesn't resolve the problem: the count still grows forever.
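For context, fiber_limit refers to the option passed when installing the dataloader in the schema. A minimal sketch (MySchema stands in for our real schema):

class MySchema < GraphQL::Schema
  # Caps how many fibers the dataloader spawns per run.
  # Lowering it slows the leak down but doesn't stop it.
  use GraphQL::Dataloader, fiber_limit: 20
end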

I've been trying to dig into what's still referencing the fibers and preventing them from being GC'd, but haven't been able to find it. Using the AsyncDataloader makes the problem a lot worse, increasing the growth rate.

We're on Rails 7.0.8.6, we don't set isolation_level = :fiber, and we use the regular GraphQL::Dataloader; we have tried setting that isolation level and using AsyncDataloader as well, without any luck.
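For completeness, the variants we tried look like this; a sketch assuming the standard Rails and graphql-ruby settings, with MySchema as a placeholder:

# config/application.rb: fiber-based execution state isolation
config.active_support.isolation_level = :fiber

# app/graphql/my_schema.rb: the async-gem-backed dataloader
class MySchema < GraphQL::Schema
  use GraphQL::Dataloader::AsyncDataloader
end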

Versions

graphql version: 2.5.11
rails (or other framework): 7.0.8.6

Steps to reproduce

# config/initializers/fiber_debug.rb

module FiberDebug
  def self.setup
    patch_fiber_new
    patch_fiber_yield
  end
  
  def self.patch_fiber_new
    # Let the monitoring thread below ask each fiber where it was created.
    Fiber.class_eval { attr_reader :created_by }

    Fiber.singleton_class.prepend(Module.new do
      def new(*args, **kwargs, &block)
        fiber = super
        loc = caller_locations(1, 1).first
        fiber.instance_variable_set(:@created_by, [loc.label, "#{loc.path}:#{loc.lineno}"])
        Rails.logger.debug("Fiber created: #{fiber.object_id} by #{loc.label} #{loc.path}:#{loc.lineno}")

        # Print when garbage collected. Capture only the object_id:
        # a finalizer proc that closed over `fiber` itself would pin the
        # fiber forever and skew the very leak being measured.
        id = fiber.object_id
        ObjectSpace.define_finalizer(fiber, proc { Rails.logger.debug("Fiber finalized: #{id}") })
        fiber
      end
    end)
  end
  
  def self.patch_fiber_yield
    # Fiber.yield is a singleton method, so it has to be patched on the
    # singleton class; #resume is an instance method on Fiber itself.
    Fiber.singleton_class.prepend(Module.new do
      def yield(*args)
        Rails.logger.debug("Fiber yielding: #{Fiber.current.object_id}")
        super
      end
    end)

    Fiber.prepend(Module.new do
      def resume(*args)
        Rails.logger.debug("Fiber resuming: #{object_id}")
        super
      end
    end)
  end
end

# Print fiber information in development
if Rails.env.development?
  FiberDebug.setup
  Thread.new do
    loop do
      # One ObjectSpace pass instead of four.
      fibers = ObjectSpace.each_object(Fiber).to_a
      alive, dead = fibers.partition(&:alive?)
      puts "Fiber count: #{fibers.size} (#{alive.size} alive, #{dead.size} dead)"

      # Count the dead fibers by the label of the method that created them.
      puts dead.filter_map { |f| f.created_by&.first }.tally
      sleep 10
    end
  end
end

Execute a query with the initializer above included and watch the number of dead fibers grow. Start Puma in single mode for this initializer to work; otherwise, move the logic into Puma's on_worker_boot hook instead.
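Any moderately nested query reproduces it. For example (a sketch; MySchema and the fields are placeholders for our real schema):

# Run from a Rails console in the same process as the initializer.
query = "{ posts { comments { author { name } } } }"
25.times { MySchema.execute(query) }
GC.start
# The background thread's "Fiber count:" line keeps climbing:
# dead fibers accumulate instead of being collected.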

Expected behavior

Dead fibers to be garbage collected over time
