Describe the bug
We've had a very strange recurring error where we get FiberErrors in production (but not in other environments, likely because of the volume of traffic). These errors have the message "Can't set guard page: Out of Memory". Our instance isn't actually out of memory, but it seems like the Puma worker is unable to spawn any new fibers.
I've been trying to figure this one out for quite a long time now with no success, but I've noticed that after a request finishes, dangling dead fibers are left behind, and these build up over time and never get garbage collected.
Large fragments and queries are more affected than small ones, because the number of fibers seems to grow by at least whatever fiber_limit is set to on each request. Setting fiber_limit to a lower value slows the growth down, since fewer fibers are spawned, but it doesn't resolve the problem: the count still keeps growing forever.
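For context, this is roughly how the limit is configured on our schema (a sketch; the class name and the value 20 are illustrative, not our real settings):

  # app/graphql/my_app_schema.rb (sketch; class name and the value 20 are illustrative)
  class MyAppSchema < GraphQL::Schema
    # Cap how many fibers the dataloader may spawn per run.
    # Lowering this slows the dead-fiber growth described above but doesn't stop it.
    use GraphQL::Dataloader, fiber_limit: 20
  end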
I've been trying to dig into what's still referencing the fibers and preventing them from being GC'd, but haven't been able to find it. Using the AsyncDataloader makes the problem a lot worse and speeds up the growth.
We're on Rails 7.0.8.6, we don't set isolation_level = :fiber, and we use the regular GraphQL::Dataloader, but we have also tried setting that isolation level and using AsyncDataloader, without any luck.
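For reference, this is roughly the combination we experimented with (a sketch, not our exact config; the schema class name is illustrative, and AsyncDataloader also needs the async gem available):

  # config/application.rb (inside the Application class) -- what we tried enabling:
  config.active_support.isolation_level = :fiber

  # app/graphql/my_app_schema.rb (sketch) -- swapping in the async dataloader
  class MyAppSchema < GraphQL::Schema
    use GraphQL::Dataloader::AsyncDataloader
  end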
Versions
graphql version: 2.5.11
rails (or other framework): 7.0.8.6
Steps to reproduce
# config/initializers/fiber_debug.rb
module FiberDebug
  def self.setup
    patch_fiber_new
    patch_fiber_yield
  end

  def self.patch_fiber_new
    # Remember where each fiber was created so the monitor thread can group dead fibers by creator
    Fiber.class_eval { attr_accessor :created_by }

    Fiber.singleton_class.prepend(Module.new do
      def new(*args, **kwargs, &block)
        fiber = super
        loc = caller_locations(1, 1)[0]
        fiber.created_by = [loc.label, loc.path, loc.lineno]
        Rails.logger.debug("Fiber created: #{fiber.object_id} by #{loc.label} #{loc.path}:#{loc.lineno}")
        # Print when garbage collected. Capture only the object_id here: closing over
        # `fiber` inside the finalizer proc would itself keep the fiber alive forever.
        fiber_id = fiber.object_id
        ObjectSpace.define_finalizer(fiber, proc { Rails.logger.debug("Fiber finalized: #{fiber_id}") })
        fiber
      end
    end)
  end

  def self.patch_fiber_yield
    # Fiber.yield is a class method, so it has to be patched on the singleton class;
    # resume is an instance method on the fiber being resumed.
    Fiber.singleton_class.prepend(Module.new do
      def yield(*args)
        Rails.logger.debug("Fiber yielding: #{Fiber.current.object_id}")
        super
      end
    end)

    Fiber.prepend(Module.new do
      def resume(*args)
        Rails.logger.debug("Fiber resuming: #{object_id}")
        super
      end
    end)
  end
end
# Print fiber information in development
if Rails.env.development?
  FiberDebug.setup

  Thread.new do
    loop do
      fibers = ObjectSpace.each_object(Fiber).to_a
      alive_count = fibers.count(&:alive?)
      dead_count = fibers.count { |f| !f.alive? }
      puts "Fiber count: #{fibers.size} (#{alive_count} alive, #{dead_count} dead)"
      # Group the dead fibers by the location that created them
      puts fibers.reject(&:alive?).group_by { |f| f.created_by&.first }.transform_values(&:count)
      sleep 10
    end
  end
end
Execute a query with the initializer above included and watch the number of dead fibers grow. Start Puma in single mode for this initializer to work; otherwise, move the logic to Puma's on_worker_boot hook instead.
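If Puma runs in cluster mode, something like this in config/puma.rb should work instead (a sketch; the monitoring thread is moved out of the initializer and into each worker):

  # config/puma.rb (sketch for cluster mode; in single mode the initializer alone is enough)
  on_worker_boot do
    FiberDebug.setup

    Thread.new do
      loop do
        dead = ObjectSpace.each_object(Fiber).count { |f| !f.alive? }
        puts "Dead fibers in worker #{Process.pid}: #{dead}"
        sleep 10
      end
    end
  end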
Expected behavior
Dead fibers to be garbage collected over time