CI job hangs intermittently when running tests in parallel #1134

@omarrida

Description

Testing framework: rspec
Framework: rails
Simplecov version: 0.22.0
Ruby version: 3.3.4
Rails version: 6.1

While running bundle exec rake test:nonintegration in CI with COVERAGE=true, the job intermittently hangs (about 20% of the time) after the specs have completed. The issue only appears:

  • When COVERAGE=true.
  • While using parallel test execution and simplecov together.
  • At the end of the test run, after all examples have executed.

Observations:

  • The final test output always includes the full spec completion message (e.g. 1512 examples, 0 failures, 1 pending), indicating that RSpec completes successfully.
  • Two of the test subprocesses consistently fail to exit. For example, when running with 12 processes, TEST_ENV_NUMBER 7 and 9 are always the ones that hang. With 16 processes, it's always 9 and 16.
  • The issue does not occur at all when COVERAGE=false.
  • The issue is NOT reproducible on my local machine. Locally I'm running an M4 Max (16 cores) with 48GB of RAM. The runner I have configured in GitHub Actions is also 16 cores (linux) with 64GB of RAM. I'm only able to observe the issue in GitHub Actions.

The nonintegration suite is just an exclude pattern:

task nonintegration: :environment do
  processes = ENV["PARALLEL_TEST_PROCESSORS"]&.to_i || 4
  command = "bundle exec parallel_rspec -n #{processes} --exclude-pattern 'spec\\/(system|integration)'"
  success = system(command)
  exit(1) unless success
end
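
As a stopgap while the root cause is unknown, the task could bound the run with a wall-clock timeout so that a stuck worker fails the job instead of hanging it. The following is only a sketch under assumptions: run_with_timeout and the 600-second limit mentioned below are hypothetical, and terminating the process group may cut off the coverage merge, so this hides the hang rather than fixing it.

```ruby
require "timeout"

# Hypothetical helper, not part of parallel_tests or SimpleCov.
# Runs a command in its own process group and returns :success, :failure,
# or :timeout. On timeout the whole group receives SIGTERM.
def run_with_timeout(command, timeout_seconds)
  # pgroup: true makes the child the leader of a new process group,
  # so its workers can be signalled together.
  pid = Process.spawn(command, pgroup: true)
  begin
    Timeout.timeout(timeout_seconds) do
      _, status = Process.wait2(pid)
      status.success? ? :success : :failure
    end
  rescue Timeout::Error
    # A signal name with a leading minus targets the process group.
    Process.kill("-TERM", pid)
    Process.wait(pid) # reap the child so it does not linger as a zombie
    :timeout
  end
end
```

In the task above, success = system(command) could then become something like success = run_with_timeout(command, 600) == :success.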

This is what the SimpleCov portion of my rails_helper.rb looks like. It's preceded by the RSpec configuration.

if ENV["COVERAGE"] && RSpec.configuration.files_to_run.none? { |file| file.include?("/spec/integration/") || file.include?("/spec/system/") }

  SimpleCov.start "rails" do
    # Enable branch coverage
    enable_coverage :branch

    # Add custom groups
    add_group "Controllers", "app/controllers"
    add_group "Models", "app/models"
    add_group "Helpers", "app/helpers"
    add_group "Jobs", "app/jobs"
    add_group "Mailers", "app/mailers"
    add_group "Libraries", "lib/"
    add_group "Services", "app/services"
    add_group "Policies", "app/policies"
    add_group "Ingestors", "app/ingestors"
    add_group "Views", "app/views"

    # Exclude files/directories from coverage
    add_filter %r{^/spec/}                    # Exclude spec files
    add_filter %r{^/test/}                    # Exclude test files
    add_filter %r{^/config/}                  # Exclude config files
    add_filter %r{^/db/}                      # Exclude database files
    add_filter %r{^/vendor/}                  # Exclude vendor files
    add_filter %r{^/node_modules/}            # Exclude node modules
    add_filter %r{^/bin/}                     # Exclude bin files
    add_filter %r{^/public/}                  # Exclude public files
    add_filter %r{^/storage/}                 # Exclude storage files
    add_filter %r{^/tmp/}                     # Exclude tmp files
    add_filter %r{^/log/}                     # Exclude log files
    add_filter %r{^/doc/}                     # Exclude documentation
    add_filter %r{^/docs/}                    # Exclude docs
    add_filter %r{^/datadog/}                 # Exclude datadog config
    add_filter %r{^/typings/}                 # Exclude TypeScript typings
    add_filter %r{^/\.ruby-lsp/}              # Exclude Ruby LSP files
    add_filter %r{^/\.idea/}                  # Exclude IDE files
    add_filter %r{^/\.github/}                # Exclude GitHub files

    # Exclude specific file types
    add_filter(/\.(js|css|scss|sass|less|coffee|ts|tsx)$/)  # Exclude frontend assets
    add_filter(/\.(yml|yaml|json|xml|html|haml|erb)$/)      # Exclude template files
    add_filter(/\.(rake|thor|task)$/)                       # Exclude rake tasks
    add_filter(/\.(md|txt|rdoc)$/)                          # Exclude documentation files

    # Exclude specific files
    add_filter "karafka.rb"                   # Exclude Karafka configuration
    add_filter "config.ru"                    # Exclude Rack configuration
    add_filter "Rakefile"                     # Exclude Rakefile
    add_filter "vite.config.mts"              # Exclude Vite configuration
    add_filter "postcss.config.js"            # Exclude PostCSS configuration
    add_filter "tsconfig.json"                # Exclude TypeScript configuration
    add_filter "testSetup.ts"                 # Exclude test setup

    # Set minimum coverage requirements
    minimum_coverage_by_file 0
    minimum_coverage 50

    formatter SimpleCov::Formatter::MultiFormatter.new([
      SimpleCov::Formatter::HTMLFormatter,
      SimpleCov::Formatter::JSONFormatter
    ])

    # Track files that are not covered
    track_files "app/**/*.rb"
    track_files "lib/**/*.rb"
  end
end

I tried adding the following at the end of the SimpleCov.start block as an experiment:

SimpleCov.command_name Process.pid.to_s

puts "[SimpleCov] Process #{Process.pid} started"


SimpleCov.at_exit do
  puts "[SimpleCov] Process #{Process.pid} with env #{ENV["TEST_ENV_NUMBER"]} writing result"
  SimpleCov.result.format!
rescue => e
  puts "[SimpleCov] ERROR writing result in process #{Process.pid} and env #{ENV["TEST_ENV_NUMBER"]}: #{e.message}"
  puts e.backtrace.join("\n")
  raise
end

at_exit do
  puts "[SimpleCov] Process #{Process.pid} and env #{ENV["TEST_ENV_NUMBER"]}exiting (rspec done)"
end

When the hang happens, every process prints the at_exit debug output except the two that hang, which print none of it. That suggests to me that they never enter the at_exit block.
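
To find out where the two stuck workers are blocked, one option is a watchdog thread that starts when the suite finishes and prints every thread's backtrace if the process is still alive some time later. This is a sketch under assumptions, not something provided by SimpleCov or parallel_tests: format_thread_dump, start_exit_watchdog and the 60-second delay are hypothetical names and values.

```ruby
# Hypothetical helper: renders the status and backtrace of each live thread.
def format_thread_dump(threads = Thread.list)
  threads.map do |thread|
    header = "[thread-dump] pid=#{Process.pid} " \
             "env=#{ENV["TEST_ENV_NUMBER"].inspect} " \
             "thread=#{thread.object_id} status=#{thread.status.inspect}"
    # Thread#backtrace returns nil for a thread that has already finished.
    frames = (thread.backtrace || ["(no backtrace available)"]).map { |frame| "    #{frame}" }
    [header, *frames].join("\n")
  end.join("\n\n")
end

# Hypothetical watchdog: if the process is still running after
# timeout_seconds, write a dump of all threads to io.
def start_exit_watchdog(timeout_seconds, io: $stderr)
  Thread.new do
    sleep timeout_seconds
    io.puts format_thread_dump
    io.flush
  end
end
```

It could be started from an RSpec after(:suite) hook, for example config.after(:suite) { start_exit_watchdog(60) }. Workers that exit before the delay elapses print nothing, because the watchdog thread dies with the process.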

This is the last thing that the GitHub action prints before it hangs:

Finished in 1 minute 51.88 seconds (files took 8.24 seconds to load)
1512 examples, 0 failures, 1 pending
Randomized with seed 40911
Randomized with seed 40911
[SimpleCov] Process 4870 and env 9exiting (rspec done)
[SimpleCov] Process 4870 with env 9 writing result
Coverage report generated for 4863, 4864, 4865, 4866, 4867, 4869, 4870, 4871, 4872, 4874 to /home/runner/work/project_name/project_name/coverage.
Line Coverage: 90.28% (19948 / 22095)
Branch Coverage: 72.42% (2956 / 4082)
Coverage report generated for 4863, 4864, 4865, 4866, 4867, 4869, 4870, 4871, 4872, 4874 to /home/runner/work/project_name/project_name/coverage/coverage.json. 19948 / 22095 LOC (90.28%) covered.

Note that the last line shows the coverage report was generated for only 10 processes; the remaining two, which don't appear to have completed, are missing.

I'd really appreciate it if someone could point me in the right direction on how to address this. Happy to provide any additional info!
