TEST/COMMON: Fail tests when file descriptors leak is detected by guy-ealey-morag · Pull Request #11221 · openucx/ucx

guy-ealey-morag · 2026-02-27T17:00:11Z

What?

Add automatic file descriptor leak detection to the gtest framework.
After each test's teardown, the open FD set is compared against the previous test.
New FDs are identified and reported.
A consecutive-increase threshold tolerates one-time FD increases from external library initialization, then fails the test on repeated leaks.

Why?

FD leaks in tests can exhaust the process file descriptor limit, causing hard-to-diagnose failures in later tests.

How?

Added collect_open_fds() and fd_target() free functions in test_helpers to enumerate /proc/self/fd and resolve FD symlink targets.
Added check_fd_leaks() private method to test_base, called at the end of TearDownProxy(). It compares the current FD set against a snapshot from the previous test.
A static consecutive-increase counter tolerates a configurable number of one-time FD increases (e.g. rdma-core or CUDA driver init) before triggering ADD_FAILURE().
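
For illustration, the snapshot comparison and consecutive-increase counter described above can be sketched roughly as follows. This is a minimal sketch, not the PR's code: the class name fd_leak_tracker, its interface, and the choice to fail once the counter reaches the threshold are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper (not from the PR): compares each FD snapshot with the
// previous one and counts how many checks in a row saw new FDs.
class fd_leak_tracker {
public:
    explicit fd_leak_tracker(size_t threshold) :
        m_threshold(threshold), m_consecutive(0)
    {
    }

    // Fills new_fds with the FDs that were not open at the previous check.
    // Returns true when the caller should report a failure.
    bool check(const std::set<int> &open_fds, std::vector<int> *new_fds)
    {
        bool fail = false;

        new_fds->clear();
        if (!m_prev.empty()) {
            std::set_difference(open_fds.begin(), open_fds.end(),
                                m_prev.begin(), m_prev.end(),
                                std::back_inserter(*new_fds));
            if (new_fds->empty()) {
                m_consecutive = 0; // a clean check resets the streak
            } else if (++m_consecutive >= m_threshold) {
                fail = true; // assumption: fail once the threshold is reached
            }
        }

        m_prev = open_fds; // always refresh the snapshot, even on the first call
        return fail;
    }

    size_t consecutive() const
    {
        return m_consecutive;
    }

private:
    std::set<int> m_prev;
    size_t        m_threshold;
    size_t        m_consecutive;
};
```

With a threshold of 2, a single increase (for example from a one-time library initialization) is tolerated, while a second increase in a row is reported.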

NOTE: CI tests will fail until the existing fd leaks are fixed in a separate PR.

ColinNV

LGTM, just a few nitpicks.

ColinNV · 2026-03-05T16:38:11Z

test/gtest/common/test_helpers.h

+/**
+ * Return the symlink target of the given file descriptor.
+ */
+std::string fd_target(int fd);


I find this name confusing. Since it's essentially a readlink(2) wrapper, I'd name it e.g. readlink_proc_fd() or something like that.
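
As a rough illustration of what such a readlink(2) wrapper might look like in C++11 on Linux: the function name is taken from the suggestion above, while the body and the empty-string error convention are assumptions, not the merged implementation.

```cpp
#include <limits.h>
#include <string>
#include <unistd.h>

// Hypothetical sketch: resolve /proc/self/fd/<fd> to its symlink target.
// Returns an empty string if the descriptor is not open or /proc is missing.
std::string readlink_proc_fd(int fd)
{
    const std::string path = "/proc/self/fd/" + std::to_string(fd);
    char link[PATH_MAX];

    // readlink() does not null-terminate, so rely on the returned length
    const ssize_t len = readlink(path.c_str(), link, sizeof(link));
    if (len < 0) {
        return "";
    }

    return std::string(link, static_cast<size_t>(len));
}
```

For a pipe the target has the form pipe:[inode], and for a regular file it is the file's path.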

guy-ealey-morag · 2026-03-06T10:48:05Z

test/gtest/common/test_helpers.cc

+std::set<int> collect_open_fds()
+{
+    std::set<int> fds;
+    DIR *dir = opendir("/proc/self/fd");


std::filesystem::directory_iterator is C++17, UCX tests use C++11

ColinNV · 2026-03-05T16:43:12Z

test/gtest/common/test_helpers.cc

+    const int dir_fd = dirfd(dir);
+    struct dirent *entry;
+    while ((entry = readdir(dir)) != NULL) {
+        if (entry->d_name[0] != '.') {


std::filesystem::directory_iterator skips . and ..

ColinNV · 2026-03-05T16:44:17Z

test/gtest/common/test_helpers.h

+/**
+ * Collect the set of currently open file descriptors.
+ */
+std::set<int> collect_open_fds();


[[nodiscard]]
currently_open_fds()
open_fds()
all_open_fds()
?

[[nodiscard]] is C++17, but UCX tests use C++11.
Renamed to get_open_fds().

ColinNV · 2026-03-05T16:45:50Z

test/gtest/common/test.h


 private:
+    void check_fd_leaks();
+    bool is_fd_whitelisted(const std::string &target) const;


This name is confusing, one expects the argument to be an int for a file descriptor.

brminich · 2026-03-06T14:31:11Z

test/gtest/common/test.h

+    static std::set<int> m_prev_open_fds;
+    static int m_consecutive_fd_increases;


Minor: I'd align these with the other members for consistency.

brminich · 2026-03-06T14:34:24Z

test/gtest/common/test.cc

+
+        if (num_leaked > 0 || num_whitelisted > 0) {
+            UCS_TEST_MESSAGE << "new fds detected (" << num_leaked
+                             << " non-whitelisted, " << num_whitelisted


why non-whitelisted?

It's following the value of num_leaked

change to "leaked" instead

ColinNV · 2026-03-10T12:18:13Z

test/gtest/common/test.cc

-
 namespace ucs {

+constexpr int CONSECUTIVE_FD_INCREASE_THRESHOLD = 2;


Why is there (only) a limit on consecutive increases and not (also) on the total number of increases? Is it because some groups of tests have a first test that opens more files that are then used by the rest?

Yes, for example the first tests that use infiniband or cuda open some fds that persist for the rest of the tests.
We can also track the total to make sure it doesn't happen more than expected, wdyt?

I was just thinking that a hard upper limit, and a limit on the increase from one check to the next (that possibly ignores the whitelist) could be useful. Then again we don't want to over-engineer too much...

I added logging of the total number of increases, I'll see how it behaves to decide if we want to limit it

ColinNV · 2026-03-10T12:24:50Z

test/gtest/common/test.cc

+bool test_base::is_target_whitelisted(const std::string &target) const
+{
+    /* fd targets for external libraries (rdma-core, CUDA driver, etc.) */
+    static const char *targets_whitelist[] = {


std::vector< std::string >
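
A sketch of how the whitelist could look with std::vector<std::string> and substring matching (the commit list mentions using find instead of regex). The two entries below are illustrative assumptions only; the PR's actual list is not shown in this thread.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: substring match of an fd target against a whitelist.
bool is_target_whitelisted(const std::string &target)
{
    // Assumed example entries for rdma-core and CUDA driver device nodes
    static const std::vector<std::string> targets_whitelist = {
        "/dev/infiniband/", "/dev/nvidia"
    };

    for (const std::string &pattern : targets_whitelist) {
        if (target.find(pattern) != std::string::npos) {
            return true;
        }
    }

    return false;
}
```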

ColinNV · 2026-03-10T12:29:29Z

test/gtest/common/test_helpers.cc

+
+    const int dir_fd = dirfd(dir);
+    if (dir_fd < 0) {
+        closedir(dir);


If this weren't test code I'd insist on RAII.
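
For reference, an RAII variant is possible in C++11 with std::unique_ptr and a custom deleter, so that every early return closes the directory. This is only a sketch of the idea; dir_closer, dir_ptr and count_proc_fd_entries are hypothetical names, not part of the PR.

```cpp
#include <cstddef>
#include <dirent.h>
#include <memory>

// Deleter that closes a DIR stream; unique_ptr never calls it with NULL.
struct dir_closer {
    void operator()(DIR *dir) const
    {
        closedir(dir);
    }
};

typedef std::unique_ptr<DIR, dir_closer> dir_ptr;

// Hypothetical example: count directory entries, including "." and "..".
size_t count_proc_fd_entries()
{
    dir_ptr dir(opendir("/proc/self/fd"));
    if (!dir) {
        return 0;
    }

    if (dirfd(dir.get()) < 0) {
        return 0; // no explicit closedir() needed, the deleter handles it
    }

    size_t count = 0;
    while (readdir(dir.get()) != NULL) {
        ++count;
    }

    return count;
}
```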

ColinNV · 2026-03-10T12:30:50Z

test/gtest/common/test_helpers.cc

@@ -14,6 +14,7 @@
 #include <ucs/config/parser.h>

 #include <set>


<set> is now included by test_helpers.h.

guy-ealey-morag · 2026-03-11T09:03:02Z

@tvegas1 @ColinNV @brminich
I moved the call to check_fd_leaks() from TearDownProxy() to ~test_base() (the destructor).
I found that some resources that are released in destructors still exist during TearDownProxy() but are already released by the time ~test_base() runs, so the previous implementation flagged fds that were going to be closed anyway.
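
The ordering issue can be reproduced with a small standalone sketch: members of a derived test class are destroyed before the base class destructor body runs, but after any teardown hook. The types below are hypothetical stand-ins, not UCX classes.

```cpp
#include <string>
#include <vector>

static std::vector<std::string> events;

// Stand-in for a member that owns a file descriptor
struct resource {
    ~resource()
    {
        events.push_back("resource released");
    }
};

// Stand-in for test_base: one check in teardown, one in the destructor
struct base {
    void tear_down()
    {
        events.push_back("teardown check");
    }

    virtual ~base()
    {
        events.push_back("base destructor check");
    }
};

struct derived_test : base {
    resource m_res;
};

void run_one_test()
{
    derived_test t;
    t.tear_down();
} // t is destroyed here: m_res first, then ~base()
```

After run_one_test(), events holds the teardown check first, then the resource release, then the base destructor check, which is why a check in the destructor no longer sees the member's descriptor.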

ColinNV · 2026-03-12T16:31:12Z

test/gtest/common/test.cc

 std::vector<std::string> test_base::m_warnings;
 std::vector<std::string> test_base::m_first_warns_and_errors;
+std::set<int> test_base::m_prev_open_fds;
+int test_base::m_consecutive_fd_increases = 0;


Minor, size_t.

yosefe · 2026-03-13T13:03:08Z

test/gtest/common/test.cc

+    const std::string padding(13, ' ');
+    std::set<int> open_fds = get_open_fds();
+
+    if (!m_prev_open_fds.empty()) {


if (m_prev_open_fds.empty()) {
    return;
}

The assignment m_prev_open_fds = std::move(open_fds); after the if must always be executed, so I shouldn't return early.

yosefe · 2026-03-13T13:06:42Z

test/gtest/common/test.cc

+        }
+
+        if (num_unexpected == 0) {
+            m_consecutive_fd_increases = 0;


Why does it matter whether the increases are consecutive?
Does it also work when tests are shuffled?

The initial detection logic raised some false-positives that led me to focus on consecutive leaks instead of the total amount.
After stabilizing the leak check it makes more sense to check the total instead.

yosefe · 2026-03-13T13:07:19Z

test/gtest/common/test.cc

+                   << " (consecutive: " << m_consecutive_fd_increases << ")";
+            }
+
+            UCS_TEST_MESSAGE << "new leaked fds (" << num_unexpected


If a file is whitelisted, let's not call it "leaked"; maybe just don't print it, to keep the output clean.

yosefe · 2026-03-13T13:11:43Z

test/gtest/common/test_helpers.cc

+    }
+
+    link[len] = '\0';
+    return std::string(link) + (len == max_len ? " (truncated)" : "");


If link is PATH_MAX in size, I guess the target cannot actually be truncated?

I agree, removed.

guy-ealey-morag added 4 commits February 27, 2026 16:43

TEST/COMMON: Fail tests when file descriptors leak is detected

2033b7d

TEST/COMMON: Improve leak detection code

bda493c

TEST/COMMON: Add whitelist logic to fd detection

8cc2349

Merge branch 'master' into fail-on-fd-leaks

3ef08e9

guy-ealey-morag marked this pull request as ready for review March 5, 2026 14:29

guy-ealey-morag marked this pull request as draft March 5, 2026 14:30

TEST/COMMON: Use find instead of regex

bc99692

guy-ealey-morag marked this pull request as ready for review March 5, 2026 14:33

guy-ealey-morag requested review from ColinNV, brminich, iyastreb, tvegas1 and yosefe March 5, 2026 14:33

guy-ealey-morag added the Ready for Review label Mar 5, 2026

ColinNV reviewed Mar 5, 2026

guy-ealey-morag commented Mar 6, 2026

TEST/COMMON: Fix PR comments

53969e2

brminich reviewed Mar 6, 2026

TEST/COMMON: Fix PR comments

3a413b6

tvegas1 reviewed Mar 9, 2026

TEST/COMMON: Fix PR comments

374dbaf

guy-ealey-morag requested review from ColinNV and tvegas1 March 10, 2026 12:06

ColinNV reviewed Mar 10, 2026

guy-ealey-morag added 3 commits March 10, 2026 12:31

TEST/COMMON: Fix string array type

4a78c3a

TEST/COMMON: Track total fd increases

d1a52a4

TEST/COMMON: Fix build error

d84b305

guy-ealey-morag added 2 commits March 11, 2026 08:55

TEST/COMMON: Move fd check to destructor

79c0b08

Merge branch 'master' into fail-on-fd-leaks

73bb1c1

tvegas1 previously approved these changes Mar 11, 2026

guy-ealey-morag requested review from ColinNV, brminich and tvegas1 March 11, 2026 09:03

ColinNV reviewed Mar 12, 2026

ColinNV approved these changes Mar 12, 2026

TEST/COMMON: Change int to size_t for counters

9ead746

guy-ealey-morag dismissed tvegas1’s stale review via 9ead746 March 13, 2026 09:07

guy-ealey-morag self-assigned this Mar 13, 2026

tvegas1 previously approved these changes Mar 13, 2026

brminich previously approved these changes Mar 13, 2026

yosefe reviewed Mar 13, 2026

TEST/COMMON: Fix PR comments

e48305f

guy-ealey-morag dismissed stale reviews from brminich and tvegas1 via e48305f March 16, 2026 09:21

yosefe reviewed Mar 16, 2026

TEST/COMMON: Fix PR comments

83ea69c

guy-ealey-morag requested a review from yosefe March 16, 2026 09:52

yosefe approved these changes Mar 16, 2026

tvegas1 approved these changes Mar 16, 2026

brminich approved these changes Mar 17, 2026

brminich merged commit 92cc96f into openucx:master Mar 17, 2026
152 checks passed

jeynmann pushed a commit to jeynmann/ucx that referenced this pull request Mar 17, 2026

TEST/COMMON: Fail tests when file descriptors leak is detected (openu…

9689740

…cx#11221)
