Skip to content

TEST/COMMON: Fail tests when file descriptors leak is detected#11221

Merged
brminich merged 16 commits intoopenucx:masterfrom
guy-ealey-morag:fail-on-fd-leaks
Mar 17, 2026
Merged

TEST/COMMON: Fail tests when file descriptors leak is detected#11221
brminich merged 16 commits intoopenucx:masterfrom
guy-ealey-morag:fail-on-fd-leaks

Conversation

@guy-ealey-morag
Copy link
Copy Markdown
Contributor

@guy-ealey-morag guy-ealey-morag commented Feb 27, 2026

What?

Add automatic file descriptor leak detection to the gtest framework.
After each test's teardown, the open FD set is compared against the previous test.
New FDs are identified and reported.
A consecutive-increase threshold tolerates one-time FD increases from external library initialization, then fails the test on repeated leaks.

Why?

FD leaks in tests can exhaust the process file descriptor limit, causing hard-to-diagnose failures in later tests.

How?

  • Added collect_open_fds() and fd_target() free functions in test_helpers to enumerate /proc/self/fd and resolve FD symlink targets.
  • Added check_fd_leaks() private method to test_base, called at the end of TearDownProxy(). It compares the current FD set against a snapshot from the previous test.
  • A static consecutive-increase counter tolerates a configurable number of one-time FD increases (e.g. rdma-core or CUDA driver init) before triggering ADD_FAILURE().

NOTE: CI tests are going to fail until existing fd leaks are fixed by a separate PR

@guy-ealey-morag guy-ealey-morag marked this pull request as ready for review March 5, 2026 14:29
@guy-ealey-morag guy-ealey-morag marked this pull request as draft March 5, 2026 14:30
@guy-ealey-morag guy-ealey-morag marked this pull request as ready for review March 5, 2026 14:33
Copy link
Copy Markdown
Contributor

@ColinNV ColinNV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nitpicks.

/**
* Return the symlink target of the given file descriptor.
*/
std::string fd_target(int fd);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this name confusing, since it's essentially a readlink(2) wrapper I'd name it e.g. readlink_proc_fd() or something like that.

std::set<int> collect_open_fds()
{
std::set<int> fds;
DIR *dir = opendir("/proc/self/fd");
Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::filesystem::directory_iterator is C++17, UCX tests use C++11

const int dir_fd = dirfd(dir);
struct dirent *entry;
while ((entry = readdir(dir)) != NULL) {
if (entry->d_name[0] != '.') {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::filesystem::directory_iterator skips . and ..

/**
* Collect the set of currently open file descriptors.
*/
std::set<int> collect_open_fds();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[nodiscard]]
currently_open_fds()
open_fds()
all_open_fds()
?

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[nodiscard]] is C++17, UCX tests use C++11
renamed to get_open_fds()


private:
void check_fd_leaks();
bool is_fd_whitelisted(const std::string &target) const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is confusing, one expects the argument to be an int for a file descriptor.

std::set<int> collect_open_fds()
{
std::set<int> fds;
DIR *dir = opendir("/proc/self/fd");
Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::filesystem::directory_iterator is C++17, UCX tests use C++11

Comment on lines +175 to +176
static std::set<int> m_prev_open_fds;
static int m_consecutive_fd_increases;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: i'd align with other members for consistency


if (num_leaked > 0 || num_whitelisted > 0) {
UCS_TEST_MESSAGE << "new fds detected (" << num_leaked
<< " non-whitelisted, " << num_whitelisted
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why non-whitelisted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's following the value of num_leaked

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to "leaked" instead


namespace ucs {

constexpr int CONSECUTIVE_FD_INCREASE_THRESHOLD = 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there (only) a limit on consecutive increases and not (also) on the total number of increases? Is it because some groups of tests have a first test that opens more files that are then used by the rest?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for example the first tests that use infiniband or cuda open some fds that persist for the rest of the tests.
We can also track the total to make sure it doesn't happen more than expected, wdyt?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking that a hard upper limit, and a limit on the increase from one check to the next (that possibly ignores the whitelist) could be useful. Then again we don't want to over-engineer too much...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added logging of the total number of increases, I'll see how it behaves to decide if we want to limit it

bool test_base::is_target_whitelisted(const std::string &target) const
{
/* fd targets for external libraries (rdma-core, CUDA driver, etc.) */
static const char *targets_whitelist[] = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::vector< std::string >


const int dir_fd = dirfd(dir);
if (dir_fd < 0) {
closedir(dir);
Copy link
Copy Markdown
Contributor

@ColinNV ColinNV Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this weren't test code I'd insist on RAII.

@@ -14,6 +14,7 @@
#include <ucs/config/parser.h>

#include <set>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is now included by test_helpers.h.

tvegas1
tvegas1 previously approved these changes Mar 11, 2026
@guy-ealey-morag
Copy link
Copy Markdown
Contributor Author

guy-ealey-morag commented Mar 11, 2026

@tvegas1 @ColinNV @brminich
I moved the call to check_fd_leaks() from TearDownProxy() to ~test_base() (the destructor).
I found that some resources that are released in destructors still exist during TearDownProxy() but are already released during ~test_base(), so in the previous implementation it detected fds that were going to be closed anyway.

std::vector<std::string> test_base::m_warnings;
std::vector<std::string> test_base::m_first_warns_and_errors;
std::set<int> test_base::m_prev_open_fds;
int test_base::m_consecutive_fd_increases = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, size_t.

tvegas1
tvegas1 previously approved these changes Mar 13, 2026
brminich
brminich previously approved these changes Mar 13, 2026
const std::string padding(13, ' ');
std::set<int> open_fds = get_open_fds();

if (!m_prev_open_fds.empty()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (m_prev_open_fds.empty()) {
return;
}

Copy link
Copy Markdown
Contributor Author

@guy-ealey-morag guy-ealey-morag Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_prev_open_fds = std::move(open_fds); after the if should be always executed so I shouldn't return early

}

if (num_unexpected == 0) {
m_consecutive_fd_increases = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does it matter if the increases is consecutive?
does it work also when tests are shuffled?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial detection logic raised some false-positives that led me to focus on consecutive leaks instead of the total amount.
After stabilizing the leak check it makes more sense to check the total instead.

<< " (consecutive: " << m_consecutive_fd_increases << ")";
}

UCS_TEST_MESSAGE << "new leaked fds (" << num_unexpected
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a file is whitelisted, let's not call it "leaked", may just not print it to make output clean

}

link[len] = '\0';
return std::string(link) + (len == max_len ? " (truncated)" : "");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if link is PATH_MAX, i guess it should not be truncated?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, removed.

@guy-ealey-morag guy-ealey-morag dismissed stale reviews from brminich and tvegas1 via e48305f March 16, 2026 09:21
@guy-ealey-morag guy-ealey-morag requested a review from yosefe March 16, 2026 09:52
@brminich brminich merged commit 92cc96f into openucx:master Mar 17, 2026
152 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants