TEST/COMMON: Fail tests when file descriptors leak is detected#11221
TEST/COMMON: Fail tests when file descriptors leak is detected#11221brminich merged 16 commits intoopenucx:masterfrom
Conversation
ColinNV
left a comment
There was a problem hiding this comment.
LGTM, just a few nitpicks.
test/gtest/common/test_helpers.h
Outdated
| /** | ||
| * Return the symlink target of the given file descriptor. | ||
| */ | ||
| std::string fd_target(int fd); |
There was a problem hiding this comment.
I find this name confusing, since it's essentially a readlink(2) wrapper I'd name it e.g. readlink_proc_fd() or something like that.
| std::set<int> collect_open_fds() | ||
| { | ||
| std::set<int> fds; | ||
| DIR *dir = opendir("/proc/self/fd"); |
There was a problem hiding this comment.
std::filesystem::directory_iterator is C++17, UCX tests use C++11
test/gtest/common/test_helpers.cc
Outdated
| const int dir_fd = dirfd(dir); | ||
| struct dirent *entry; | ||
| while ((entry = readdir(dir)) != NULL) { | ||
| if (entry->d_name[0] != '.') { |
There was a problem hiding this comment.
std::filesystem::directory_iterator skips . and ..
test/gtest/common/test_helpers.h
Outdated
| /** | ||
| * Collect the set of currently open file descriptors. | ||
| */ | ||
| std::set<int> collect_open_fds(); |
There was a problem hiding this comment.
[[nodiscard]]
currently_open_fds()
open_fds()
all_open_fds()
?
There was a problem hiding this comment.
[[nodiscard]] is C++17, UCX tests use C++11
renamed to get_open_fds()
test/gtest/common/test.h
Outdated
|
|
||
| private: | ||
| void check_fd_leaks(); | ||
| bool is_fd_whitelisted(const std::string &target) const; |
There was a problem hiding this comment.
This name is confusing, one expects the argument to be an int for a file descriptor.
| std::set<int> collect_open_fds() | ||
| { | ||
| std::set<int> fds; | ||
| DIR *dir = opendir("/proc/self/fd"); |
There was a problem hiding this comment.
std::filesystem::directory_iterator is C++17, UCX tests use C++11
test/gtest/common/test.h
Outdated
| static std::set<int> m_prev_open_fds; | ||
| static int m_consecutive_fd_increases; |
There was a problem hiding this comment.
minor: i'd align with other members for consistency
test/gtest/common/test.cc
Outdated
|
|
||
| if (num_leaked > 0 || num_whitelisted > 0) { | ||
| UCS_TEST_MESSAGE << "new fds detected (" << num_leaked | ||
| << " non-whitelisted, " << num_whitelisted |
There was a problem hiding this comment.
It's following the value of num_leaked
There was a problem hiding this comment.
change to "leaked" instead
test/gtest/common/test.cc
Outdated
|
|
||
| namespace ucs { | ||
|
|
||
| constexpr int CONSECUTIVE_FD_INCREASE_THRESHOLD = 2; |
There was a problem hiding this comment.
Why is there (only) a limit on consecutive increases and not (also) on the total number of increases? Is it because some groups of tests have a first test that opens more files that are then used by the rest?
There was a problem hiding this comment.
Yes, for example the first tests that use infiniband or cuda open some fds that persist for the rest of the tests.
We can also track the total to make sure it doesn't happen more than expected, wdyt?
There was a problem hiding this comment.
I was just thinking that a hard upper limit, and a limit on the increase from one check to the next (that possibly ignores the whitelist) could be useful. Then again we don't want to over-engineer too much...
There was a problem hiding this comment.
I added logging of the total number of increases, I'll see how it behaves to decide if we want to limit it
test/gtest/common/test.cc
Outdated
| bool test_base::is_target_whitelisted(const std::string &target) const | ||
| { | ||
| /* fd targets for external libraries (rdma-core, CUDA driver, etc.) */ | ||
| static const char *targets_whitelist[] = { |
There was a problem hiding this comment.
std::vector< std::string >
|
|
||
| const int dir_fd = dirfd(dir); | ||
| if (dir_fd < 0) { | ||
| closedir(dir); |
There was a problem hiding this comment.
If this weren't test code I'd insist on RAII.
test/gtest/common/test_helpers.cc
Outdated
| @@ -14,6 +14,7 @@ | |||
| #include <ucs/config/parser.h> | |||
|
|
|||
| #include <set> | |||
There was a problem hiding this comment.
Is now included by test_helpers.h.
|
@tvegas1 @ColinNV @brminich |
test/gtest/common/test.cc
Outdated
| std::vector<std::string> test_base::m_warnings; | ||
| std::vector<std::string> test_base::m_first_warns_and_errors; | ||
| std::set<int> test_base::m_prev_open_fds; | ||
| int test_base::m_consecutive_fd_increases = 0; |
| const std::string padding(13, ' '); | ||
| std::set<int> open_fds = get_open_fds(); | ||
|
|
||
| if (!m_prev_open_fds.empty()) { |
There was a problem hiding this comment.
if (m_prev_open_fds.empty()) {
return;
}
There was a problem hiding this comment.
m_prev_open_fds = std::move(open_fds); after the if should be always executed so I shouldn't return early
test/gtest/common/test.cc
Outdated
| } | ||
|
|
||
| if (num_unexpected == 0) { | ||
| m_consecutive_fd_increases = 0; |
There was a problem hiding this comment.
why does it matter if the increases is consecutive?
does it work also when tests are shuffled?
There was a problem hiding this comment.
The initial detection logic raised some false-positives that led me to focus on consecutive leaks instead of the total amount.
After stabilizing the leak check it makes more sense to check the total instead.
test/gtest/common/test.cc
Outdated
| << " (consecutive: " << m_consecutive_fd_increases << ")"; | ||
| } | ||
|
|
||
| UCS_TEST_MESSAGE << "new leaked fds (" << num_unexpected |
There was a problem hiding this comment.
if a file is whitelisted, let's not call it "leaked", may just not print it to make output clean
test/gtest/common/test_helpers.cc
Outdated
| } | ||
|
|
||
| link[len] = '\0'; | ||
| return std::string(link) + (len == max_len ? " (truncated)" : ""); |
There was a problem hiding this comment.
if link is PATH_MAX, i guess it should not be truncated?
There was a problem hiding this comment.
I agree, removed.
What?
Add automatic file descriptor leak detection to the gtest framework.
After each test's teardown, the open FD set is compared against the previous test.
New FDs are identified and reported.
A consecutive-increase threshold tolerates one-time FD increases from external library initialization, then fails the test on repeated leaks.
Why?
FD leaks in tests can exhaust the process file descriptor limit, causing hard-to-diagnose failures in later tests.
How?
collect_open_fds()andfd_target()free functions intest_helpersto enumerate/proc/self/fdand resolve FD symlink targets.check_fd_leaks()private method totest_base, called at the end ofTearDownProxy(). It compares the current FD set against a snapshot from the previous test.ADD_FAILURE().NOTE: CI tests are going to fail until existing fd leaks are fixed by a separate PR