NC Health | add dedicated servers for each forks for health checks #9045


Merged: 1 commit merged into noobaa:master from the fork_server branch on Jun 10, 2025

Conversation

@nadavMiz (Contributor) commented May 28, 2025

Describe the Problem

Since we can't predict how Node forwards incoming connections to the forks, we can't ensure that health requests reach every fork. Currently a retry mechanism is used, which is unreliable and may fail in certain cases. In addition, under heavy system load health requests can time out. In both cases health detects the system as down, which triggers a failover in CES systems. This PR adds a dedicated server per fork for health requests.

Explain the Changes

  1. Add a new server in each fork for health requests. The port for each server falls in the range between a defined base port and base port + number of forks.
  2. Add a config value, ENDPOINT_FORK_PORT_BASE, that defines the base for the dedicated per-fork ports.
  3. Change the health script to use the new dedicated servers instead of the main endpoint server to detect fork health (a sketch of how such a probe could look follows this list).
  4. Keep a list of port offsets in the master fork. If a fork is replaced, its replacement's server is given the port of the replaced fork.
  5. Add an argument for the health script to disable server validation - disable_service_validation. This enables its use for testing, both locally and on CI.
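
A rough sketch of how the health script could probe each fork's dedicated port (the port base, fork count, and exact request options here are illustrative; the real logic lives in src/manage_nsfs/health.js):

const https = require('https');

// Query each fork's dedicated health server and collect the worker ids that answered.
async function probe_forks(fork_base_port, total_fork_count) {
    const worker_ids = [];
    for (let i = 0; i < total_fork_count; i++) {
        const port = fork_base_port + i;
        try {
            const reply = await new Promise((resolve, reject) => {
                const req = https.request(
                    { host: 'localhost', port, path: '/endpoint_fork_id', rejectUnauthorized: false },
                    res => {
                        let body = '';
                        res.on('data', chunk => { body += chunk; });
                        res.on('end', () => resolve(JSON.parse(body)));
                    });
                req.on('error', reject);
                req.end();
            });
            worker_ids.push(reply.worker_id);
        } catch (err) {
            console.warn(`fork on port ${port} did not respond:`, err.message);
        }
    }
    // Forks that never answered on their port are reported as missing by the health script.
    return worker_ids;
}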

How the mechanism for distributing the dedicated ports to the forks works:
Each fork is assigned a unique port based on its index offset from a base port. To allow a new fork to reuse the same port as a dead one, the primary process maintains a list of port offsets. When spawning a fork, the primary first waits for a "ready" message from the fork (indicating it's ready to receive instructions). Only then does the primary send the assigned port. This ensures the fork doesn't miss the port assignment message. The fork then starts its server on the received port. It's crucial that the primary starts listening for the "ready" message immediately after spawning to avoid missing it.
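
A minimal sketch of this ready/assign handshake using Node's cluster module (the message shapes, hard-coded values, and the plain-HTTP stand-in server are assumptions for illustration, not the merged fork_utils.js code):

const cluster = require('node:cluster');

const FORK_PORT_BASE = 6002; // the PR reads this from config.ENDPOINT_FORK_PORT_BASE
const FORK_COUNT = 4;

if (cluster.isPrimary) {
    for (let i = 0; i < FORK_COUNT; i++) {
        const worker = cluster.fork();
        const offset = i; // stable logical offset, reusable if this fork is replaced
        // Listen immediately after spawning so the "ready" message cannot be missed.
        worker.on('message', msg => {
            if (msg.ready_to_start_fork_server) {
                worker.send({ health_port: FORK_PORT_BASE + offset });
            }
        });
    }
} else {
    // Worker: wait for the assigned port, then start the dedicated health server on it.
    process.on('message', msg => {
        if (msg.health_port) {
            // The PR starts an HTTPS server serving /endpoint_fork_id; a plain HTTP
            // stand-in keeps this sketch self-contained.
            require('node:http')
                .createServer((req, res) => res.end(JSON.stringify({ worker_id: cluster.worker.id })))
                .listen(msg.health_port);
        }
    });
    // Tell the primary we are ready to receive the port assignment.
    process.send({ ready_to_start_fork_server: true });
}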

Issues: Fixed #xxx / Gap #xxx

  1. https://issues.redhat.com/browse/DFBUGS-2284

Testing Instructions:

  1. sudo NC_CORETEST=true node ./node_modules/mocha/bin/mocha src/test/unit_tests/test_nc_with_a_couple_of_forks.js
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Introduced a dedicated health server for forked processes in non-containerized environments, improving health check reliability and visibility.
    • Added new configuration options for fork server port base, timeout, and retry attempts.
    • Health checks now query each fork individually, providing more accurate status reporting.
  • Bug Fixes

    • Improved handling of worker process restarts to maintain consistent port assignments and robust server startup.
  • Tests

    • Added comprehensive tests for fork health server behavior, including concurrency and fork restart scenarios.
    • Centralized and enhanced test utilities for mocking health check functions.
  • Chores

    • Updated documentation and parameter types for server-related functions to reflect new health server support.

@nadavMiz force-pushed the fork_server branch 7 times, most recently from 4ff4dfb to 599f367, on May 28, 2025 13:12
@nadavMiz changed the title from "NSFS | add dedicated servers for each forks for health checks" to "NC | add dedicated servers for each forks for health checks" on May 28, 2025
@nadavMiz changed the title from "NC | add dedicated servers for each forks for health checks" to "NC Health | add dedicated servers for each forks for health checks" on May 28, 2025
@nadavMiz force-pushed the fork_server branch 3 times, most recently from e1340e5 to 328e021, on May 29, 2025 16:37
@nadavMiz requested a review from romayalon on May 29, 2025 16:38
@nadavMiz self-assigned this on Jun 3, 2025
@nadavMiz force-pushed the fork_server branch 8 times, most recently from 4d894a9 to 8c93185, on June 8, 2025 15:31
@nadavMiz requested a review from romayalon on June 8, 2025 15:32

coderabbitai bot commented Jun 10, 2025

Walkthrough

The changes implement a dedicated fork health server for non-containerized environments, introducing new configuration parameters, utilities, and robust logic for managing worker processes and their health endpoints. The health check mechanism is refactored to query each fork on its assigned port. Related tests and utilities are updated or added to support and validate this new functionality.

Changes

File(s) and change summary:
  • config.js: Added ENDPOINT_FORK_PORT_BASE, NC_FORK_SERVER_TIMEOUT, and NC_FORK_SERVER_RETRIES configuration parameters for fork server port assignment, timeout, and retries.
  • src/endpoint/endpoint.js: Added dedicated fork health server logic for NC environments, the new service type FORK_HEALTH, and a separate /endpoint_fork_id handler. Exported endpoint_fork_id_handler.
  • src/manage_nsfs/health.js: Refactored NSFSHealth to accept fork_base_port and query each fork on its port for health; updated related methods and logging. Updated get_health_status to support the new logic.
  • src/nc/nc_utils.js: Added and exported the is_nc_environment() utility function.
  • src/test/system_tests/test_utils.js: Added a set_health_mock_functions utility using sinon to mock health-related methods for testing.
  • src/test/unit_tests/nc_coretest.js: Added and exported get_nsfs_fork_pids() to retrieve child fork PIDs for testing.
  • src/test/unit_tests/test_nc_health.js: Removed the local set_mock_functions in favor of the centralized test_utils.set_health_mock_functions. Updated all calls accordingly.
  • src/test/unit_tests/test_nc_with_a_couple_of_forks.js: Added tests for the fork health server, covering concurrency under load and fork restart behavior, using the new health and config utilities.
  • src/util/fork_utils.js: Enhanced worker management with consistent port offsets, per-worker message handlers, fork server startup with timeout/retry, and robust readiness signaling. Added create_worker_message_handler and _send_fork_server_message.
  • src/util/http_utils.js: Added 'FORK_HEALTH' as a valid server type for HTTP/HTTPS server startup and updated logic to exclude it from setup_endpoint_server. Updated JSDoc and parameter types.
  • src/util/ssl_utils.js: Added a FORK_HEALTH certificate entry and updated logic to load certs from the NSFS config for this service type. Minor formatting fix.

Sequence Diagram(s)

sequenceDiagram
    participant MainProcess as Main Process (Primary)
    participant Worker as Worker Process
    participant ForkHealthServer as Fork Health Server
    participant NSFSHealth as NSFS Health Checker

    MainProcess->>Worker: Fork worker (with offset)
    Worker->>MainProcess: "ready_to_start_fork_server"
    MainProcess->>Worker: Command to start fork health server on port (base + offset)
    Worker->>ForkHealthServer: Start fork health server on assigned port
    NSFSHealth->>ForkHealthServer: HTTP GET /endpoint_fork_id
    ForkHealthServer-->>NSFSHealth: Respond with worker ID

Poem

In the warren of forks, each on its port,
The health bunnies hop, with reports to sort.
With new configs and checks, the code’s more robust,
Each worker’s alive—on this you can trust!
🐇✨
Forks in a row, health shining bright,
The system’s as sturdy as a rabbit’s bite!


@coderabbitai bot left a comment


Actionable comments posted: 5

🔭 Outside diff range comments (3)
src/endpoint/endpoint.js (1)

384-395: ⚠️ Potential issue

Unawaited P.delay() is a no-op

endpoint_fork_id_handler fires P.delay(500); without await.
The promise is discarded, producing no delay and an unhandled-rejection risk if P.delay ever rejects.

Either drop the call or convert the handler to async and await it:

-function endpoint_fork_id_handler(req, res) {
+async function endpoint_fork_id_handler(req, res) {
     let reply = {};
     if (cluster.isWorker) {
         reply = { worker_id: cluster.worker.id };
     }
-    P.delay(500);
+    // artificial latency – keep or remove as needed
+    await P.delay(500);
src/manage_nsfs/health.js (2)

103-108: ⚠️ Potential issue

Missing validation for fork_base_port leads to NaN-based port calculation

fork_base_port is accepted as-is from CLI / config and later used in arithmetic (this.fork_base_port + i).
If the value is undefined, non-numeric or an empty string, the result becomes NaN, causing every request to be sent to localhost:NaN, which the Node runtime treats as port 0.

Add a numeric fallback and early validation.

this.fork_base_port = Number(options.fork_base_port);
if (Number.isNaN(this.fork_base_port)) {
    throw new Error('Invalid fork_base_port value');
}

260-299: 🛠️ Refactor suggestion

Sequential fork probing may block the health route under load

for (let i = 0; i < total_fork_count; i++) { await make_endpoint_health_request … } runs serially, so with N forks the timeout is N × per-request-timeout. On a busy node dozens of forks turn a fast health check into seconds.

Launch the probes concurrently and short-circuit after first failure for the common “all good” case.

- for (let i = 0; i < total_fork_count; i++) {
-     const port = this.fork_base_port + i;
-     try {
-         const fork_id_response = await this.make_endpoint_health_request(url_path, port);
-         worker_ids.push(fork_id_response.worker_id);
-     } catch (err) {
-         dbg.log0('Error while pinging fork :' + HOSTNAME + ', port ' + port, err);
-     }
- }
+ await Promise.all(
+     Array.from({ length: total_fork_count }, (_, i) => {
+         const port = this.fork_base_port + i;
+         return this.make_endpoint_health_request(url_path, port)
+             .then(res => worker_ids.push(res.worker_id))
+             .catch(err => dbg.log0(`Error while pinging fork :${HOSTNAME}:${port}`, err));
+     })
+ );
🧹 Nitpick comments (8)
src/test/unit_tests/nc_coretest.js (1)

168-171: Consider adding error handling and cross-platform compatibility.

The function implementation is correct, but consider these improvements:

  1. Error handling: The shell command execution should handle cases where the parent process doesn't exist or the command fails.
  2. Platform compatibility: Verify that the ps command options work consistently across different Unix-like systems.

Consider this enhanced implementation:

 async function get_nsfs_fork_pids() {
+    if (!nsfs_process || !nsfs_process.pid) {
+        return [];
+    }
+    try {
         const res = await os_utils.exec(`ps -o pid --ppid ${nsfs_process.pid}`, { return_stdout: true, trim_stdout: true });
         return res.split('\n').slice(1).map(pid => parseInt(pid, 10));
+    } catch (err) {
+        console.warn('Failed to get fork PIDs:', err.message);
+        return [];
+    }
 }
src/test/system_tests/test_utils.js (1)

865-885: Excellent centralized mocking utility with room for minor improvements.

The function implementation is well-designed and handles the complexities of mocking health functions properly:

Strengths:

  • Proper restoration of existing stubs prevents test interference
  • Smart object selection (Health vs Health.config_fs) based on function name
  • Support for multiple sequential call responses using onCall
  • Clear JSDoc documentation

Consider these enhancements for increased robustness:

 function set_health_mock_functions(Health, mock_function_responses) {
     for (const mock_function_name of Object.keys(mock_function_responses)) {
         const mock_function_responses_arr = mock_function_responses[mock_function_name];
         const obj_to_stub = mock_function_name === 'get_system_config_file' ? Health.config_fs : Health;

+        // Validate that the function exists on the target object
+        if (typeof obj_to_stub[mock_function_name] !== 'function') {
+            console.warn(`Mock function ${mock_function_name} does not exist on target object`);
+            continue;
+        }
+
         if (obj_to_stub[mock_function_name]?.restore) obj_to_stub[mock_function_name]?.restore();
         const stub = sinon.stub(obj_to_stub, mock_function_name);
         for (let i = 0; i < mock_function_responses_arr.length; i++) {
             stub.onCall(i).returns(Promise.resolve(mock_function_responses_arr[i]));
         }
     }
 }
config.js (1)

1023-1024: Timeout unit may be ambiguous

NC_FORK_SERVER_TIMEOUT = 5; // 5 minutes

Down-stream code (e.g. fork_utils) expects the value in minutes, but most other timeout knobs in this file are milliseconds. Consider either:

  1. Converting to milliseconds here (5 * 60 * 1000) to stay consistent, or
  2. Renaming to NC_FORK_SERVER_TIMEOUT_MIN to make the unit explicit.

Minor, yet prevents future misuse.
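
For example, either of the following would make the unit explicit (illustrative only, not the merged code):

// Option 1: express the value in milliseconds, like most other timeouts in config.js
config.NC_FORK_SERVER_TIMEOUT = 5 * 60 * 1000;

// Option 2: keep the value in minutes but encode the unit in the name
config.NC_FORK_SERVER_TIMEOUT_MIN = 5;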

src/util/ssl_utils.js (1)

50-52: Reuse of S3 certificate for fork health endpoints – intentional?

FORK_HEALTH piggy-backs on the S3 cert path. If the S3 cert is rotated independently from the health cert you may leak the health service or, worse, serve the wrong cert on S3.

Consider allowing a dedicated FORK_HEALTH_SERVICE_CERT_PATH (fallback to S3’s path when unset). This keeps responsibilities isolated and simplifies certificate automation.
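
A sketch of what the fallback could look like (the FORK_HEALTH_SERVICE_CERT_PATH key and the parameters are assumptions, not existing ssl_utils.js code):

// Prefer a dedicated cert location for the fork health service and fall back to
// the S3 service cert directory when it is not configured.
function resolve_fork_health_cert_dir(config, s3_cert_dir) {
    return config.FORK_HEALTH_SERVICE_CERT_PATH || s3_cert_dir;
}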

src/util/http_utils.js (1)

859-865: listen_port() silently executes on every cluster message

listen_port skips setup_endpoint_server for 'FORK_HEALTH', which is correct, but the function still attaches generic clientError / keep-alive handlers that assume S3-style traffic.

If the fork-health server ever receives malformed input, the current error path writes an S3-style HTTP reply that isn’t expected by the health script. Extract a minimal setup_health_server (or bypass the extra listeners) to keep the endpoint lean.
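
A stripped-down handler along these lines would keep the fork health endpoint lean (illustrative sketch; the PR's actual handler lives in endpoint.js):

// Minimal handler for the fork health server: only /endpoint_fork_id is served,
// anything else gets a plain 404 instead of an S3-style error reply.
function fork_health_handler(req, res) {
    if (req.url === '/endpoint_fork_id') {
        res.writeHead(200, { 'Content-Type': 'application/json' });
        // The PR replies with cluster.worker.id; process.pid is a stand-in here.
        res.end(JSON.stringify({ worker_id: process.pid }));
        return;
    }
    res.writeHead(404);
    res.end();
}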

src/manage_nsfs/health.js (1)

245-249: make_endpoint_health_request duplicates logic – extract common HTTPS options

Every call recreates the same literal object; consider memoising the base options (host, method, rejectUnauthorized) for readability and minor perf gain.
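
For instance (a sketch; the real class keeps more state than shown here):

class HealthProbeOptions {
    constructor() {
        // Invariant HTTPS options built once instead of per request.
        this.base_options = { host: 'localhost', method: 'GET', rejectUnauthorized: false };
    }
    for_request(url_path, port) {
        // Merge only the fields that change between calls.
        return { ...this.base_options, path: url_path, port };
    }
}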

src/util/fork_utils.js (2)

110-114: await params.worker.kill() awaits a non-Promise

Worker.kill() is synchronous and returns the worker, not a Promise. Awaiting it is a no-op and misleading.

- await params.worker.kill();
+ params.worker.kill();

145-157: Spelling / docs

Comment line 143 contains “enviorment”. Fix to “environment” to keep docs tidy.

- * NOTE - currently runs only on non containerized enviorment
+ * NOTE - currently runs only on non-containerised environment
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5e2803 and d8f1eec.

📒 Files selected for processing (11)
  • config.js (2 hunks)
  • src/endpoint/endpoint.js (5 hunks)
  • src/manage_nsfs/health.js (5 hunks)
  • src/nc/nc_utils.js (1 hunks)
  • src/test/system_tests/test_utils.js (20 hunks)
  • src/test/unit_tests/nc_coretest.js (3 hunks)
  • src/test/unit_tests/test_nc_health.js (23 hunks)
  • src/test/unit_tests/test_nc_with_a_couple_of_forks.js (2 hunks)
  • src/util/fork_utils.js (5 hunks)
  • src/util/http_utils.js (3 hunks)
  • src/util/ssl_utils.js (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/nc/nc_utils.js (1)
src/util/http_utils.js (1)
  • process (41-41)
src/test/system_tests/test_utils.js (4)
src/test/unit_tests/nc_coretest.js (3)
  • require (11-11)
  • require (12-12)
  • require (13-13)
src/test/unit_tests/test_bucketspace_versioning.js (18)
  • require (15-16)
  • require (17-17)
  • i (375-375)
  • i (2266-2266)
  • i (2292-2292)
  • i (2297-2297)
  • i (2317-2317)
  • i (2322-2322)
  • i (2345-2345)
  • i (2351-2351)
  • i (2384-2384)
  • i (2389-2389)
  • i (2395-2395)
  • i (2414-2414)
  • i (2419-2419)
  • i (2426-2426)
  • i (2444-2444)
  • i (2449-2449)
src/test/unit_tests/test_nc_with_a_couple_of_forks.js (1)
  • Health (234-234)
src/test/unit_tests/test_nc_health.js (2)
  • Health (66-66)
  • Health (699-699)
src/test/unit_tests/test_nc_health.js (1)
src/test/unit_tests/sudo_index.js (1)
  • test_utils (6-6)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: run-sanity-tests
  • GitHub Check: run-sanity-ssl-tests
  • GitHub Check: run-unit-tests-postgres
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: ceph-s3-tests
  • GitHub Check: nsfs-ceph-s3-tests
  • GitHub Check: run-nc-unit-tests
  • GitHub Check: warp-nc-tests
  • GitHub Check: run-unit-tests
🔇 Additional comments (5)
src/nc/nc_utils.js (1)

27-32: LGTM! Well-implemented environment detection utility.

The function correctly checks for the NC environment by verifying both the presence and exact value of the environment variable. The strict equality check with 'true' ensures proper boolean evaluation, which is consistent with environment variable handling patterns in the codebase.

src/test/unit_tests/test_nc_health.js (1)

177-177: LGTM! Good refactoring to centralize mock utilities.

This change successfully eliminates code duplication by replacing the local set_mock_functions helper with the centralized test_utils.set_health_mock_functions. This improves maintainability and follows the DRY principle without affecting test behavior.

Also applies to: 209-212, 248-252, 259-263, 273-277, 292-296, 314-318, 339-343, 359-363, 376-380, 393-397, 409-413, 425-429, 444-448, 466-470, 486-490, 509-514, 531-535, 550-555, 572-577, 594-598, 615-619, 632-636

src/test/system_tests/test_utils.js (1)

19-19: LGTM! Appropriate import for mocking functionality.

Adding the sinon import to support the new centralized mocking utility is correct and follows the existing import pattern in the file.

src/test/unit_tests/test_nc_with_a_couple_of_forks.js (1)

248-256: Race condition in concurrent health-check mocking

set_health_mock_functions is called inside the concurrency loop; all 10 promises mutate the global mock almost simultaneously, so later calls may unexpectedly consume another test-case’s stub data.

Set the mocks once before spawning the promises, or clone the Health instance per iteration to ensure deterministic results.
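
A sketch of the first option, inside an async test (the mocked function name, its response payload, and the health call are placeholders; the point is only the once-before-the-loop ordering):

const assert = require('assert');

// Configure the stubs a single time, then fire the concurrent health checks.
test_utils.set_health_mock_functions(Health, {
    get_service_state: [{ service_status: 'active', pid: 1000 }],
});
const results = await Promise.all(
    Array.from({ length: 10 }, () => Health.nc_nsfs_health()) // placeholder health call
);
results.forEach(res => assert.ok(res));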

src/manage_nsfs/health.js (1)

635-647: Hard-coding NC-specific flag leaks into containerised deployments

fork_base_port is always forwarded to NSFSHealth, yet fork servers exist only in NC mode.
Add an environment check (similar to is_nc_environment) before passing / using the value to avoid pointless port math and confusing log noise in containerised runs.
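
One way this could look (a sketch; the option plumbing and require path are assumed, while is_nc_environment is the utility added in src/nc/nc_utils.js):

const { is_nc_environment } = require('../nc/nc_utils');

// Forward fork_base_port only in NC mode, so containerized runs skip the per-fork port math.
function build_health_options(base_options, config) {
    if (!is_nc_environment()) return base_options;
    return { ...base_options, fork_base_port: config.ENDPOINT_FORK_PORT_BASE };
}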

Comment on lines +1009 to 1011
// each fork will get port in range [ENDPOINT_FORK_PORT_BASE, ENDPOINT_FORK_PORT_BASE + number of forks - 1)]
config.ENDPOINT_FORK_PORT_BASE = Number(process.env.ENDPOINT_FORK_PORT_BASE) || 6002;
config.ALLOW_HTTP = false;

🛠️ Refactor suggestion

Guard against port-range collisions for fork servers

ENDPOINT_FORK_PORT_BASE is hard-coded to 6002, while other well-known ports (S3 = 6001, metrics = 7001-7004, etc.) are nearby.
Before the primary starts assigning ports (base + fork_index) you should verify that none of them overlap with:

  • ENDPOINT_PORT / ENDPOINT_SSL_PORT
  • metrics ports
  • any value explicitly supplied via env-vars or config.json

A quick sanity check during startup will save you from cryptic “EADDRINUSE” crashes.

+assert(_.difference(
+    _.range(config.ENDPOINT_FORK_PORT_BASE,
+            config.ENDPOINT_FORK_PORT_BASE + config.ENDPOINT_FORKS),
+    [
+        config.ENDPOINT_PORT,
+        config.ENDPOINT_SSL_PORT,
+        config.EP_METRICS_SERVER_PORT,
+        config.EP_METRICS_SERVER_SSL_PORT
+    ]).length === config.ENDPOINT_FORKS,
+   'Fork-health port range overlaps with an existing service');

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +570 to +576
async function fork_message_request_handler(msg) {
await http_utils.start_https_server(msg.health_port,
SERVICES_TYPES_ENUM.FORK_HEALTH,
fork_main_handler,
msg.nsfs_config_root
);
}

🛠️ Refactor suggestion

Validate message payload before boot-strapping a health server

fork_message_request_handler trusts msg.health_port blindly.
Protect against malformed or repeated messages (cluster control traffic, user error) to avoid EADDRINUSE crashes:

async function fork_message_request_handler(msg) {
-    await http_utils.start_https_server(msg.health_port,
+    if (!msg || typeof msg.health_port !== 'number') return;
+    await http_utils.start_https_server(msg.health_port,
         SERVICES_TYPES_ENUM.FORK_HEALTH,
📝 Committable suggestion


Suggested change
 async function fork_message_request_handler(msg) {
+    if (!msg || typeof msg.health_port !== 'number') return;
     await http_utils.start_https_server(msg.health_port,
         SERVICES_TYPES_ENUM.FORK_HEALTH,
         fork_main_handler,
         msg.nsfs_config_root
     );
 }
🤖 Prompt for AI Agents
In src/endpoint/endpoint.js around lines 570 to 576, the function
fork_message_request_handler uses msg.health_port without validation, which can
cause EADDRINUSE errors if the port is invalid or already in use. Add validation
to check that msg.health_port is a valid, unused port number before calling
start_https_server. Also, implement logic to handle repeated or malformed
messages gracefully to prevent server crashes.

Comment on lines +40 to 46
const fork_port_offsets = [];
if (cluster.isPrimary && count > 0) {
for (let i = 0; i < count; ++i) {
const worker = cluster.fork();
console.warn('WORKER started', { id: worker.id, pid: worker.process.pid });
fork_port_offsets.push(worker.id);
}

⚠️ Potential issue

fork_port_offsets stores worker IDs instead of offsets

You push worker.id (1…n) and later treat its index as the offset.
If any worker dies before fork_port_offsets.push, or if Node assigns non-sequential IDs, offsets drift.

Store explicit offsets instead:

- fork_port_offsets.push(worker.id);
+ fork_port_offsets.push(i); // stable logical offset
📝 Committable suggestion


Suggested change
 const fork_port_offsets = [];
 if (cluster.isPrimary && count > 0) {
     for (let i = 0; i < count; ++i) {
         const worker = cluster.fork();
         console.warn('WORKER started', { id: worker.id, pid: worker.process.pid });
-        fork_port_offsets.push(worker.id);
+        fork_port_offsets.push(i); // stable logical offset
     }
🤖 Prompt for AI Agents
In src/util/fork_utils.js around lines 40 to 46, the array fork_port_offsets
currently stores worker IDs instead of explicit offsets, which can cause offset
drift if workers die early or IDs are non-sequential. To fix this, push the loop
index i (the explicit offset) into fork_port_offsets instead of worker.id,
ensuring offsets remain consistent regardless of worker ID assignment or
lifecycle.

Signed-off-by: nadav mizrahi <nadav.mizrahi16@gmail.com>

@coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
src/util/fork_utils.js (3)

40-46: Critical issue: fork_port_offsets stores worker IDs instead of stable offsets

The current implementation pushes worker.id into fork_port_offsets, but later treats the array index as the port offset. This can cause offset drift if workers die before being added to the array or if Node.js assigns non-sequential worker IDs.

Store explicit offsets instead:

-            fork_port_offsets.push(worker.id);
+            fork_port_offsets.push(i); // stable logical offset based on loop index

72-82: Critical issue: offset can be -1, leading to invalid port calculation

When findIndex doesn't find the worker ID, it returns -1, which would result in an invalid port like ENDPOINT_FORK_PORT_BASE - 1. This case needs proper error handling.

Add validation before using the offset:

 const offset = fork_port_offsets.findIndex(id => id === worker.id);
+if (offset === -1) {
+    dbg.error(`Failed to map worker ${worker.id} to port offset – skipping fork server setup`);
+    return;
+}

105-118: Critical issue: Global fork_server_retries couples unrelated workers

The shared retry counter means one worker exhausting its retries prevents all other workers from recovering later. Each worker should have independent retry tracking.

Implement per-worker retry tracking:

-//global variable as it is shared between all workers
-let fork_server_retries = config.NC_FORK_SERVER_RETRIES;
+const worker_retries = new Map();

 function create_worker_message_handler(params) {
+    if (!worker_retries.has(params.worker.id)) {
+        worker_retries.set(params.worker.id, config.NC_FORK_SERVER_RETRIES);
+    }
     let fork_server_timer;
-    if (is_nc_environment() && fork_server_retries > 0) {
+    if (is_nc_environment() && worker_retries.get(params.worker.id) > 0) {
         fork_server_timer = setTimeout(async () => {
             dbg.error(`Timeout: It took more than ${config.NC_FORK_SERVER_TIMEOUT} minutes to get a ready message from worker ${params.worker.id}, killing the worker`);
-            fork_server_retries -= 1;
+            const retries = worker_retries.get(params.worker.id) - 1;
+            worker_retries.set(params.worker.id, retries);
             await params.worker.kill();
-            if (fork_server_retries <= 0) {
+            if (retries <= 0) {
                 dbg.error(`ERROR: fork health server failed to start for ${config.NC_FORK_SERVER_RETRIES} attempts stop retrying to reload the worker`);
             }
         }, config.NC_FORK_SERVER_TIMEOUT * 60 * 1000);
🧹 Nitpick comments (1)
src/util/fork_utils.js (1)

119-139: Message handler structure is correct but could benefit from early returns

The message handler correctly processes different message types, but the nested if statements could be simplified for better readability.

Consider using early returns for cleaner code structure:

 return function(msg) {
-        if (msg.io_stats) {
+        if (msg.io_stats) {
         for (const [key, value] of Object.entries(msg.io_stats)) {
             io_stats[key] += value;
         }
         prom_reporting.set_io_stats(io_stats);
-        }
-        if (msg.op_stats) {
+            return;
+        }
+        
+        if (msg.op_stats) {
             _update_ops_stats(msg.op_stats);
             prom_reporting.set_ops_stats(op_stats);
-        }
-        if (msg.fs_workers_stats) {
+            return;
+        }
+        
+        if (msg.fs_workers_stats) {
             _update_fs_stats(msg.fs_workers_stats);
             prom_reporting.set_fs_worker_stats(fs_workers_stats);
-        }
-        if (msg.ready_to_start_fork_server) {
+            return;
+        }
+        
+        if (msg.ready_to_start_fork_server) {
             clearTimeout(fork_server_timer);
             _send_fork_server_message(params);
-        }
+        }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8f1eec and 5423b6e.

📒 Files selected for processing (11)
  • config.js (2 hunks)
  • src/endpoint/endpoint.js (5 hunks)
  • src/manage_nsfs/health.js (5 hunks)
  • src/nc/nc_utils.js (1 hunks)
  • src/test/system_tests/test_utils.js (20 hunks)
  • src/test/unit_tests/nc_coretest.js (3 hunks)
  • src/test/unit_tests/test_nc_health.js (23 hunks)
  • src/test/unit_tests/test_nc_with_a_couple_of_forks.js (2 hunks)
  • src/util/fork_utils.js (5 hunks)
  • src/util/http_utils.js (3 hunks)
  • src/util/ssl_utils.js (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • src/test/unit_tests/test_nc_health.js
  • src/nc/nc_utils.js
  • src/test/unit_tests/nc_coretest.js
  • config.js
  • src/util/ssl_utils.js
  • src/test/unit_tests/test_nc_with_a_couple_of_forks.js
  • src/util/http_utils.js
  • src/endpoint/endpoint.js
  • src/test/system_tests/test_utils.js
  • src/manage_nsfs/health.js
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: run-unit-tests
  • GitHub Check: run-package-lock-validation
  • GitHub Check: run-unit-tests-postgres
  • GitHub Check: nsfs-ceph-s3-tests
  • GitHub Check: warp-tests
  • GitHub Check: warp-nc-tests
  • GitHub Check: ceph-s3-tests
  • GitHub Check: run-nc-unit-tests
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-sanity-tests
  • GitHub Check: run-sanity-ssl-tests

Comment on lines 84 to 93
if (id) {
cluster.workers[id].on('message', nsfs_io_stats_handler);
const offset = fork_port_offsets.findIndex(worker_id => worker_id === cluster.workers[id].id);
const listener = create_worker_message_handler({
worker: cluster.workers[id],
offset: offset,
nsfs_config_root: nsfs_config_root
});
cluster.workers[id].on('message', listener);
}
}

🛠️ Refactor suggestion

Potential duplicate offset assignment issue

The existing workers setup uses the same findIndex logic as the restart handler, which could assign the same offset to multiple workers if there are duplicate IDs in fork_port_offsets.

Ensure unique offset assignment by cleaning up the offset tracking logic when combined with the previous fixes for storing actual offsets instead of worker IDs.
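
One possible shape for that tracking, using a map from worker id to offset plus a pool of freed offsets (the data structures and function names are invented for illustration):

// Track which offset each live worker holds and which offsets are free for reuse.
const worker_offsets = new Map();   // worker.id -> port offset
const free_offsets = [];            // offsets released by dead workers

function assign_offset(worker, next_offset) {
    const offset = free_offsets.length > 0 ? free_offsets.shift() : next_offset;
    worker_offsets.set(worker.id, offset);
    return offset;
}

function release_offset(worker) {
    const offset = worker_offsets.get(worker.id);
    if (offset === undefined) return;
    worker_offsets.delete(worker.id);
    free_offsets.push(offset); // a replacement fork can take over this port
}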

🤖 Prompt for AI Agents
In src/util/fork_utils.js around lines 84 to 93, the offset assignment uses
findIndex on fork_port_offsets with worker IDs, which can cause duplicate
offsets if IDs repeat. To fix this, update the offset tracking logic to store
and assign unique offsets rather than relying on worker IDs. Ensure that when
assigning offsets to workers, you remove or mark used offsets to prevent
duplicates, aligning this with previous fixes that store actual offsets instead
of IDs.

health_port: config.ENDPOINT_FORK_PORT_BASE + offset
});
} catch (err) {
dbg.warn(`Timeout: It took more than 5 minute to get a message from worker ${worker.id} do not send start server message`);

🛠️ Refactor suggestion

Misleading error message and insufficient error handling

The error message mentions "5 minute" timeout but should use the configurable timeout value. Also, the catch block only logs a warning when worker communication fails - this could leave the fork in an inconsistent state.

Improve error handling and message accuracy:

         } catch (err) {
-            dbg.warn(`Timeout: It took more than 5 minute to get a message from worker ${worker.id} do not send start server message`);
+            dbg.error(`Failed to send fork server message to worker ${worker.id}: ${err.message}`);
+            // Consider killing the worker if communication fails
📝 Committable suggestion


Suggested change
 } catch (err) {
-    dbg.warn(`Timeout: It took more than 5 minute to get a message from worker ${worker.id} do not send start server message`);
+    dbg.error(`Failed to send fork server message to worker ${worker.id}: ${err.message}`);
+    // Consider killing the worker if communication fails
 }
🤖 Prompt for AI Agents
In src/util/fork_utils.js at line 155, update the timeout warning message to use
the actual configurable timeout value instead of the hardcoded "5 minute" text
for accuracy. Additionally, enhance the catch block to handle worker
communication failures more robustly by adding proper error handling logic
beyond just logging a warning, such as cleaning up or resetting the fork state
to avoid inconsistencies.

@nadavMiz merged commit 6b67473 into noobaa:master on Jun 10, 2025
14 checks passed