NC Health | add dedicated servers for each forks for health checks #9045
Conversation
Walkthrough

The changes implement a dedicated fork health server for non-containerized environments, introducing new configuration parameters, utilities, and robust logic for managing worker processes and their health endpoints. The health check mechanism is refactored to query each fork on its assigned port. Related tests and utilities are updated or added to support and validate this new functionality.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant MainProcess as Main Process (Primary)
    participant Worker as Worker Process
    participant ForkHealthServer as Fork Health Server
    participant NSFSHealth as NSFS Health Checker
    MainProcess->>Worker: Fork worker (with offset)
    Worker->>MainProcess: "ready_to_start_fork_server"
    MainProcess->>Worker: Command to start fork health server on port (base + offset)
    Worker->>ForkHealthServer: Start fork health server on assigned port
    NSFSHealth->>ForkHealthServer: HTTP GET /endpoint_fork_id
    ForkHealthServer-->>NSFSHealth: Respond with worker ID
```
Actionable comments posted: 5
🔭 Outside diff range comments (3)
src/endpoint/endpoint.js (1)

384-395: ⚠️ Potential issue: Unawaited P.delay() is a no-op

endpoint_fork_id_handler fires P.delay(500); without await. The promise is discarded, producing no delay and an unhandled-rejection risk if P.delay ever rejects. Either drop the call or convert the handler to async and await it:

```diff
-function endpoint_fork_id_handler(req, res) {
+async function endpoint_fork_id_handler(req, res) {
     let reply = {};
     if (cluster.isWorker) {
         reply = { worker_id: cluster.worker.id };
     }
-    P.delay(500);
+    // artificial latency – keep or remove as needed
+    await P.delay(500);
```

src/manage_nsfs/health.js (2)
103-108: ⚠️ Potential issue: Missing validation for fork_base_port leads to NaN-based port calculation

fork_base_port is accepted as-is from CLI / config and later used in arithmetic (this.fork_base_port + i). If the value is undefined, non-numeric or an empty string, the result becomes NaN, causing every request to be sent to localhost:NaN, which the Node runtime treats as port 0. Add a numeric fallback and early validation:

```js
this.fork_base_port = Number(options.fork_base_port);
if (Number.isNaN(this.fork_base_port)) {
    throw new Error('Invalid fork_base_port value');
}
```
260-299: 🛠️ Refactor suggestion: Sequential fork probing may block the health route under load

for (let i = 0; i < total_fork_count; i++) { await make_endpoint_health_request … } runs serially, so with N forks the timeout is N × per-request-timeout. On a busy node dozens of forks turn a fast health check into seconds. Launch the probes concurrently and short-circuit after the first failure for the common “all good” case.

```diff
-        for (let i = 0; i < total_fork_count; i++) {
-            const port = this.fork_base_port + i;
-            try {
-                const fork_id_response = await this.make_endpoint_health_request(url_path, port);
-                worker_ids.push(fork_id_response.worker_id);
-            } catch (err) {
-                dbg.log0('Error while pinging fork :' + HOSTNAME + ', port ' + port, err);
-            }
-        }
+        await Promise.all(
+            Array.from({ length: total_fork_count }, (_, i) => {
+                const port = this.fork_base_port + i;
+                return this.make_endpoint_health_request(url_path, port)
+                    .then(res => worker_ids.push(res.worker_id))
+                    .catch(err => dbg.log0(`Error while pinging fork :${HOSTNAME}:${port}`, err));
+            })
+        );
```
🧹 Nitpick comments (8)
src/test/unit_tests/nc_coretest.js (1)

168-171: Consider adding error handling and cross-platform compatibility.

The function implementation is correct, but consider these improvements:
- Error handling: The shell command execution should handle cases where the parent process doesn't exist or the command fails.
- Platform compatibility: Verify that the ps command options work consistently across different Unix-like systems.

Consider this enhanced implementation:

```diff
 async function get_nsfs_fork_pids() {
+    if (!nsfs_process || !nsfs_process.pid) {
+        return [];
+    }
+    try {
         const res = await os_utils.exec(`ps -o pid --ppid ${nsfs_process.pid}`, { return_stdout: true, trim_stdout: true });
         return res.split('\n').slice(1).map(pid => parseInt(pid, 10));
+    } catch (err) {
+        console.warn('Failed to get fork PIDs:', err.message);
+        return [];
+    }
 }
```

src/test/system_tests/test_utils.js (1)
865-885: Excellent centralized mocking utility with room for minor improvements.

The function implementation is well-designed and handles the complexities of mocking health functions properly.

✅ Strengths:
- Proper restoration of existing stubs prevents test interference
- Smart object selection (Health vs Health.config_fs) based on function name
- Support for multiple sequential call responses using onCall
- Clear JSDoc documentation

Consider these enhancements for increased robustness:

```diff
 function set_health_mock_functions(Health, mock_function_responses) {
     for (const mock_function_name of Object.keys(mock_function_responses)) {
         const mock_function_responses_arr = mock_function_responses[mock_function_name];
         const obj_to_stub = mock_function_name === 'get_system_config_file' ? Health.config_fs : Health;
+        // Validate that the function exists on the target object
+        if (typeof obj_to_stub[mock_function_name] !== 'function') {
+            console.warn(`Mock function ${mock_function_name} does not exist on target object`);
+            continue;
+        }
+
         if (obj_to_stub[mock_function_name]?.restore) obj_to_stub[mock_function_name]?.restore();
         const stub = sinon.stub(obj_to_stub, mock_function_name);
         for (let i = 0; i < mock_function_responses_arr.length; i++) {
             stub.onCall(i).returns(Promise.resolve(mock_function_responses_arr[i]));
         }
     }
 }
```

config.js (1)
1023-1024: Timeout unit may be ambiguous

NC_FORK_SERVER_TIMEOUT = 5; // 5 minutes

Down-stream code (e.g. fork_utils) expects the value in minutes, but most other timeout knobs in this file are milliseconds. Consider either:
- Converting to milliseconds here (5 * 60 * 1000) to stay consistent, or
- Renaming to NC_FORK_SERVER_TIMEOUT_MIN to make the unit explicit.

Minor, yet prevents future misuse.
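For illustration only, the two options could look like this in config.js (NC_FORK_SERVER_TIMEOUT_MIN is the suggested rename, not an existing key):

```js
// Option 1 – keep the current name but store milliseconds, consistent with the
// other timeout knobs in config.js.
config.NC_FORK_SERVER_TIMEOUT = 5 * 60 * 1000; // 5 minutes in ms

// Option 2 – keep the value in minutes and make the unit explicit in the name.
config.NC_FORK_SERVER_TIMEOUT_MIN = 5;
```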
src/util/ssl_utils.js (1)
50-52: Reuse of S3 certificate for fork health endpoints – intentional?

FORK_HEALTH piggy-backs on the S3 cert path. If the S3 cert is rotated independently from the health cert you may leak the health service or, worse, serve the wrong cert on S3. Consider allowing a dedicated FORK_HEALTH_SERVICE_CERT_PATH (fallback to S3’s path when unset). This keeps responsibilities isolated and simplifies certificate automation.
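A rough sketch of how such a fallback could be wired (the FORK_HEALTH_SERVICE_CERT_PATH key and the name of the S3 cert setting are assumptions, not existing config):

```js
// Sketch only: prefer a dedicated fork-health cert dir, fall back to the S3 cert dir.
function get_fork_health_cert_dir(config) {
    // Both property names below are hypothetical placeholders.
    return config.FORK_HEALTH_SERVICE_CERT_PATH || config.S3_SERVICE_CERT_PATH;
}
```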
src/util/http_utils.js (1)

859-865: listen_port() silently executes on every cluster message

listen_port skips setup_endpoint_server for 'FORK_HEALTH', which is correct, but the function still attaches generic clientError / keep-alive handlers that assume S3-style traffic. If the fork-health server ever receives malformed input, the current error path writes an S3-style HTTP reply that isn’t expected by the health script. Extract a minimal setup_health_server (or bypass the extra listeners) to keep the endpoint lean.
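One possible shape for that, sketched under the assumption that the fork-health server only needs to drop malformed connections rather than answer with S3-style errors (the function name is the reviewer's suggestion; the body is illustrative):

```js
// Sketch: minimal wiring for the fork-health server – no S3-style clientError reply,
// just close bad connections so the health script sees a failed request.
function setup_health_server(server) {
    server.on('clientError', (err, socket) => {
        if (!socket.destroyed) socket.destroy(err);
    });
}
```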
src/manage_nsfs/health.js (1)

245-249: make_endpoint_health_request duplicates logic – extract common HTTPS options

Every call recreates the same literal object; consider memoising the base options (host, method, rejectUnauthorized) for readability and a minor perf gain.
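For example, a hedged sketch of the memoisation (option names follow Node's https.request; the helper name is invented):

```js
// Sketch: build the invariant request options once; only the port and path vary per fork.
const BASE_HEALTH_REQUEST_OPTIONS = Object.freeze({
    hostname: 'localhost',
    method: 'GET',
    rejectUnauthorized: false, // self-signed fork-health certs
});

function build_health_request_options(port, path) {
    return { ...BASE_HEALTH_REQUEST_OPTIONS, port, path };
}
```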
src/util/fork_utils.js (2)
110-114: await params.worker.kill() awaits a non-Promise

Worker.kill() is synchronous and returns the worker, not a Promise. Awaiting it is a no-op and misleading.

```diff
-        await params.worker.kill();
+        params.worker.kill();
```
145-157: Spelling / docs

The comment on line 143 contains “enviorment”. Fix to “environment” to keep docs tidy.

```diff
- * NOTE - currently runs only on non containerized enviorment
+ * NOTE - currently runs only on non-containerised environment
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- config.js (2 hunks)
- src/endpoint/endpoint.js (5 hunks)
- src/manage_nsfs/health.js (5 hunks)
- src/nc/nc_utils.js (1 hunks)
- src/test/system_tests/test_utils.js (20 hunks)
- src/test/unit_tests/nc_coretest.js (3 hunks)
- src/test/unit_tests/test_nc_health.js (23 hunks)
- src/test/unit_tests/test_nc_with_a_couple_of_forks.js (2 hunks)
- src/util/fork_utils.js (5 hunks)
- src/util/http_utils.js (3 hunks)
- src/util/ssl_utils.js (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/nc/nc_utils.js (1)
- src/util/http_utils.js (1): process (41-41)

src/test/system_tests/test_utils.js (4)
- src/test/unit_tests/nc_coretest.js (3): require (11-11), require (12-12), require (13-13)
- src/test/unit_tests/test_bucketspace_versioning.js (18): require (15-16), require (17-17), i (375-375), i (2266-2266), i (2292-2292), i (2297-2297), i (2317-2317), i (2322-2322), i (2345-2345), i (2351-2351), i (2384-2384), i (2389-2389), i (2395-2395), i (2414-2414), i (2419-2419), i (2426-2426), i (2444-2444), i (2449-2449)
- src/test/unit_tests/test_nc_with_a_couple_of_forks.js (1): Health (234-234)
- src/test/unit_tests/test_nc_health.js (2): Health (66-66), Health (699-699)

src/test/unit_tests/test_nc_health.js (1)
- src/test/unit_tests/sudo_index.js (1): test_utils (6-6)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: run-sanity-tests
- GitHub Check: run-sanity-ssl-tests
- GitHub Check: run-unit-tests-postgres
- GitHub Check: run-jest-unit-tests
- GitHub Check: ceph-s3-tests
- GitHub Check: nsfs-ceph-s3-tests
- GitHub Check: run-nc-unit-tests
- GitHub Check: warp-nc-tests
- GitHub Check: run-unit-tests
🔇 Additional comments (5)
src/nc/nc_utils.js (1)
27-32: LGTM! Well-implemented environment detection utility.

The function correctly checks for the NC environment by verifying both the presence and exact value of the environment variable. The strict equality check with 'true' ensures proper boolean evaluation, which is consistent with environment variable handling patterns in the codebase.
src/test/unit_tests/test_nc_health.js (1)
177-177: LGTM! Good refactoring to centralize mock utilities.

This change successfully eliminates code duplication by replacing the local set_mock_functions helper with the centralized test_utils.set_health_mock_functions. This improves maintainability and follows the DRY principle without affecting test behavior.

Also applies to: 209-212, 248-252, 259-263, 273-277, 292-296, 314-318, 339-343, 359-363, 376-380, 393-397, 409-413, 425-429, 444-448, 466-470, 486-490, 509-514, 531-535, 550-555, 572-577, 594-598, 615-619, 632-636
src/test/system_tests/test_utils.js (1)
19-19: LGTM! Appropriate import for mocking functionality.

Adding the sinon import to support the new centralized mocking utility is correct and follows the existing import pattern in the file.

src/test/unit_tests/test_nc_with_a_couple_of_forks.js (1)
248-256: Race condition in concurrent health-check mocking

set_health_mock_functions is called inside the concurrency loop; all 10 promises mutate the global mock almost simultaneously, so later calls may unexpectedly consume another test-case’s stub data. Set the mocks once before spawning the promises, or clone the Health instance per iteration to ensure deterministic results.
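A hedged sketch of the suggested ordering (set_health_mock_functions is the utility added in this PR; the mocked function name, its canned response, and the per-iteration helper are placeholders):

```js
// Sketch: stub once, then fire the concurrent iterations, so no iteration
// re-stubs the shared Health object while another one is mid-flight.
test_utils.set_health_mock_functions(Health, {
    get_system_config_file: [mock_system_config_response], // placeholder response value
});
await Promise.all(Array.from({ length: 10 }, (_, i) => run_single_fork_check(i))); // placeholder helper
```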
src/manage_nsfs/health.js (1)

635-647: Hard-coding NC-specific flag leaks into containerised deployments

fork_base_port is always forwarded to NSFSHealth, yet fork servers exist only in NC mode. Add an environment check (similar to is_nc_environment) before passing / using the value to avoid pointless port math and confusing log noise in containerised runs.
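A possible guard, sketched with assumptions about how the options object is assembled (only is_nc_environment and config.ENDPOINT_FORK_PORT_BASE are taken from this PR; the rest is illustrative):

```js
// Sketch: only forward fork_base_port to NSFSHealth when running in NC (non-containerized) mode.
const { is_nc_environment } = require('../nc/nc_utils'); // import path assumed
const health_options = {
    ...existing_options, // placeholder for whatever options are passed today
    ...(is_nc_environment() ? { fork_base_port: config.ENDPOINT_FORK_PORT_BASE } : {}),
};
const health = new NSFSHealth(health_options);
```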
```js
// each fork will get port in range [ENDPOINT_FORK_PORT_BASE, ENDPOINT_FORK_PORT_BASE + number of forks - 1)]
config.ENDPOINT_FORK_PORT_BASE = Number(process.env.ENDPOINT_FORK_PORT_BASE) || 6002;
config.ALLOW_HTTP = false;
```
🛠️ Refactor suggestion
Guard against port-range collisions for fork servers
ENDPOINT_FORK_PORT_BASE is hard-coded to 6002, while other well-known ports (S3 = 6001, metrics = 7001-7004, etc.) are nearby. Before the primary starts assigning ports (base + fork_index) you should verify that none of them overlap with:
- ENDPOINT_PORT / ENDPOINT_SSL_PORT
- metrics ports
- any value explicitly supplied via env-vars or config.json

A quick sanity check during startup will save you from cryptic “EADDRINUSE” crashes.
```diff
+assert(_.difference(
+    _.range(config.ENDPOINT_FORK_PORT_BASE,
+        config.ENDPOINT_FORK_PORT_BASE + config.ENDPOINT_FORKS),
+    [
+        config.ENDPOINT_PORT,
+        config.ENDPOINT_SSL_PORT,
+        config.EP_METRICS_SERVER_PORT,
+        config.EP_METRICS_SERVER_SSL_PORT
+    ]).length === config.ENDPOINT_FORKS,
+    'Fork-health port range overlaps with an existing service');
```
Committable suggestion skipped: line range outside the PR's diff.
```js
async function fork_message_request_handler(msg) {
    await http_utils.start_https_server(msg.health_port,
        SERVICES_TYPES_ENUM.FORK_HEALTH,
        fork_main_handler,
        msg.nsfs_config_root
    );
}
```
🛠️ Refactor suggestion
Validate message payload before boot-strapping a health server
fork_message_request_handler trusts msg.health_port blindly. Protect against malformed or repeated messages (cluster control traffic, user error) to avoid EADDRINUSE crashes:
```diff
 async function fork_message_request_handler(msg) {
-    await http_utils.start_https_server(msg.health_port,
+    if (!msg || typeof msg.health_port !== 'number') return;
+    await http_utils.start_https_server(msg.health_port,
         SERVICES_TYPES_ENUM.FORK_HEALTH,
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
async function fork_message_request_handler(msg) {
    if (!msg || typeof msg.health_port !== 'number') return;
    await http_utils.start_https_server(msg.health_port,
        SERVICES_TYPES_ENUM.FORK_HEALTH,
        fork_main_handler,
        msg.nsfs_config_root
    );
}
```
🤖 Prompt for AI Agents
In src/endpoint/endpoint.js around lines 570 to 576, the function
fork_message_request_handler uses msg.health_port without validation, which can
cause EADDRINUSE errors if the port is invalid or already in use. Add validation
to check that msg.health_port is a valid, unused port number before calling
start_https_server. Also, implement logic to handle repeated or malformed
messages gracefully to prevent server crashes.
```js
const fork_port_offsets = [];
if (cluster.isPrimary && count > 0) {
    for (let i = 0; i < count; ++i) {
        const worker = cluster.fork();
        console.warn('WORKER started', { id: worker.id, pid: worker.process.pid });
        fork_port_offsets.push(worker.id);
    }
```
fork_port_offsets stores worker IDs instead of offsets

You push worker.id (1…n) and later treat its index as the offset. If any worker dies before fork_port_offsets.push, or if Node assigns non-sequential IDs, offsets drift. Store explicit offsets instead:

```diff
-            fork_port_offsets.push(worker.id);
+            fork_port_offsets.push(i); // stable logical offset
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
const fork_port_offsets = [];
if (cluster.isPrimary && count > 0) {
    for (let i = 0; i < count; ++i) {
        const worker = cluster.fork();
        console.warn('WORKER started', { id: worker.id, pid: worker.process.pid });
        fork_port_offsets.push(i); // stable logical offset
    }
}
```
🤖 Prompt for AI Agents
In src/util/fork_utils.js around lines 40 to 46, the array fork_port_offsets
currently stores worker IDs instead of explicit offsets, which can cause offset
drift if workers die early or IDs are non-sequential. To fix this, push the loop
index i (the explicit offset) into fork_port_offsets instead of worker.id,
ensuring offsets remain consistent regardless of worker ID assignment or
lifecycle.
Signed-off-by: nadav mizrahi <nadav.mizrahi16@gmail.com>
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/util/fork_utils.js (3)
40-46: Critical issue: fork_port_offsets stores worker IDs instead of stable offsets

The current implementation pushes worker.id into fork_port_offsets, but later treats the array index as the port offset. This can cause offset drift if workers die before being added to the array or if Node.js assigns non-sequential worker IDs. Store explicit offsets instead:

```diff
-            fork_port_offsets.push(worker.id);
+            fork_port_offsets.push(i); // stable logical offset based on loop index
```
72-82: Critical issue: offset can be -1, leading to invalid port calculation

When findIndex doesn't find the worker ID, it returns -1, which would result in an invalid port like ENDPOINT_FORK_PORT_BASE - 1. This case needs proper error handling. Add validation before using the offset:

```diff
 const offset = fork_port_offsets.findIndex(id => id === worker.id);
+if (offset === -1) {
+    dbg.error(`Failed to map worker ${worker.id} to port offset – skipping fork server setup`);
+    return;
+}
```
105-118: Critical issue: Global fork_server_retries couples unrelated workers

The shared retry counter means one worker exhausting its retries prevents all other workers from recovering later. Each worker should have independent retry tracking. Implement per-worker retry tracking:

```diff
-//global variable as it is shared between all workers
-let fork_server_retries = config.NC_FORK_SERVER_RETRIES;
+const worker_retries = new Map();

 function create_worker_message_handler(params) {
+    if (!worker_retries.has(params.worker.id)) {
+        worker_retries.set(params.worker.id, config.NC_FORK_SERVER_RETRIES);
+    }
     let fork_server_timer;
-    if (is_nc_environment() && fork_server_retries > 0) {
+    if (is_nc_environment() && worker_retries.get(params.worker.id) > 0) {
         fork_server_timer = setTimeout(async () => {
             dbg.error(`Timeout: It took more than ${config.NC_FORK_SERVER_TIMEOUT} minutes to get a ready message from worker ${params.worker.id}, killing the worker`);
-            fork_server_retries -= 1;
+            const retries = worker_retries.get(params.worker.id) - 1;
+            worker_retries.set(params.worker.id, retries);
             await params.worker.kill();
-            if (fork_server_retries <= 0) {
+            if (retries <= 0) {
                 dbg.error(`ERROR: fork health server failed to start for ${config.NC_FORK_SERVER_RETRIES} attempts stop retrying to reload the worker`);
             }
         }, config.NC_FORK_SERVER_TIMEOUT * 60 * 1000);
```
🧹 Nitpick comments (1)
src/util/fork_utils.js (1)
119-139: Message handler structure is correct but could benefit from early returns

The message handler correctly processes different message types, but the nested if statements could be simplified for better readability. Consider using early returns for cleaner code structure:

```diff
 return function(msg) {
     if (msg.io_stats) {
         for (const [key, value] of Object.entries(msg.io_stats)) {
             io_stats[key] += value;
         }
         prom_reporting.set_io_stats(io_stats);
+        return;
     }
+
     if (msg.op_stats) {
         _update_ops_stats(msg.op_stats);
         prom_reporting.set_ops_stats(op_stats);
+        return;
     }
+
     if (msg.fs_workers_stats) {
         _update_fs_stats(msg.fs_workers_stats);
         prom_reporting.set_fs_worker_stats(fs_workers_stats);
+        return;
     }
+
     if (msg.ready_to_start_fork_server) {
         clearTimeout(fork_server_timer);
         _send_fork_server_message(params);
     }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- config.js (2 hunks)
- src/endpoint/endpoint.js (5 hunks)
- src/manage_nsfs/health.js (5 hunks)
- src/nc/nc_utils.js (1 hunks)
- src/test/system_tests/test_utils.js (20 hunks)
- src/test/unit_tests/nc_coretest.js (3 hunks)
- src/test/unit_tests/test_nc_health.js (23 hunks)
- src/test/unit_tests/test_nc_with_a_couple_of_forks.js (2 hunks)
- src/util/fork_utils.js (5 hunks)
- src/util/http_utils.js (3 hunks)
- src/util/ssl_utils.js (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- src/test/unit_tests/test_nc_health.js
- src/nc/nc_utils.js
- src/test/unit_tests/nc_coretest.js
- config.js
- src/util/ssl_utils.js
- src/test/unit_tests/test_nc_with_a_couple_of_forks.js
- src/util/http_utils.js
- src/endpoint/endpoint.js
- src/test/system_tests/test_utils.js
- src/manage_nsfs/health.js
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: run-unit-tests
- GitHub Check: run-package-lock-validation
- GitHub Check: run-unit-tests-postgres
- GitHub Check: nsfs-ceph-s3-tests
- GitHub Check: warp-tests
- GitHub Check: warp-nc-tests
- GitHub Check: ceph-s3-tests
- GitHub Check: run-nc-unit-tests
- GitHub Check: run-jest-unit-tests
- GitHub Check: run-sanity-tests
- GitHub Check: run-sanity-ssl-tests
```js
    if (id) {
        cluster.workers[id].on('message', nsfs_io_stats_handler);
        const offset = fork_port_offsets.findIndex(worker_id => worker_id === cluster.workers[id].id);
        const listener = create_worker_message_handler({
            worker: cluster.workers[id],
            offset: offset,
            nsfs_config_root: nsfs_config_root
        });
        cluster.workers[id].on('message', listener);
    }
}
```
🛠️ Refactor suggestion
Potential duplicate offset assignment issue
The existing workers setup uses the same findIndex logic as the restart handler, which could assign the same offset to multiple workers if there are duplicate IDs in fork_port_offsets.
Ensure unique offset assignment by cleaning up the offset tracking logic when combined with the previous fixes for storing actual offsets instead of worker IDs.
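One hedged way to keep assignments unique, sketched here independently of the actual fork_utils.js structures (all names below are illustrative):

```js
// Sketch: hand out offsets from a free-list so a replacement worker reuses the slot
// of the worker it replaces, and no two live workers ever share an offset.
const free_offsets = [];          // offsets released by dead workers
const worker_offsets = new Map(); // worker.id -> offset
let next_offset = 0;

function assign_offset(worker) {
    const offset = free_offsets.length > 0 ? free_offsets.shift() : next_offset++;
    worker_offsets.set(worker.id, offset);
    return offset;
}

function release_offset(worker) {
    const offset = worker_offsets.get(worker.id);
    if (offset === undefined) return;
    worker_offsets.delete(worker.id);
    free_offsets.push(offset);
}
```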
🤖 Prompt for AI Agents
In src/util/fork_utils.js around lines 84 to 93, the offset assignment uses
findIndex on fork_port_offsets with worker IDs, which can cause duplicate
offsets if IDs repeat. To fix this, update the offset tracking logic to store
and assign unique offsets rather than relying on worker IDs. Ensure that when
assigning offsets to workers, you remove or mark used offsets to prevent
duplicates, aligning this with previous fixes that store actual offsets instead
of IDs.
```js
        health_port: config.ENDPOINT_FORK_PORT_BASE + offset
    });
} catch (err) {
    dbg.warn(`Timeout: It took more than 5 minute to get a message from worker ${worker.id} do not send start server message`);
```
🛠️ Refactor suggestion
Misleading error message and insufficient error handling
The error message mentions "5 minute" timeout but should use the configurable timeout value. Also, the catch block only logs a warning when worker communication fails - this could leave the fork in an inconsistent state.
Improve error handling and message accuracy:
```diff
 } catch (err) {
-    dbg.warn(`Timeout: It took more than 5 minute to get a message from worker ${worker.id} do not send start server message`);
+    dbg.error(`Failed to send fork server message to worker ${worker.id}: ${err.message}`);
+    // Consider killing the worker if communication fails
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
} catch (err) {
    dbg.error(`Failed to send fork server message to worker ${worker.id}: ${err.message}`);
    // Consider killing the worker if communication fails
}
```
🤖 Prompt for AI Agents
In src/util/fork_utils.js at line 155, update the timeout warning message to use
the actual configurable timeout value instead of the hardcoded "5 minute" text
for accuracy. Additionally, enhance the catch block to handle worker
communication failures more robustly by adding proper error handling logic
beyond just logging a warning, such as cleaning up or resetting the fork state
to avoid inconsistencies.
Describe the Problem
Since we can't predict how Node forwards connections between forks, we can't make sure health requests reach all forks. Currently a retry mechanism is used; this is unreliable and might fail in certain cases. Also, under heavy system load health requests can time out. Both cases cause health to detect the system as being down, which in turn triggers a failover in CES systems. Add a dedicated server for each fork for health requests.
Explain the Changes
disable_service_validation: this value is to enable use for testing, both locally and on CI.

How the mechanism for distributing the different ports to each fork works:
Each fork is assigned a unique port based on its index offset from a base port. To allow a new fork to reuse the same port as a dead one, the primary process maintains a list of port offsets. When spawning a fork, the primary first waits for a "ready" message from the fork (indicating it's ready to receive instructions). Only then does the primary send the assigned port. This ensures the fork doesn't miss the port assignment message. The fork then starts its server on the received port. It's crucial that the primary starts listening for the "ready" message immediately after spawning to avoid missing it.
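For illustration, a minimal sketch of that handshake using Node's cluster module (the message field names follow the description above; the worker-side helper is a placeholder, and the real logic in fork_utils.js/endpoint.js differs in details):

```js
const cluster = require('node:cluster');
const BASE_PORT = 6002; // stands in for config.ENDPOINT_FORK_PORT_BASE

if (cluster.isPrimary) {
    for (let offset = 0; offset < 4; ++offset) {
        const worker = cluster.fork();
        // Listen immediately after forking so the "ready" message is never missed.
        worker.on('message', msg => {
            if (msg.ready_to_start_fork_server) {
                worker.send({ health_port: BASE_PORT + offset });
            }
        });
    }
} else {
    // Worker: register the listener first, then announce readiness and wait for the assigned port.
    process.on('message', msg => {
        if (msg.health_port) start_fork_health_server(msg.health_port); // placeholder helper
    });
    process.send({ ready_to_start_fork_server: true });
}
```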
Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
sudo NC_CORETEST=true node ./node_modules/mocha/bin/mocha src/test/unit_tests/test_nc_with_a_couple_of_forks.js