Skip to content

system store - reduce load time (part of performance effort 4.20) #9096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

alphaprinz
Copy link
Contributor

@alphaprinz alphaprinz commented Jun 12, 2025

Describe the Problem

Reduce system_store.load() cpu load.

Explain the Changes

  1. REMOVED - Introduce "fresh enough" - if load was performed recently, don't load again.
  2. Perform db queries concurrently (instead of sequentially).
  3. REMOVED - Reduce sempahore's block
  4. Remove publish changes from hb and md_aggregator.

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

Summary by CodeRabbit

Here are the updated release notes:

  • New Features
    • Added a new configuration option to control the concurrency level when reading data from the database.
  • Improvements
    • Optimized the data loading process by limiting the number of parallel database collection reads.
    • Implemented a periodic refresh mechanism to ensure the system store data is up-to-date.
    • Enhanced control over system updates by allowing selective suppression of cluster-wide notifications.

Copy link

coderabbitai bot commented Jun 12, 2025

Walkthrough

The changes in this pull request introduce a new configuration constant SYSTEM_STORE_LOAD_CONCURRENCY to limit concurrency during database reads in the SystemStore class. The refresh() method is modified to schedule periodic refreshes. Calls to make_changes() in background services now include an optional parameter to suppress cluster-wide notifications.

Changes

Files/Groups Change Summary
config.js Added SYSTEM_STORE_LOAD_CONCURRENCY constant initialized from environment variable with default 5.
src/server/system_services/system_store.js Added SYSTEM_STORE_LOAD_CONCURRENCY instance property; limited DB read concurrency in _read_new_data_from_db; modified refresh() to schedule periodic calls; added optional publish parameter to make_changes().
src/server/bg_services/cluster_hb.js and src/server/bg_services/md_aggregator.js Updated calls to system_store.make_changes() to pass false as second argument to suppress notifications.

Sequence Diagram(s)

sequenceDiagram
    participant SystemStore
    participant Database

    SystemStore->>Database: Read new data (limit concurrency by SYSTEM_STORE_LOAD_CONCURRENCY)
    Database-->>SystemStore: Return data
    SystemStore->>SystemStore: Schedule periodic refresh calls
Loading

The sequence diagram shows the SystemStore reading data from the database with controlled concurrency and scheduling periodic refreshes to keep data updated.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

npm error Exit handler never called!
npm error This is an error with npm itself. Please report this error at:
npm error https://github.com/npm/cli/issues
npm error A complete log of this run can be found in: /.npm/_logs/2025-06-25T17_21_21_688Z-debug-0.log


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b2fe6d and 36b9892.

📒 Files selected for processing (4)
  • config.js (2 hunks)
  • src/server/bg_services/cluster_hb.js (1 hunks)
  • src/server/bg_services/md_aggregator.js (3 hunks)
  • src/server/system_services/system_store.js (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/server/bg_services/cluster_hb.js
  • config.js
  • src/server/bg_services/md_aggregator.js
  • src/server/system_services/system_store.js
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/server/system_services/system_store.js (1)

439-444: Minor: expensive debug payload serialisation

JSON.stringify(new_data) and util.inspect on large datasets allocate tens of MB and block the event-loop even when log level is low.
Consider wrapping both calls with the same dbg.should_log(1) guard to avoid the work entirely.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c247ca1 and 6d4e3a1.

📒 Files selected for processing (2)
  • config.js (1 hunks)
  • src/server/system_services/system_store.js (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build-noobaa-image
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: run-package-lock-validation
🔇 Additional comments (3)
src/server/system_services/system_store.js (3)

354-356: Constructor: keep the config values but fail fast on nonsense input

config.LOAD_FRESH_THRESHOLD and config.READ_NEW_DATA_CONCURRENCY are already integers.
Still, enforce basic sanity ( > 0 ) to avoid later surprises.

-this.LOAD_FRESH_THRESHOLD = config.LOAD_FRESH_THRESHOLD || 100;
-this.READ_NEW_DATA_CONCURRENCY = config.READ_NEW_DATA_CONCURRENCY || 5;
+this.LOAD_FRESH_THRESHOLD = Math.max(1, config.LOAD_FRESH_THRESHOLD);
+this.READ_NEW_DATA_CONCURRENCY = Math.max(1, config.READ_NEW_DATA_CONCURRENCY);

418-419: Good move: instantiate SystemStoreData once per load

Creating new_data outside the semaphore saves redundant allocations in contended scenarios.


520-520:

✅ Verification successful

Verify P.map_with_concurrency exists – otherwise this is a breaking change

P.map_with_concurrency is not part of the standard bluebird or native Promise helpers.
If the local util/promise never exported it, this will throw at runtime.

If not found, either add the helper or fall back to P.map with explicit throttling.


🏁 Script executed:

#!/bin/bash
# Check whether the helper is defined anywhere in the repo
rg --no-heading -n "function .*map_with_concurrency" | head

Length of output: 146


Verified helper implementation: map_with_concurrency is defined in src/util/promise.js (line 52), so this call is safe and does not introduce a breaking change.

config.js Outdated
Comment on lines 253 to 254
config.LOAD_FRESH_THRESHOLD = parseInt(process.env.LOAD_FRESH_THRESHOLD, 10) || 100;
config.READ_NEW_DATA_CONCURRENCY = parseInt(process.env.READ_NEW_DATA_CONCURRENCY, 10) || 5;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Guard against non-positive or NaN values when parsing the new tunables

parseInt() returns NaN for empty/non-numeric input, and users might also pass 0 or negative numbers by mistake.
Relying on the || fallback keeps the default safe, but if a caller explicitly sets READ_NEW_DATA_CONCURRENCY=0 or <0, the code will silently fall back to 5, hiding the mis-configuration. Make the validation explicit and fail early.

-config.LOAD_FRESH_THRESHOLD = parseInt(process.env.LOAD_FRESH_THRESHOLD, 10) || 100;
-config.READ_NEW_DATA_CONCURRENCY = parseInt(process.env.READ_NEW_DATA_CONCURRENCY, 10) || 5;
+const _freshRaw = parseInt(process.env.LOAD_FRESH_THRESHOLD, 10);
+config.LOAD_FRESH_THRESHOLD = Number.isFinite(_freshRaw) && _freshRaw > 0 ? _freshRaw : 100;
+
+const _concurrencyRaw = parseInt(process.env.READ_NEW_DATA_CONCURRENCY, 10);
+config.READ_NEW_DATA_CONCURRENCY =
+    Number.isFinite(_concurrencyRaw) && _concurrencyRaw > 0 ? _concurrencyRaw : 5;

This makes wrong inputs visible instead of being silently ignored.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
config.LOAD_FRESH_THRESHOLD = parseInt(process.env.LOAD_FRESH_THRESHOLD, 10) || 100;
config.READ_NEW_DATA_CONCURRENCY = parseInt(process.env.READ_NEW_DATA_CONCURRENCY, 10) || 5;
const _freshRaw = parseInt(process.env.LOAD_FRESH_THRESHOLD, 10);
config.LOAD_FRESH_THRESHOLD = Number.isFinite(_freshRaw) && _freshRaw > 0 ? _freshRaw : 100;
const _concurrencyRaw = parseInt(process.env.READ_NEW_DATA_CONCURRENCY, 10);
config.READ_NEW_DATA_CONCURRENCY =
Number.isFinite(_concurrencyRaw) && _concurrencyRaw > 0 ? _concurrencyRaw : 5;
🤖 Prompt for AI Agents
In config.js around lines 253 to 255, the code uses parseInt with a fallback
that silently ignores invalid, zero, or negative values for LOAD_FRESH_THRESHOLD
and READ_NEW_DATA_CONCURRENCY. To fix this, add explicit validation after
parsing to check if the values are positive integers; if not, throw an error or
exit early to fail fast and make misconfigurations visible instead of defaulting
silently.

@alphaprinz alphaprinz force-pushed the system_store_load branch 3 times, most recently from 38341fc to 50e4d2c Compare June 19, 2025 05:05
@pull-request-size pull-request-size bot added size/S and removed size/M labels Jun 19, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/server/system_services/system_store.js (1)

415-426: Address the past review concern about fresh-skip logic.

The past review comment identified a critical issue where the fresh-skip logic ignores the since parameter, potentially dropping updates when a caller requests data with an older timestamp but the store is considered "fresh".

The concern raised in the past review about the fresh-skip logic ignoring the since hint is still valid and should be addressed. The current implementation could skip necessary reloads when since < last_update_time but the data is considered fresh according to the threshold.

🧹 Nitpick comments (1)
src/server/system_services/system_store.js (1)

363-364: Review the periodic refresh strategy for potential interference.

The periodic refresh using setInterval ensures data freshness, but consider potential issues:

  1. The refresh runs every 10 minutes regardless of system activity
  2. It might interfere with explicit load operations or cause unnecessary database load during low-activity periods
  3. The arrow function binding might not preserve the correct this context

Consider using a more sophisticated refresh strategy:

-        setInterval(this.refresh, this.START_REFRESH_THRESHOLD);
+        setInterval(() => this.refresh(), this.START_REFRESH_THRESHOLD);

And potentially add activity-based refresh control to avoid unnecessary refreshes during idle periods.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50e4d2c and 5c2e8fb.

📒 Files selected for processing (4)
  • config.js (1 hunks)
  • src/server/bg_services/cluster_hb.js (1 hunks)
  • src/server/bg_services/md_aggregator.js (3 hunks)
  • src/server/system_services/system_store.js (6 hunks)
✅ Files skipped from review due to trivial changes (1)
  • config.js
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: run-jest-unit-tests
  • GitHub Check: Build Noobaa Image
  • GitHub Check: run-package-lock-validation
  • GitHub Check: Summary
🔇 Additional comments (9)
src/server/bg_services/cluster_hb.js (1)

96-96: LGTM - Performance optimization for heartbeat updates.

Disabling notifications for heartbeat updates reduces unnecessary cluster propagation overhead. This is appropriate since heartbeat data is typically consumed locally and doesn't require immediate cluster-wide synchronization.

src/server/bg_services/md_aggregator.js (3)

71-71: LGTM - Performance optimization for storage stats updates.

Disabling notifications for incremental storage statistics updates is appropriate since these are internal maintenance operations that don't require immediate cluster-wide propagation.


81-81: LGTM - Consistent with batch processing approach.

Disabling notifications for the global last update timestamp aligns with the incremental processing strategy where notifications are handled separately from the update batching.


209-209: Verify notification strategy for time skew reset operations.

While disabling notifications here aligns with the performance optimization, the time skew reset operation resets all storage statistics to NOOBAA_EPOCH. This is a significant system state change that might benefit from immediate cluster propagation to ensure consistency.

Consider whether this critical recovery operation should retain notifications or if the delayed propagation is acceptable for your use case.

src/server/system_services/system_store.js (5)

354-354: LGTM - Configurable concurrency control for database load balancing.

Adding configurable concurrency control helps prevent database overload during system store loading. The default value of 5 seems reasonable for most deployments.


515-515: LGTM - Concurrency control for database collection reads.

Replacing P.map with P.map_with_concurrency using the configured limit prevents overwhelming the database during incremental data loading. This addresses the performance objective of optimizing concurrent database queries.


607-607: LGTM - Flexible notification control for performance optimization.

Adding the publish parameter with a default value of true maintains backward compatibility while allowing callers to disable notifications for performance-sensitive operations. This supports the performance optimization goals.


617-617: LGTM - Conditional notification publishing implementation.

The conditional check properly implements the notification control based on the publish parameter, maintaining the existing behavior when notifications are enabled.


433-439: Enhanced logging provides valuable performance insights.

The detailed logging with conditional debug level checks helps monitor system store performance without impacting production performance when debugging is disabled.

@alphaprinz alphaprinz marked this pull request as draft June 24, 2025 22:28
@alphaprinz alphaprinz marked this pull request as ready for review June 25, 2025 16:43
Signed-off-by: Amit Prinz Setter <alphaprinz@gmail.com>
}
//call refresh periodically
P.delay_unblocking(this.START_REFRESH_THRESHOLD).then(this.refresh);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can cause a lot of refresh() calls to be queued. I believe that refresh() is called on every RPC request as a middleware (see here), so I don't think we should schedule another refresh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants