@MichaelBuessemeyer MichaelBuessemeyer commented Jul 21, 2025

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz

Steps to test:

  • abc

TODOs:

  • test local file system
  • test GCS
  • test S3
  • Bulk DS requests to complete attachment paths & discuss whether that's the correct direction

Issues:


(Please delete unneeded items, merge only when none are left open)

  • Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)
  • Added migration guide entry if applicable (edit the same file as for the changelog)
  • Updated documentation if applicable
  • Adapted wk-libs python client if relevant API parts change
  • Removed dev-only changes like prints and application.conf edits
  • Considered common edge cases
  • Needs datastore update after deployment

frcroth and others added 28 commits June 25, 2025 09:54

coderabbitai bot commented Jul 21, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@MichaelBuessemeyer MichaelBuessemeyer self-assigned this Jul 21, 2025

@MichaelBuessemeyer MichaelBuessemeyer left a comment

During testing, I noticed that measuring the space used by a GCS-stored dataset takes a long time because of its sheer size. IMO we should discuss how to handle this.

Comment on lines +38 to +39
// Left for mags, right for attachments
_artifactId: Either[ObjectId, ObjectId],

This was my initial design decision. Alternatively, the field could be split into two options, as in the suggestion below. Which do you prefer?

Suggested change
- // Left for mags, right for attachments
- _artifactId: Either[ObjectId, ObjectId],
+ _magId: Option[ObjectId],
+ _attachmentId: Option[ObjectId],
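
To make the trade-off between the two shapes concrete, here is a minimal, self-contained sketch. `ObjectId`, `EntryEither`, and `EntryOptions` are simplified, hypothetical stand-ins and not the project's actual classes.

```scala
object ArtifactIdSketch {
  // Hypothetical stand-in for the project's ObjectId type.
  final case class ObjectId(id: String)

  // Variant A (as in the PR): exactly one ID; Left = mag, Right = attachment.
  final case class EntryEither(artifactId: Either[ObjectId, ObjectId])

  // Variant B (the suggested alternative): two optional IDs.
  final case class EntryOptions(magId: Option[ObjectId], attachmentId: Option[ObjectId])

  // Going from A to B always succeeds.
  def toOptions(entry: EntryEither): EntryOptions = entry.artifactId match {
    case Left(magId)         => EntryOptions(Some(magId), None)
    case Right(attachmentId) => EntryOptions(None, Some(attachmentId))
  }

  // Going from B to A can fail: B also admits "both set" and "neither set".
  def toEither(entry: EntryOptions): Option[EntryEither] =
    (entry.magId, entry.attachmentId) match {
      case (Some(magId), None)        => Some(EntryEither(Left(magId)))
      case (None, Some(attachmentId)) => Some(EntryEither(Right(attachmentId)))
      case _                          => None
    }
}
```

In this sketch the `Either` variant guarantees exactly one ID per entry, while the two-`Option` variant mirrors two nullable columns more directly but would need an extra validity check.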

fetchUsedStorage {
rescanInterval = 24 hours # do not scan organizations whose last scan is more recent than this
tickerInterval = 10 minutes # scan some organizations at each tick
fetchUsedStorage { # TODOM: undo these changes

TODO
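
For reference, the `rescanInterval` comment in the config excerpt above can be read as the following check. This is only a sketch of that reading; `isDueForRescan` is a hypothetical helper, and treating "exactly one interval ago" as due is an assumption.

```scala
import java.time.Instant
import scala.concurrent.duration._

object RescanSketch {
  // An organization is due if it was never scanned, or if its last scan
  // is at least `rescanInterval` old (boundary handling is an assumption).
  def isDueForRescan(lastScan: Option[Instant], now: Instant, rescanInterval: FiniteDuration): Boolean =
    lastScan.forall(scan => now.toEpochMilli - scan.toEpochMilli >= rescanInterval.toMillis)
}
```

Under this reading, each tick (every `tickerInterval`) would only consider organizations for which the check returns true.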

Comment on lines +27 to +52
-- ObjectId generation function taken and modified from https://thinhdanggroup.github.io/mongo-id-in-postgresql/
CREATE SEQUENCE webknossos.objectid_sequence;

CREATE FUNCTION webknossos.generate_object_id() RETURNS TEXT AS $$
DECLARE
  time_component TEXT;
  machine_id TEXT;
  process_id TEXT;
  counter TEXT;
  result TEXT;
BEGIN
  -- Extract the current timestamp in seconds since the Unix epoch (4 bytes, 8 hex chars)
  SELECT LPAD(TO_HEX(FLOOR(EXTRACT(EPOCH FROM clock_timestamp()))::BIGINT), 8, '0') INTO time_component;
  -- Generate a machine identifier using the hash of the server IP (3 bytes, 6 hex chars)
  SELECT SUBSTRING(md5(CAST(inet_server_addr() AS TEXT)) FROM 1 FOR 6) INTO machine_id;
  -- Retrieve the current backend process ID, limited to 2 bytes (4 hex chars)
  SELECT LPAD(TO_HEX(pg_backend_pid() % 65536), 4, '0') INTO process_id;
  -- Generate a counter using a sequence, ensuring it's 3 bytes (6 hex chars)
  SELECT LPAD(TO_HEX(nextval('webknossos.objectid_sequence')::BIGINT % 16777216), 6, '0') INTO counter;
  -- Concatenate all parts to form a 24-character ObjectId
  result := time_component || machine_id || process_id || counter;

  RETURN result;
END;
$$ LANGUAGE plpgsql;

Moved to have the type available earlier in the file / schema to use it in e.g. organization_usedStorage
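
The 24-character layout described in the SQL comments above (8 hex chars of time, 6 of machine hash, 4 of process ID, 6 of counter) can be illustrated in Scala. This is a rough sketch for illustration only: `generateObjectId` and its parameters are hypothetical, and the inputs that the SQL function reads from the server are passed in explicitly here.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ObjectIdLayoutSketch {
  // Lowercase hex MD5 of a string, analogous to md5(...) in the SQL function.
  private def md5Hex(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  def generateObjectId(epochSeconds: Long, serverAddr: String, backendPid: Int, sequenceValue: Long): String = {
    val timeComponent = f"$epochSeconds%08x"                 // 4 bytes, 8 hex chars
    val machineId     = md5Hex(serverAddr).take(6)           // 3 bytes, 6 hex chars
    val processId     = f"${backendPid % 65536}%04x"         // 2 bytes, 4 hex chars
    val counter       = f"${sequenceValue % 16777216L}%06x"  // 3 bytes, 6 hex chars
    timeComponent + machineId + processId + counter
  }
}
```

Because the counter wraps at 16777216, uniqueness in this scheme relies on the combination of all four components rather than on the sequence alone.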

Comment on lines +146 to +150
DataStore(conf.Datastore.name,
conf.Http.uri,
conf.Datastore.publicUri.getOrElse(conf.Http.uri),
conf.Datastore.key,
reportUsedStorageEnabled = true) // TODOM: Undo this

TODO

@MichaelBuessemeyer MichaelBuessemeyer left a comment

Next major TODO: Split the used storage table into two, one for mags and one for attachments. Relying on the IDs of mags and attachments is not ideal, as they are regularly recreated (with new IDs; introduced in this PR).

Comment on lines 186 to 212
  r._artifactId match {
    case Left(magId) =>
      q"""
        INSERT INTO webknossos.organization_usedStorage (
          _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
        )
        VALUES (${organizationId}, ${r._datasetId}, ${magId}, NULL, ${r.path}, ${r.usedStorageBytes}, NOW())
        ON CONFLICT ON CONSTRAINT unique_dataset_mag
        DO UPDATE SET
          path = EXCLUDED.path,
          usedStorageBytes = EXCLUDED.usedStorageBytes,
          lastUpdated = EXCLUDED.lastUpdated;
      """.asUpdate
    case Right(attachmentId) =>
      q"""
        INSERT INTO webknossos.organization_usedStorage (
          _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
        ) -- TODO: test why no s3 test dataset is included
        VALUES (${organizationId}, ${r._datasetId}, NULL, ${attachmentId}, ${r.path}, ${r.usedStorageBytes}, NOW())
        ON CONFLICT ON CONSTRAINT unique_layer_attachment
        DO UPDATE SET
          path = EXCLUDED.path,
          usedStorageBytes = EXCLUDED.usedStorageBytes,
          lastUpdated = EXCLUDED.lastUpdated;
      """.asUpdate
  }
}

TODO @me Make DRYer
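
One possible direction for this, sketched under assumptions: the two branches above differ only in which ID column is filled and which constraint is named, so those parts could be derived once and fed into a single statement. `UpsertTarget`, `upsertTarget`, and `upsertSql` are hypothetical names; the sketch uses plain strings and `?` placeholders instead of the project's `q"..."` interpolator, whose handling of `Option` values is not shown in the excerpt.

```scala
object UpsertSketch {
  // The only parts that differ between the mag and attachment branches.
  final case class UpsertTarget(magId: Option[String], attachmentId: Option[String], constraint: String)

  def upsertTarget(artifactId: Either[String, String]): UpsertTarget = artifactId match {
    case Left(magId)         => UpsertTarget(Some(magId), None, "unique_dataset_mag")
    case Right(attachmentId) => UpsertTarget(None, Some(attachmentId), "unique_layer_attachment")
  }

  // One statement template. Values stay bind parameters; the constraint name
  // cannot be a bind parameter, so it is spliced in from the two fixed literals above.
  def upsertSql(target: UpsertTarget): String =
    s"""INSERT INTO webknossos.organization_usedStorage (
       |  _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
       |)
       |VALUES (?, ?, ?, ?, ?, ?, NOW())
       |ON CONFLICT ON CONSTRAINT ${target.constraint}
       |DO UPDATE SET
       |  path = EXCLUDED.path,
       |  usedStorageBytes = EXCLUDED.usedStorageBytes,
       |  lastUpdated = EXCLUDED.lastUpdated;""".stripMargin
}
```

Whether this is actually simpler than the two explicit branches depends on how the real interpolator treats spliced identifiers, which would need checking.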

Development

Successfully merging this pull request may close these issues.

Calculate dataset storage for datasets on s3
2 participants