Calculate s3 storage #8789

MichaelBuessemeyer · 2025-07-21T12:21:47Z

URL of deployed dev instance (used for testing):

https://___.webknossos.xyz

Steps to test:

abc

TODOs:

test local file system
test GCS
test S3
Bulk DS requests to complete attachment paths & discuss whether that's the correct direction

Issues:

fixes Calculate dataset storage for datasets on s3 #8414

(Please delete unneeded items, merge only when none are left open)

Added changelog entry (create a $PR_NUMBER.md file in unreleased_changes or use ./tools/create-changelog-entry.py)
Added migration guide entry if applicable (edit the same file as for the changelog)
Updated documentation if applicable
Adapted wk-libs python client if relevant API parts change
Removed dev-only changes like prints and application.conf edits
Considered common edge cases
Needs datastore update after deployment

coderabbitai · 2025-07-21T12:21:54Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

MichaelBuessemeyer

During testing I noticed that measuring the space used by a dataset stored on GCS takes a long time because of the dataset's sheer size. IMO we should discuss how to handle this.

MichaelBuessemeyer · 2025-08-21T09:23:37Z

app/models/organization/Organization.scala

+    // Left for mags, right for attachments
+    _artifactId: Either[ObjectId, ObjectId],


This was a design decision I made initially. It is also possible to split this into two Option fields. What do you prefer? I mean the following as an alternative:

Suggested change

-    // Left for mags, right for attachments
-    _artifactId: Either[ObjectId, ObjectId],
+    _magId: Option[ObjectId],
+    _attachmentId: Option[ObjectId],
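To make the trade-off between the two representations concrete, here is a minimal sketch. The `ObjectId` case class, the entry class names, and the conversion helpers are hypothetical stand-ins, not the project's actual types. It illustrates that the `Either` form can always be turned into the two-`Option` form, while the reverse has to reject the "both set" and "both empty" states.

```scala
object ArtifactIdSketch {
  // Hypothetical stand-in for the project's ObjectId type.
  final case class ObjectId(id: String)

  // Variant A: one field, Left = mag id, Right = attachment id.
  final case class EntryWithEither(artifactId: Either[ObjectId, ObjectId])

  // Variant B: two optional fields, mirroring two nullable columns.
  final case class EntryWithOptions(magId: Option[ObjectId], attachmentId: Option[ObjectId])

  // A -> B always succeeds: exactly one of the two options is filled.
  def toOptions(e: EntryWithEither): EntryWithOptions =
    e.artifactId match {
      case Left(magId)         => EntryWithOptions(Some(magId), None)
      case Right(attachmentId) => EntryWithOptions(None, Some(attachmentId))
    }

  // B -> A must reject the two states that Either cannot represent.
  def toEither(e: EntryWithOptions): Option[EntryWithEither] =
    (e.magId, e.attachmentId) match {
      case (Some(magId), None)        => Some(EntryWithEither(Left(magId)))
      case (None, Some(attachmentId)) => Some(EntryWithEither(Right(attachmentId)))
      case _                          => None // both set or both empty
    }
}
```

The `Either` variant encodes "exactly one id" in the type, whereas the two-`Option` variant maps more directly onto two nullable columns but needs that invariant checked elsewhere.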

MichaelBuessemeyer · 2025-08-21T09:28:02Z

conf/application.conf

-  fetchUsedStorage {
-    rescanInterval = 24 hours # do not scan organizations whose last scan is more recent than this
-    tickerInterval = 10 minutes # scan some organizations at each tick
+  fetchUsedStorage { # TODOM: undo these changes
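For context, the two settings in the removed lines describe a selection rule: skip organizations scanned within `rescanInterval`, and scan only some organizations per tick. A rough sketch of that rule follows. `OrgScanState`, `dueForScan`, and the `maxPerTick` parameter are assumptions for illustration; the actual batch size and ordering used by the backend are not shown in this PR.

```scala
import java.time.{Duration, Instant}

object RescanSelectionSketch {
  // Hypothetical record: when an organization was last scanned, if ever.
  final case class OrgScanState(organizationId: String, lastScan: Option[Instant])

  // Pick the organizations to scan at one tick:
  // - never-scanned organizations and those last scanned before (now - rescanInterval) qualify,
  // - the longest-waiting ones go first,
  // - at most maxPerTick are returned (assumed batch size).
  def dueForScan(orgs: Seq[OrgScanState],
                 now: Instant,
                 rescanInterval: Duration,
                 maxPerTick: Int): Seq[OrgScanState] = {
    val threshold = now.minus(rescanInterval)
    orgs
      .filter(_.lastScan.forall(_.isBefore(threshold)))
      .sortBy(_.lastScan.map(_.toEpochMilli).getOrElse(Long.MinValue))
      .take(maxPerTick)
  }
}
```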


MichaelBuessemeyer · 2025-08-21T09:29:29Z

tools/postgres/schema.sql

+-- ObjectId generation function taken and modified from https://thinhdanggroup.github.io/mongo-id-in-postgresql/
+CREATE SEQUENCE webknossos.objectid_sequence;
+
+CREATE FUNCTION webknossos.generate_object_id() RETURNS TEXT AS $$
+DECLARE
+  time_component TEXT;
+  machine_id TEXT;
+  process_id TEXT;
+  counter TEXT;
+  result TEXT;
+BEGIN
+  -- Extract the current timestamp in seconds since the Unix epoch (4 bytes, 8 hex chars)
+  SELECT LPAD(TO_HEX(FLOOR(EXTRACT(EPOCH FROM clock_timestamp()))::BIGINT), 8, '0') INTO time_component;
+  -- Generate a machine identifier using the hash of the server IP (3 bytes, 6 hex chars)
+  SELECT SUBSTRING(md5(CAST(inet_server_addr() AS TEXT)) FROM 1 FOR 6) INTO machine_id;
+  -- Retrieve the current backend process ID, limited to 2 bytes (4 hex chars)
+  SELECT LPAD(TO_HEX(pg_backend_pid() % 65536), 4, '0') INTO process_id;
+  -- Generate a counter using a sequence, ensuring it's 3 bytes (6 hex chars)
+  SELECT LPAD(TO_HEX(nextval('webknossos.objectid_sequence')::BIGINT % 16777216), 6, '0') INTO counter;
+  -- Concatenate all parts to form a 24-character ObjectId
+  result := time_component || machine_id || process_id || counter;
+
+  RETURN result;
+END;
+$$ LANGUAGE plpgsql;
+


Moved up so that it is available earlier in the file/schema and can be used in, e.g., organization_usedStorage.
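The layout produced by the SQL function above (8 + 6 + 4 + 6 = 24 hex characters) can be mirrored in Scala to sanity-check the widths. This is only a sketch: `ObjectIdLayoutSketch` and `compose` are hypothetical names, the inputs are passed in explicitly instead of being read from the database, and `machineHex` is assumed to be a hex string of at least 6 characters (as an md5 digest would be).

```scala
object ObjectIdLayoutSketch {
  // Mimics LPAD(TO_HEX(value), width, '0') for non-negative values:
  // pad with zeros on the left, or cut off on the right if already too long.
  private def hex(value: Long, width: Int): String = {
    val s = java.lang.Long.toHexString(value)
    if (s.length >= width) s.take(width) else ("0" * (width - s.length)) + s
  }

  def compose(epochSeconds: Long, machineHex: String, processId: Int, sequenceValue: Long): String = {
    val time    = hex(epochSeconds, 8)              // 4 bytes, 8 hex chars
    val machine = machineHex.take(6)                // 3 bytes, 6 hex chars
    val process = hex(processId % 65536, 4)         // 2 bytes, 4 hex chars
    val counter = hex(sequenceValue % 16777216L, 6) // 3 bytes, 6 hex chars
    time + machine + process + counter
  }
}
```

Because the counter wraps at 16777216, uniqueness within one second and one backend process holds only as long as fewer than that many ids are generated in that second.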

MichaelBuessemeyer · 2025-08-21T11:27:29Z

app/controllers/InitialDataController.scala

+    DataStore(conf.Datastore.name,
+              conf.Http.uri,
+              conf.Datastore.publicUri.getOrElse(conf.Http.uri),
+              conf.Datastore.key,
+              reportUsedStorageEnabled = true) // TODOM: Undo this


MichaelBuessemeyer

Next major TODO: split the used storage table into two, one for mags and one for attachments. Using the ids of mags and attachments is not ideal, as these are regularly recreated (with new ids; introduced in this PR).

MichaelBuessemeyer · 2025-08-25T12:11:51Z

app/models/organization/Organization.scala

+      r._artifactId match {
+        case Left(magId) =>
+          q"""
+          INSERT INTO webknossos.organization_usedStorage (
+            _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
+          )
+          VALUES (${organizationId}, ${r._datasetId}, ${magId}, NULL, ${r.path}, ${r.usedStorageBytes}, NOW())
+          ON CONFLICT ON CONSTRAINT unique_dataset_mag
+          DO UPDATE SET
+            path = EXCLUDED.path,
+            usedStorageBytes = EXCLUDED.usedStorageBytes,
+            lastUpdated = EXCLUDED.lastUpdated;
+          """.asUpdate
+        case Right(attachmentId) =>
+          q"""
+          INSERT INTO webknossos.organization_usedStorage (
+            _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
+          ) -- TODO: test why no s3 test dataset is included
+          VALUES (${organizationId}, ${r._datasetId}, NULL, ${attachmentId}, ${r.path}, ${r.usedStorageBytes}, NOW())
+          ON CONFLICT ON CONSTRAINT unique_layer_attachment
+          DO UPDATE SET
+            path = EXCLUDED.path,
+            usedStorageBytes = EXCLUDED.usedStorageBytes,
+            lastUpdated = EXCLUDED.lastUpdated;
+          """.asUpdate
+      }
+    }


TODO @me: make this DRYer.
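One possible direction for the deduplication, sketched with plain strings: the two branches differ only in which id column is filled and which constraint is named, so those parts can be derived from the `Either` once. `UpsertTarget`, `upsertTarget`, and `upsertSql` are hypothetical names, `ObjectId` is a stand-in, and `?` placeholders replace the project's `q"..."` interpolator, whose handling of `Option` values and spliced identifiers is not shown here.

```scala
object UsedStorageUpsertSketch {
  // Hypothetical stand-in for the project's ObjectId type.
  final case class ObjectId(id: String)

  // The only parts that differ between the two branches of the match.
  final case class UpsertTarget(magId: Option[ObjectId],
                                attachmentId: Option[ObjectId],
                                conflictConstraint: String)

  def upsertTarget(artifactId: Either[ObjectId, ObjectId]): UpsertTarget =
    artifactId match {
      case Left(magId)         => UpsertTarget(Some(magId), None, "unique_dataset_mag")
      case Right(attachmentId) => UpsertTarget(None, Some(attachmentId), "unique_layer_attachment")
    }

  // Shared statement text. The constraint name comes from the fixed set above,
  // never from user input, so splicing it into the string is safe in this sketch.
  def upsertSql(target: UpsertTarget): String =
    s"""INSERT INTO webknossos.organization_usedStorage (
       |  _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
       |)
       |VALUES (?, ?, ?, ?, ?, ?, NOW())
       |ON CONFLICT ON CONSTRAINT ${target.conflictConstraint}
       |DO UPDATE SET
       |  path = EXCLUDED.path,
       |  usedStorageBytes = EXCLUDED.usedStorageBytes,
       |  lastUpdated = EXCLUDED.lastUpdated;""".stripMargin
}
```

The constraint name is an identifier rather than a value, so it cannot be passed as a bind parameter and has to be spliced into the statement text.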

frcroth and others added 28 commits June 25, 2025 09:54

Explore remote datasets as virtual datasets

9f89d38

Do not have virtual remote datasets deleted

42101a9

Put mag in db

18dfe98

Add temporary front end for testing virtual datasets

9c3cf74

Use mags for WKW datasets

391227a

Merge branch 'master' into explore-virtual-datasets

916542d

Move zarr streaming stuff to service, todo: add controller with datasetid routes

3b3b13c

Move old zarr routes to LegacyController, update zarr routes to use id

3f81a85

Use datasetId in BinaryDataController

ac0f66d

Agglomerate files by dataset id

d51dea9

Merge branch 'master' into explore-virtual-datasets

371f3fb

Update more routes to use dataset id

611e552

Disable deletion route on virtual datasets for now

a4aaff4

Merge branch 'master' into explore-virtual-datasets

677c8fe

Use datasetId for connectome routes

5b220ac

Move compose to webknossos

0fc1834

Merge branch 'master' into explore-virtual-datasets

6e27ba5

Fix WKW dataset mags being lost in parsing

b1797fc

Merge branch 'master' into explore-virtual-datasets

72de557

Adapt RemoteFallbackLayer to use datasetIds

ffdb99f

Add 'isVirtual' column to datasets

f4c2c0c

Remove usages of datasource id in rest api

f4ec53f

WIP add getting used storage bytes for remote datasets

6654d1e

WIP more adding getting used storage bytes for remote datasets

d9fb2f3

WIP measure storage

bed0122

fix usedstorage schema and path mapping

092da3a

Add filtering for remote mag which are part of registered datavaults

96e1451

update used storage dataset wise

72805db

MichaelBuessemeyer self-assigned this Jul 21, 2025

MichaelBuessemeyer added backend new feature labels Jul 21, 2025

MichaelBuessemeyer added 14 commits July 21, 2025 14:23

Merge branch 'master' of github.com:scalableminds/webknossos into calculate-s3-storage

e5c5c84

Merge branch 'master' of github.com:scalableminds/webknossos into calculate-s3-storage

179c4c9

WIP fix errors caused by merge

77f1677

Clean up messed up merge

db75710

Merge branch 'master' of github.com:scalableminds/webknossos into calculate-s3-storage

81d7b97

fix storage scanning for attachments with relative paths

203d257

format backend

805ddd1

try fix schema & evolution & reversion

186bdcc

fix schema

1c31c60

add changelog entry

19d05b7

fix some queries and e2e db data csv

9e3b7de

remove unused file accidantilly kept during merge

3b637ad

remove outdated outcommented code from old storage measuring mechanism

2ee0ad2

remove outdated comment

dd04d37

MichaelBuessemeyer commented Aug 21, 2025

MichaelBuessemeyer added 4 commits August 21, 2025 15:31

normalize manually resolved attachment paths

5357917

remove outdated comment

9a435f3

remove outdated comment

ae59726

Merge branch 'master' of github.com:scalableminds/webknossos into calculate-s3-storage

8d839fc

MichaelBuessemeyer commented Aug 26, 2025
