Calculate s3 storage #8789
base: master
…culate-s3-storage
During testing, I noticed that measuring the space used by a dataset stored on GCS takes a long time because of the dataset's sheer size. IMO we should discuss how to handle this.
// Left for mags, right for attachments
_artifactId: Either[ObjectId, ObjectId],
This was a design decision I initially made. It is also possible to split this into two options. What do you prefer?
I mean replacing
// Left for mags, right for attachments
_artifactId: Either[ObjectId, ObjectId],
with
_magId: Option[ObjectId],
_attachmentId: Option[ObjectId],
as an alternative.
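To make the trade-off between the two representations concrete, here is a minimal sketch. `ObjectId` is a hypothetical stand-in for the project's type, and `EntryWithEither`, `EntryWithOptions`, `toOptions` and `toEither` are illustrative names that do not appear in the PR.

```scala
// Hypothetical stand-in for the project's ObjectId type.
final case class ObjectId(id: String)

// Variant A: exactly one artifact id, Left for a mag, Right for an attachment.
final case class EntryWithEither(artifactId: Either[ObjectId, ObjectId])

// Variant B: two optional ids; the type does not prevent both or neither being set.
final case class EntryWithOptions(magId: Option[ObjectId], attachmentId: Option[ObjectId])

// Every Either-based entry maps to a valid Options-based entry.
def toOptions(e: EntryWithEither): EntryWithOptions =
  e.artifactId match {
    case Left(magId)         => EntryWithOptions(Some(magId), None)
    case Right(attachmentId) => EntryWithOptions(None, Some(attachmentId))
  }

// The reverse direction can fail, because Variant B admits invalid combinations.
def toEither(o: EntryWithOptions): Option[EntryWithEither] =
  (o.magId, o.attachmentId) match {
    case (Some(magId), None)        => Some(EntryWithEither(Left(magId)))
    case (None, Some(attachmentId)) => Some(EntryWithEither(Right(attachmentId)))
    case _                          => None
  }
```

The `Either` variant rules out the "both set" and "neither set" states at compile time. The two-`Option` variant names each field explicitly and mirrors the two nullable columns, but it would need a runtime check like `toEither`.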
fetchUsedStorage {
  rescanInterval = 24 hours # do not scan organizations whose last scan is more recent than this
  tickerInterval = 10 minutes # scan some organizations at each tick
fetchUsedStorage { # TODOM: undo these changes
TODO
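For readers unfamiliar with the two settings, the following sketch shows how a rescan interval could be used to select organizations at each tick. It is an assumption about the selection logic based only on the config comments. `OrgScanState` and `dueForScan` are hypothetical names, not the project's implementation.

```scala
import java.time.Instant
import scala.concurrent.duration._

// Hypothetical record; the real types are not shown in this PR excerpt.
final case class OrgScanState(organizationId: String, lastScan: Option[Instant])

// An organization is due when it was never scanned or its last scan
// is older than rescanInterval.
def dueForScan(orgs: Seq[OrgScanState], now: Instant, rescanInterval: FiniteDuration): Seq[OrgScanState] = {
  val cutoff = now.minusMillis(rescanInterval.toMillis)
  orgs.filter(_.lastScan.forall(_.isBefore(cutoff)))
}
```

A ticker firing every `tickerInterval` would call something like `dueForScan(orgs, Instant.now(), 24.hours)` and scan a subset of the result.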
-- ObjectId generation function taken and modified from https://thinhdanggroup.github.io/mongo-id-in-postgresql/
CREATE SEQUENCE webknossos.objectid_sequence;

CREATE FUNCTION webknossos.generate_object_id() RETURNS TEXT AS $$
DECLARE
  time_component TEXT;
  machine_id TEXT;
  process_id TEXT;
  counter TEXT;
  result TEXT;
BEGIN
  -- Extract the current timestamp in seconds since the Unix epoch (4 bytes, 8 hex chars)
  SELECT LPAD(TO_HEX(FLOOR(EXTRACT(EPOCH FROM clock_timestamp()))::BIGINT), 8, '0') INTO time_component;
  -- Generate a machine identifier using the hash of the server IP (3 bytes, 6 hex chars)
  SELECT SUBSTRING(md5(CAST(inet_server_addr() AS TEXT)) FROM 1 FOR 6) INTO machine_id;
  -- Retrieve the current backend process ID, limited to 2 bytes (4 hex chars)
  SELECT LPAD(TO_HEX(pg_backend_pid() % 65536), 4, '0') INTO process_id;
  -- Generate a counter using a sequence, ensuring it's 3 bytes (6 hex chars)
  SELECT LPAD(TO_HEX(nextval('webknossos.objectid_sequence')::BIGINT % 16777216), 6, '0') INTO counter;
  -- Concatenate all parts to form a 24-character ObjectId
  result := time_component || machine_id || process_id || counter;

  RETURN result;
END;
$$ LANGUAGE plpgsql;
Moved this up so that it is available earlier in the schema file, e.g. for use in organization_usedStorage.
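The layout produced by the SQL function (8 + 6 + 4 + 6 hex characters) can be illustrated with a small Scala sketch. `composeObjectId` is a hypothetical helper for illustration only. It takes the four inputs as parameters instead of reading the clock, server address, backend PID and sequence.

```scala
// Mirrors the layout of the SQL function: 8 + 6 + 4 + 6 hex characters.
// machineHash is assumed to be a hex string of at least 6 characters (e.g. an md5 digest).
def composeObjectId(epochSeconds: Long, machineHash: String, processId: Int, counter: Long): String = {
  val timeComponent = f"$epochSeconds%08x"           // 4 bytes, 8 hex chars
  val machineId = machineHash.take(6)                // 3 bytes, 6 hex chars
  val processPart = f"${processId % 65536}%04x"      // 2 bytes, 4 hex chars
  val counterPart = f"${counter % 16777216L}%06x"    // 3 bytes, 6 hex chars
  timeComponent + machineId + processPart + counterPart
}
```

For example, `composeObjectId(1700000000L, "abcdef0123", 70000, 16777217L)` yields `6553f100abcdef1170000001`: the process id and counter wrap around at 2 and 3 bytes respectively, just as the modulo operations in the SQL do.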
DataStore(conf.Datastore.name,
          conf.Http.uri,
          conf.Datastore.publicUri.getOrElse(conf.Http.uri),
          conf.Datastore.key,
          reportUsedStorageEnabled = true) // TODOM: Undo this
TODO
Next major TODO: Split the used storage table into two, one for mags and one for attachments. Using the IDs of mags and attachments is not ideal, as they are regularly recreated (with new IDs; introduced in this PR).
    r._artifactId match {
      case Left(magId) =>
        q"""
          INSERT INTO webknossos.organization_usedStorage (
            _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
          )
          VALUES (${organizationId}, ${r._datasetId}, ${magId}, NULL, ${r.path}, ${r.usedStorageBytes}, NOW())
          ON CONFLICT ON CONSTRAINT unique_dataset_mag
          DO UPDATE SET
            path = EXCLUDED.path,
            usedStorageBytes = EXCLUDED.usedStorageBytes,
            lastUpdated = EXCLUDED.lastUpdated;
        """.asUpdate
      case Right(attachmentId) =>
        q"""
          INSERT INTO webknossos.organization_usedStorage (
            _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
          ) -- TODO: test why no s3 test dataset is included
          VALUES (${organizationId}, ${r._datasetId}, NULL, ${attachmentId}, ${r.path}, ${r.usedStorageBytes}, NOW())
          ON CONFLICT ON CONSTRAINT unique_layer_attachment
          DO UPDATE SET
            path = EXCLUDED.path,
            usedStorageBytes = EXCLUDED.usedStorageBytes,
            lastUpdated = EXCLUDED.lastUpdated;
        """.asUpdate
    }
  }
TODO @me Make DRYer
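One possible way to remove the duplication is to note that the two branches differ only in which id column is filled and which constraint is named. The sketch below is an assumption about how that could look. It uses plain strings with `?` placeholders rather than the project's `q"""..."""` interpolator, and `ObjectId`, `upsertSql` and `artifactColumns` are hypothetical names.

```scala
// Hypothetical stand-in for the project's ObjectId type.
final case class ObjectId(id: String)

// Builds the upsert statement text once; only the conflict constraint differs
// between mags and attachments. Values are left as ? placeholders.
// constraintName must come from the fixed set below, never from user input.
def upsertSql(constraintName: String): String =
  s"""INSERT INTO webknossos.organization_usedStorage (
     |  _organization, _dataset, _dataset_mag, _layer_attachment, path, usedStorageBytes, lastUpdated
     |)
     |VALUES (?, ?, ?, ?, ?, ?, NOW())
     |ON CONFLICT ON CONSTRAINT $constraintName
     |DO UPDATE SET
     |  path = EXCLUDED.path,
     |  usedStorageBytes = EXCLUDED.usedStorageBytes,
     |  lastUpdated = EXCLUDED.lastUpdated;""".stripMargin

// Returns (mag id, attachment id, constraint name) for an artifact id.
def artifactColumns(artifactId: Either[ObjectId, ObjectId]): (Option[ObjectId], Option[ObjectId], String) =
  artifactId match {
    case Left(magId)         => (Some(magId), None, "unique_dataset_mag")
    case Right(attachmentId) => (None, Some(attachmentId), "unique_layer_attachment")
  }
```

With this split, the `match` only decides the three varying pieces, and the statement body is written once. Whether the same factoring works with the project's SQL interpolator is not confirmed here.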
URL of deployed dev instance (used for testing):
Steps to test:
TODOs:
Issues:
(Please delete unneeded items, merge only when none are left open)
Changelog entry ($PR_NUMBER.md file in unreleased_changes, or use ./tools/create-changelog-entry.py)