
Add Google Drive and AWS S3 as a Remote Storage option #503


Open
wants to merge 65 commits into main

Conversation

@cs7-shrey (Collaborator) commented Mar 27, 2025

Description

This pull request is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.

References

Issue #407

How has this PR been tested?

This PR was manually tested in datashuttle's Python API and TUI. Comprehensive tests are yet to be written.

Is this a breaking change?

No

Does this PR require an update to the documentation?

Yes

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@cs7-shrey changed the title from "Add Google Drive as a Remote Storage option" to "Add Google Drive and AWS S3 as a Remote Storage option" on Mar 30, 2025
@cs7-shrey (Collaborator, Author):

Hey @adamltyson @JoeZiminski, I know reviewing this PR isn't feasible at the moment, but it would be great if you could try it out locally whenever you have time and let me know if it works correctly, plus any suggestions you'd have. No hurry at all, I understand if it takes a bit longer.

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.

The only thought I had on the user end was whether we can let them provide a specific Google Drive folder (rather than giving access to all of their drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the Google Drive folder ID and this could be used for the connection. We will have to make it clear that central_path now means relative to the root of that folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets work (but if it is possible, that would be great).
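For illustration, rclone's drive backend does support scoping a remote to a folder via its root_folder_id option, so the configuration could look something like this sketch (the remote name and folder ID are placeholders, not what datashuttle currently uses):

import subprocess

# Hypothetical: create a Google Drive remote rooted at a specific folder,
# so rclone only sees paths under that folder.
subprocess.run(
    [
        "rclone", "config", "create", "central_drive", "drive",
        "scope", "drive",
        "root_folder_id", "<FOLDER_ID>",
    ],
    check=True,
)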

Please let me know if you have any questions!

pipe_std=True,
)

# TODO: make this more robust
Member:
TODO here

Collaborator Author:

Actually, I put this TODO to be absolutely sure that rclone would behave in the sequence of steps we expect, since I had hackishly put together some rclone commands before. But with more testing and reading of rclone's documentation, I think it will most likely behave as expected.

So the only concern here is error handling. The JSON loading will raise an exception itself if the JSON doesn't have the structure we expect. To be sure that rclone's output message is what we suppose it should be, shall we check something like
if <some_string> not in message: raise Exception ?

What do you suggest?
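For example, a rough sketch of such a check (the function name, error type and expected substring are placeholders):

def check_rclone_output(message: str, expected: str) -> None:
    # Raise if rclone's output does not contain the substring we rely on.
    # `expected` stands in for whatever string we decide to check.
    if expected not in message:
        raise RuntimeError(
            f"Unexpected rclone output, expected '{expected}':\n{message}"
        )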

Member:

Hey @cs7-shrey, okay great. I think when we do the full internal / integration tests we can test this case in full on an HPC system; any unexpected issues with how rclone is working will then be caught, so it's good we have this note here to remind us when writing those tests. We can also add a direct test on the rclone output in the test suite rather than an assert, so any issues will be flagged directly in the weekly tests.

@JoeZiminski (Member):

Hey Shrey, thanks for this; I will respond today. Just a quick note before I forget: we should change the order of the radiobuttons that allow selection of the connection type, so 'Local only' is on the left, not the right.

@cs7-shrey requested a review from JoeZiminski on June 9, 2025
@cs7-shrey requested a review from JoeZiminski on June 28, 2025
@cs7-shrey (Collaborator, Author):

Hey @JoeZiminski I refactored the Google Drive connection to the service account method. Also, the rclone ls command was silently exiting even when it did not have access to the Google Drive folder ID, so I changed the approach used in check_successful_connection_and_raise_error_on_fail. It now checks the connection by creating a temp.txt file on the remote and deleting it, although the error message it throws is a bit longer. Let me know what you think of this.
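For reference, the check is conceptually something like this sketch (a hypothetical helper; the real logic lives in check_successful_connection_and_raise_error_on_fail and the exact rclone invocation may differ):

import subprocess

def check_connection_by_temp_file(central_path: str) -> None:
    # Try to create, then delete, a small file on the central storage.
    # If creation fails, we do not have a working (writable) connection.
    temp_file = f"{central_path}/temp.txt"
    created = subprocess.run(
        ["rclone", "touch", temp_file], capture_output=True, text=True
    )
    if created.returncode != 0:
        raise RuntimeError(f"Connection check failed:\n{created.stderr}")
    subprocess.run(["rclone", "deletefile", temp_file], capture_output=True)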

Also, do you remember the # fix comment I had in setup_gdrive.py, in the part of the code that executes when the "cancel" button is pressed during the Google Drive connection setup? The thing is, if the cancel button is pressed once the final rclone connection setup (via browser) has started (in a worker), self.setup_worker.cancel() will surely cancel the worker, but the call_rclone function that invoked subprocess.run is still running. Its terminal equivalent is calling rclone config create ... and closing the browser tab once it opens: the rclone config create ... still hasn't exited, and you'd need to force-stop it with something like ctrl + c.

Long story short, on clicking "cancel", the worker is cancelled, but the subprocess it started (which in turn opened Google's OAuth page in a browser) is still running.

The problem is that, since the previous rclone command is still running, it is still using port 53682 for the OAuth setup. If the user tries to set up the connection again, it fails with rclone saying something like port 53682 is already in use, and the user would have to quit the TUI and run it again.

The easiest fix, I think, is to kill whatever process is listening on port 53682. I don't know how good an approach this is, since we're hardcoding the port here.

import psutil

def kill_process_on_port(port: int) -> None:
    # Find any process listening on `port` and kill it (e.g. a stale
    # rclone OAuth server left behind on port 53682).
    for conn in psutil.net_connections(kind='inet'):
        if conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
            pid = conn.pid
            if pid:
                try:
                    proc = psutil.Process(pid)
                    print(f"Killing PID {pid} ({proc.name()}) listening on port {port}")
                    proc.kill()
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    # The process exited already, or we lack permission.
                    pass

Another way would be to change the call_rclone function itself and maybe use subprocess.Popen, which gives us more control over managing the process. Let me know what you think.

@JoeZiminski (Member):

Hey @cs7-shrey thanks for this! In answer to your questions:

  1. I think this is a good approach; I was thinking to add something similar to the SSH setup as well down the line. The file created could have a long random string for a name and/or a made-up suffix, e.g. fgdgergs.datashuttleTest, to avoid clashes with any existing file. Otherwise I think that's a good approach.
  2. Hmm, that's a tricky one! I think it is easiest to use the Popen approach as you suggest, and kill the process directly. This will probably be a little easier to manage across different operating systems and machines, as the OS can handle its own cleanup at the port. The only thing I'm unsure about is how to get the worker to trigger the killing of the process when it is itself closed; do you have any ideas? In terms of calling with Popen, for now we can just create a function called call_rclone_with_popen that returns the Popen object (see the sketch after this list). I think it is the best solution; otherwise we could:
  • rewrite call_rclone to have an optional use_popen argument and return the process if that is True, but it makes the signature quite confusing just for this one edge case
  • not use a centralised function and just use Popen directly. I was tempted, but I think centralising is better, just so all the different ways rclone is called in the codebase are clear by looking in rclone.py. We can think about tidying this up later (as we will now have call_rclone, call_rclone_with_popen and call_rclone_from_script), based on future use cases that appear. What do you think?
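For concreteness, a minimal sketch of such a helper (the exact Popen options here are an assumption; the real implementation may differ):

import subprocess

def call_rclone_with_popen(command: str) -> subprocess.Popen:
    # Start rclone without blocking. The returned Popen handle lets the
    # caller (e.g. a TUI worker) kill the OS process if the user cancels.
    return subprocess.Popen(
        "rclone " + command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

The worker can then store the returned object (e.g. on Interface) so the UI can call .kill() on it when the user cancels.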

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey this looks great, I just have a comment on the runtime type unpacking from str, which unfortunately doesn't work for Python < 3.11. Otherwise I think this is good to go; we could squash-merge into an aws_gdrive_dev branch and then make a new branch for tests?

I made a couple of changes, let me know what you think. The central_path had to be provided for AWS and Google Drive due to how it used to be set up, but I don't think this made sense in the case where the folder or bucket is the project folder. I set it so it can be None for these connection methods, and the TUI now reflects this. I changed the order of the boxes so the optional ones are together. These changes also led to an opportunity to further refactor the ConfigsContent.

One change made in the most recent commits is to not reload the widgets from configs when the connection-method radiobutton is changed. Now, the content of the widgets is held for the lifetime of the tab session (e.g. a user switching onto, then off, the tab). The reason for this is that it was reloading the Local Path from configs when switching the connection method, which was slightly confusing. A downside of this is that the Central Path content is carried over when switching connection method, which is also slightly confusing. However, on balance I think it is less confusing than the local path switching. Let me know what you think!

@cs7-shrey marked this pull request as ready for review on July 5, 2025
@cs7-shrey requested a review from JoeZiminski on July 5, 2025
@cs7-shrey (Collaborator, Author) commented Jul 5, 2025

Hey @JoeZiminski I introduced call_rclone_with_popen and made related refactors in the Google Drive setup. Also fixed syntax issues that were causing the tests to fail. Please take a look and let me know what you think. If this looks good, I'll go ahead and create another pull request to a new dev branch.

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey, this looks great; some extremely minor suggestions and general discussion about threads / processes. Feel free to move on to the next stage once the minor suggestions are addressed, cheers for this!


if not self.gdrive_setup_process_killed:
    if process.returncode != 0:
        utils.log_and_raise_error(
Member:
I just realized that logging is not handled properly when setting up connections through the TUI (this was an oversight from the SSH implementation). At present, all logging is handled (i.e. set up and torn down) through the datashuttle class public functions. Therefore, if the TUI calls are not routed through these public functions, the logger is not set up. At this stage, I don't think any logger has been set up, and so the logging will not do anything, so we can use utils.raise_error(...). I'll make an issue to address this in a new PR! Luckily it's not critical, as the TUI gives direct feedback to the user if the setup fails, so logging is less important.


process = self.google_drive_rclone_setup_process

if process.poll() is not None:
Member:
I think poll() returns None if the process is still running; in this case we want to check poll() is None? (i.e. the process is still running, so we can kill it?)
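i.e. something like this sketch (assuming process is the stored subprocess.Popen):

# poll() returns None while the process is still running, so None means
# it is safe (and necessary) to kill it.
if process.poll() is None:
    process.kill()
    process.wait()  # reap the killed process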

event.button.id == "setup_gdrive_cancel_button"
or event.button.id == "setup_gdrive_finish_button"
):
if self.setup_worker and self.setup_worker.is_running:
Member:
Maybe a quick comment here # see setup_gdrive_connection_and_update_ui().

gdrive_client_secret, service_account_filepath
)
self.google_drive_rclone_setup_process = process
self.gdrive_setup_process_killed = False
Member:
This flag is to stop race conditions if the process is killed before the worker? That's great protection; I think the docstring can be expanded to explain this. In general, this docstring and the one that sets up the worker can have a lot of documentation explaining the reasoning behind this logic and how it works. It's a really cool setup, as it lets us do some powerful stuff, but working across processes and threads is unfortunately often buggy cross-platform and very difficult to debug, so any added explanations are a benefit.

gdrive_client_secret: Optional[str] = None,
service_account_filepath: Optional[str] = None,
) -> None:
"""Start the Google Drive connection setup in a separate thread.
Member:
Because the setup is quite cryptic (e.g. a thread -> a process, which is then cancelled from here), I think it is worth documenting the reasons for it and its logic in detail, maybe here and then more briefly in Interface.

If I understand correctly at present the logic is:

  1. UI thread spawns a worker in a new thread
  2. this worker thread spawns a new process that calls rclone to check the connection, and stores the process on Interface
  3. The Popen object Python returns is just a Python-side object that represents the OS process, so I don't think it matters that it is passed across thread boundaries (threads share memory). In the end, this process object is simply used to close the OS process, because the thread that owns it does not. If the process is closed before the worker thread, you have the protection flag in the Interface.

I think this is all good, but we can test thoroughly in the next stage cross-platform. This is really nice and powerful, and it's a great template for the future if we run into any similar issues that require process management. On the flip side, working across threads / processes can be buggy and adds complexity to the codebase, so if we run into issues during testing we may need to partially revert this and go a less user-friendly but more maintainable route. But for the time being I think this is very cool, and I can't see a reason at the moment why it wouldn't work cross-platform. It's not that complicated; with some extended documentation it should be quite manageable, and it provides a seamless experience for the user! Cheers for this cool addition, it was very fun to explore.
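To make the pattern concrete, a self-contained sketch of the logic above (names are hypothetical, and the Textual worker machinery is replaced by a plain thread; the real code stores the process on Interface):

import subprocess
import threading
from typing import Optional

class GdriveSetupSketch:
    def __init__(self) -> None:
        self.setup_process: Optional[subprocess.Popen] = None
        self.process_killed = False  # guards the kill-vs-finish race

    def worker(self) -> None:
        # (2) The worker thread spawns the rclone process and stores the
        # handle so another thread can terminate it.
        self.setup_process = subprocess.Popen(
            ["rclone", "version"]  # stand-in for the real setup command
        )
        self.setup_process.wait()
        # Only treat a non-zero return code as a failure if we did not
        # kill the process ourselves.
        if not self.process_killed and self.setup_process.returncode != 0:
            raise RuntimeError("rclone setup failed")

    def start(self) -> threading.Thread:
        # (1) The UI thread spawns the worker in a new thread.
        thread = threading.Thread(target=self.worker, daemon=True)
        thread.start()
        return thread

    def cancel(self) -> None:
        # (3) On cancel, the UI thread kills the OS process via the shared
        # Popen handle; the flag stops the worker treating the resulting
        # non-zero return code as a genuine failure.
        self.process_killed = True
        if self.setup_process is not None and self.setup_process.poll() is None:
            self.setup_process.kill()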

def call_rclone_with_popen(command: str) -> subprocess.Popen:
"""Call rclone using `subprocess.Popen` for control over process termination.

It is not possible to kill a process while running it using `subprocess.run`.
Member:
Maybe add a quick addition here, under what circumstances we may need to kill it (i.e. this is only relevant for running rclone commands from a TUI worker, where the worker will not clean up the process)

is done to have more control over the setup process in case the user wishes to
cancel the setup. Since the rclone setup for google drive uses a local web server
for authentication to google drive, the running process must be killed before the
setup can be started again.
Member:
excellent
