Skip to content

Add Google Drive and AWS S3 as a Remote Storage option#503

Merged
JoeZiminski merged 67 commits intoneuroinformatics-unit:add_gdrive_aws_remotefrom
cs7-shrey:add_gdrive_aws_remote
Jul 17, 2025
Merged

Add Google Drive and AWS S3 as a Remote Storage option#503
JoeZiminski merged 67 commits intoneuroinformatics-unit:add_gdrive_aws_remotefrom
cs7-shrey:add_gdrive_aws_remote

Conversation

@cs7-shrey
Copy link
Copy Markdown
Collaborator

@cs7-shrey cs7-shrey commented Mar 27, 2025

Description

This pull requests is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.

References

Issue #407

How has this PR been tested?

This PR was manually tested in the Datashuttle's Python API and TUI. Comprehensive tests are yet to be written.

Is this a breaking change?

No

Does this PR require an update to the documentation?

Yes

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@cs7-shrey cs7-shrey changed the title Add Google Drive as a Remote Storage option Add Google Drive and AWS S3 as a Remote Storage option Mar 30, 2025
@cs7-shrey
Copy link
Copy Markdown
Collaborator Author

Hey @adamltyson @JoeZiminski, I know reviewing this PR wouldn't be feasible at the moment. But it would be great if you can try it out locally whenever you have time and let me know if it works correctly / any suggestions you'd have. No hurry at all, I understand if it takes a bit longer.

Copy link
Copy Markdown
Member

@JoeZiminski JoeZiminski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.

The only thought I had on the user-end was whether we can let them provide a specific google-drive folder (rather than giving access to all of their drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the google drive folder ID and this could be used for connection. We will have to make it clear that central_path now means relative to the root of the folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets works (but if it is possible, that would be great).

Please let me know if you have any questions!

pipe_std=True,
)

# TODO: make this more robust
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I put this TODO so as to be absolutely sure that rclone would behave in the same sequence of steps as we expect it to as I hackishly put together some rclone commands before. But with more and more testing and reading rclone's docmentation, I think it will most likely behave as expected.

So, the only concern here is error handling, for the json loading, it will raise any exception itself if the json doesn't have the structure we expect. To be sure that the rclone's output message is what we suppose it should be, shall we check something like
if <some_string> not in message: raise Exception ?

what do you suggest?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cs7-shrey okay great, I think when we do the full internal / integration tests we can test this case in full on an HPC system. Then any unexpected issues with how rclone is working will be caught, so that's good we have this note here to remind us when writing these tests. We can also add a direct test on the rclone output in the test suite rather than an assert, then any issues will be flagged directly on the weekly tests.

@JoeZiminski
Copy link
Copy Markdown
Member

Hey Shrey thanks for this, will respond today, just a quick note before I forget, we should change the order of the radiobuttons that allow selection of the connection type, so 'Local only' is at the left not the right

@cs7-shrey cs7-shrey requested a review from JoeZiminski June 9, 2025 20:06
Copy link
Copy Markdown
Member

@JoeZiminski JoeZiminski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cs7-shrey this looks great, I just have a comment on the runtime type unpacking from str which unfortunately does't work for python < 3.11. Otherwise I think this is good to go, we could squash-merge into a aws_gdrive_dev branch and then make a new branch for tests?

I made a couple of changes, let me know what you think. The central_path had to be provided for aws and gdrive due to how it used to be set up, but I don't think this made sense in case the folder or bucket is the project folder. I set it so it can be None for these connection methods, and the TUI now reflects this. I changed the order of the boxes so the optional ones are together. Because of these changes, it led to an opportunity to refactor further the ConfigsContent.

Once change made is in the most recent commits is to not reload the widgets from configs when the connection method radiobutton is changed. Now, the content of the widgets is held for the lifetime of the tab session (e.g. a user switching onto, then off, the tab). The reason for this is that it was reloading Local Path from configs when switching the connection method, which was slightly confusing. A downside of this is that the Central Path content is carried over when switching connection method, which is slightly confusing too. However, on balance I think it is less confusing than the local path switching. Let me know what you think!

@cs7-shrey cs7-shrey marked this pull request as ready for review July 5, 2025 12:26
@cs7-shrey cs7-shrey requested a review from JoeZiminski July 5, 2025 12:26
@cs7-shrey
Copy link
Copy Markdown
Collaborator Author

cs7-shrey commented Jul 5, 2025

Hey @JoeZiminski I introduced call_rclone_with_popen and made related refactors in the google drive setup. Also fixed syntax issues that were causing the tests to fail. Please take and look and let me know what you think. If this looks good, I'll go ahead and create another pull request to a new dev branch.

Copy link
Copy Markdown
Member

@JoeZiminski JoeZiminski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cs7-shrey this looks great, some extremely minor suggestions and general discussion about thread / processes. Feel free to move onto the next stage after the minor suggestions are addressed, cheers for this!


process = self.google_drive_rclone_setup_process

if process.poll() is not None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think poll() returns None if the process is still running, in this case we want to check poll() is None? (i.e. the process is still running so we can kill it?`

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just to be clear to kill the process if it's running.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay great, unless I am mistaken I think poll() counter-intuitively will return None when the process is running, so the line should be:

if process.poll() is None:
    self.gdrive_setup_process_killed = True
    process.kill()

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes, you're right. minor oversight by me.

gdrive_client_secret: Optional[str] = None,
service_account_filepath: Optional[str] = None,
) -> None:
"""Start the Google Drive connection setup in a separate thread.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the set up is quite cryptic (e.g. a thread -> a process which is then cancelled from here) I think it is worth documenting in detail the reasons for it and logic, maybe here and then more briefly in interface.

If I understand correctly at present the logic is:

  1. UI thread spawns a worker in a new thread
  2. this worker thread spawns a new process that calls rclone to check the connection, and stores the process on Interface
  3. The POpen process python returns is just an python-side object that represents the OS process. So I don't think it matters that this is passed across thread boundaries (which share memory). In the end, this process object is simply used to close the OS process because the owning thread does not. If the process is closed before the worker thread, you have the protection flag in the interface.

I think this is all good but we can test thoroughly in the next stage cross-platform. This is really nice and powerful, and its a great template for the future if we run into any similar issues that require process managment. On the flip side, working across threads / processes can be buggy and adds complexity to the codebase, so if we are running into issues during testing we may need to partially revert this and go a less user-friendly but more maintainable route. But for the time being I think this is very cool and I can't see a reason at the moment why it wouldn't work cross platform. It's not that complicated, with some extended documentation it should be quite manageable, and provides a seamless experience for the user! Cheers for this cool addition, it was very fun to explore.

@cs7-shrey cs7-shrey requested a review from JoeZiminski July 14, 2025 11:48
@JoeZiminski
Copy link
Copy Markdown
Member

Great! Thanks @cs7-shrey this all looks good to me, Once you've made the new branch let me know and I'll set the target branch to your new branch and approve. Great work!

@cs7-shrey cs7-shrey changed the base branch from main to add_gdrive_aws_remote July 14, 2025 18:14
@cs7-shrey
Copy link
Copy Markdown
Collaborator Author

Hey @JoeZiminski, I've made the new branch and set it to target for this PR.

@JoeZiminski JoeZiminski merged commit 0a6c69d into neuroinformatics-unit:add_gdrive_aws_remote Jul 17, 2025
16 checks passed
cs7-shrey added a commit that referenced this pull request Aug 1, 2025
* google drive setup via python api first draft

* enable google drive config setup via tui

* minor compatibility and ui changes

* protectedclient secret input box

* google drive connection setup via TUI

* add aws as remote storage via python api first draft

* add: logging and connection check for aws s3

* update: type checking for aws regions

* add: save aws configs via TUI

* add: setup aws connection via TUI

* feat: setup google drive on machines with no browser

* fix: minor bug

* fix: logical error

* add: logging for google drive connections

* refactor: move google drive client secret to be entered at runtime while setting up connection

* refactor: aws_regions.py; provide aws secret access key at runtime

* add: docstrings to gdrive.py

* add: root_folder_id config to google drive; some refactor

* refactor: radiobuttons switch in configs.py

* edit: minor changes to SetupAwsScreen for setting up aws connection

* refactor: SetupGdriveScreen and handle errors

* add: some tooltips for google drive configs

* fix: vanishing central path, radio button order, minor refactor

* fix: minor bug

* refactor: single button for setup connection

* add: backwards compatibility to configs while load from config file

* edit: raise error on bucket not present

* rename: aws region config key

* rename: connection method from aws_s3 to aws

* add: utility function to remove duplicate code

* add: docstrings to setup gdrive dialog

* update: config dict inplace change for backward compatibility, use existing rclone function, moving type hint imports in the conditional block

* add: docstrings to setup connection functions; remove: aws region class

* add: docstrings to setup widgets function; use backwards compatibility

* add: docstrings to rclone function, change arugment order

* minor changes

* refactor: centralize the get secret function

* extend centralized function for sensitive information input to ssh connection method

* convert stage from float to int

* add: function for getting aws bucket name

* refactor: connection methods list

* move: widgets not match saved configs

* Fix linting.

* fix: overly long container in configs tab

* add: first version of service account file setup method

* remove: utility functions and docstrings in python api for config token setup method

* fix: css

* update: setup gdrive screen to use service account file and refactor state transition

* rename: existing functions from config token -> service account

* edit: docstrings in accordance with new connection setup

* update: function for checking successful rclone connection

* edit: setup aws screen css

* Fix assert in CreateFolders.

* Allow no central_path for aws or gdrive and refactor configs_content.

* Fix tests.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linting.

* Add tests for new base path behaviour.

* add: call rclone with popen; refactor: google drive connection to use popen

* fix: failing typehint on python 3.9

* fix: another failing typehint

* fix: failing monkeypatch

* add: docstrings to functions invovled in killing rclone setup process

* fix: minor bug

---------

Co-authored-by: shrey <shrey@kali>
Co-authored-by: JoeZiminski <joseph.j.ziminski@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
JoeZiminski added a commit that referenced this pull request Aug 27, 2025
* google drive setup via python api first draft

* enable google drive config setup via tui

* minor compatibility and ui changes

* protectedclient secret input box

* google drive connection setup via TUI

* add aws as remote storage via python api first draft

* add: logging and connection check for aws s3

* update: type checking for aws regions

* add: save aws configs via TUI

* add: setup aws connection via TUI

* feat: setup google drive on machines with no browser

* fix: minor bug

* fix: logical error

* add: logging for google drive connections

* refactor: move google drive client secret to be entered at runtime while setting up connection

* refactor: aws_regions.py; provide aws secret access key at runtime

* add: docstrings to gdrive.py

* add: root_folder_id config to google drive; some refactor

* refactor: radiobuttons switch in configs.py

* edit: minor changes to SetupAwsScreen for setting up aws connection

* refactor: SetupGdriveScreen and handle errors

* add: some tooltips for google drive configs

* fix: vanishing central path, radio button order, minor refactor

* fix: minor bug

* refactor: single button for setup connection

* add: backwards compatibility to configs while load from config file

* edit: raise error on bucket not present

* rename: aws region config key

* rename: connection method from aws_s3 to aws

* add: utility function to remove duplicate code

* add: docstrings to setup gdrive dialog

* update: config dict inplace change for backward compatibility, use existing rclone function, moving type hint imports in the conditional block

* add: docstrings to setup connection functions; remove: aws region class

* add: docstrings to setup widgets function; use backwards compatibility

* add: docstrings to rclone function, change arugment order

* minor changes

* refactor: centralize the get secret function

* extend centralized function for sensitive information input to ssh connection method

* convert stage from float to int

* add: function for getting aws bucket name

* refactor: connection methods list

* move: widgets not match saved configs

* Fix linting.

* fix: overly long container in configs tab

* add: first version of service account file setup method

* remove: utility functions and docstrings in python api for config token setup method

* fix: css

* update: setup gdrive screen to use service account file and refactor state transition

* rename: existing functions from config token -> service account

* edit: docstrings in accordance with new connection setup

* update: function for checking successful rclone connection

* edit: setup aws screen css

* Fix assert in CreateFolders.

* Allow no central_path for aws or gdrive and refactor configs_content.

* Fix tests.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linting.

* Add tests for new base path behaviour.

* add: call rclone with popen; refactor: google drive connection to use popen

* fix: failing typehint on python 3.9

* fix: another failing typehint

* fix: failing monkeypatch

* add: docstrings to functions invovled in killing rclone setup process

* fix: minor bug

---------

Co-authored-by: shrey <shrey@kali>
Co-authored-by: JoeZiminski <joseph.j.ziminski@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for Amazon AWS and Google Drive

2 participants