Add Google Drive and AWS S3 as a Remote Storage option #503
Conversation
Hey @adamltyson @JoeZiminski, I know reviewing this PR isn't feasible at the moment, but it would be great if you could try it out locally whenever you have time and let me know whether it works correctly, along with any suggestions you have. No hurry at all; I understand if it takes a bit longer.
JoeZiminski
left a comment
Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.
The only thought I had on the user-end was whether we can let them provide a specific Google Drive folder (rather than giving access to all of their drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the Google Drive folder ID and this could be used for connection. We will have to make it clear that central_path now means relative to the root of that folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets work (but if it is possible, that would be great).
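For context, rclone's Google Drive backend does support scoping a remote to a single folder via its root_folder_id option, so paths on the remote become relative to that folder. A rough sketch of setting this up from the command line (the remote name and folder ID here are placeholders, not datashuttle's actual configuration):

```shell
# Hypothetical remote name and folder ID; rclone's drive backend accepts a
# root_folder_id option, after which all paths are relative to that folder.
rclone config create central_gdrive drive \
    scope drive \
    root_folder_id "EXAMPLE_FOLDER_ID"
```

The folder ID is the trailing component of the folder's URL in the Google Drive web UI.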
Please let me know if you have any questions!
datashuttle/utils/gdrive.py
Outdated
| pipe_std=True,
| )
|
| # TODO: make this more robust
Actually, I put this TODO in to be absolutely sure that rclone would behave in the same sequence of steps as we expect it to, as I hackishly put together some rclone commands before. But with more and more testing and reading of rclone's documentation, I think it will most likely behave as expected.
So, the only concern here is error handling. For the JSON loading, json.loads will raise an exception itself if the output doesn't have the structure we expect. To be sure that rclone's output message is what we suppose it should be, shall we check something like
if <some_string> not in message: raise Exception ?
What do you suggest?
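A minimal sketch of the check being proposed (validate_rclone_message is a hypothetical helper name, not existing datashuttle code): fail loudly when an expected marker string is missing from rclone's output, and let json.loads raise naturally on malformed JSON:

```python
import json


def validate_rclone_message(stdout: str, expected: str) -> dict:
    """Raise if rclone's output does not contain the expected marker,
    then parse it as JSON (json.loads itself raises json.JSONDecodeError
    on malformed output, covering the structural check)."""
    if expected not in stdout:
        raise RuntimeError(
            f"Unexpected rclone output, expected to find '{expected}':\n{stdout}"
        )
    return json.loads(stdout)
```

This keeps the failure mode explicit: a wrong message fails with a descriptive error rather than a confusing KeyError further downstream.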
Hey @cs7-shrey okay great, I think when we do the full internal / integration tests we can test this case in full on an HPC system. Then any unexpected issues with how rclone is working will be caught, so it's good we have this note here to remind us when writing those tests. We can also add a direct test on the rclone output in the test suite rather than an assert, so any issues will be flagged directly in the weekly tests.
Hey Shrey, thanks for this, will respond today. Just a quick note before I forget: we should change the order of the radiobuttons that allow selection of the connection type, so 'Local only' is at the left, not the right.
JoeZiminski
left a comment
There was a problem hiding this comment.
Hey @cs7-shrey this looks great, I just have a comment on the runtime type unpacking from str, which unfortunately doesn't work for Python < 3.11. Otherwise I think this is good to go; we could squash-merge into an aws_gdrive_dev branch and then make a new branch for tests?
I made a couple of changes, let me know what you think. The central_path had to be provided for aws and gdrive due to how it used to be set up, but I don't think this made sense in the case where the folder or bucket is the project folder. I set it so it can be None for these connection methods, and the TUI now reflects this. I changed the order of the boxes so the optional ones are together. These changes opened up an opportunity to further refactor ConfigsContent.
One change made in the most recent commits is to not reload the widgets from configs when the connection method radiobutton is changed. Now, the content of the widgets is held for the lifetime of the tab session (e.g. a user switching onto, then off, the tab). The reason for this is that it was reloading Local Path from configs when switching the connection method, which was slightly confusing. A downside is that the Central Path content is carried over when switching connection method, which is slightly confusing too. However, on balance I think it is less confusing than the local path switching. Let me know what you think!
Hey @JoeZiminski I introduced
JoeZiminski
left a comment
Hey @cs7-shrey this looks great, some extremely minor suggestions and general discussion about thread / processes. Feel free to move onto the next stage after the minor suggestions are addressed, cheers for this!
datashuttle/tui/interface.py
Outdated
| process = self.google_drive_rclone_setup_process
|
| if process.poll() is not None:
I think poll() returns None if the process is still running; in this case we want to check poll() is None? (i.e. the process is still running, so we can kill it?)
Yes, just to be clear, to kill the process if it's running.
okay great, unless I am mistaken I think poll() counter-intuitively will return None when the process is running, so the line should be:
if process.poll() is None:
self.gdrive_setup_process_killed = True
process.kill()
oh yes, you're right. minor oversight by me.
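The behaviour under discussion is easy to verify directly. A minimal sketch (using a throwaway sleeping child process as a stand-in for the rclone process) showing that Popen.poll() returns None while the process is alive, and a return code once it has been killed and reaped:

```python
import subprocess
import sys

# Spawn a child that would run for a while (stand-in for the rclone process).
process = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"]
)

print(process.poll())  # None -> still running, so it is safe to kill it
process.kill()
process.wait()         # reap the child so poll() reflects the exit status
print(process.poll())  # non-None return code (e.g. a negative signal on POSIX)
```

This confirms the counter-intuitive convention: None means "still running", not "failed".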
| gdrive_client_secret: Optional[str] = None,
| service_account_filepath: Optional[str] = None,
| ) -> None:
|     """Start the Google Drive connection setup in a separate thread.
Because the setup is quite cryptic (e.g. a thread -> a process which is then cancelled from here), I think it is worth documenting the reasons for it and the logic in detail, maybe here and then more briefly in Interface.
If I understand correctly, at present the logic is:
- the UI thread spawns a worker in a new thread
- this worker thread spawns a new process that calls rclone to check the connection, and stores the process on Interface
- the Popen object Python returns is just a Python-side handle representing the OS process, so I don't think it matters that this is passed across thread boundaries (which share memory). In the end, this process object is simply used to close the OS process because the owning thread does not. If the process is closed before the worker thread, you have the protection flag in the interface.
I think this is all good but we can test thoroughly in the next stage cross-platform. This is really nice and powerful, and it's a great template for the future if we run into any similar issues that require process management. On the flip side, working across threads / processes can be buggy and adds complexity to the codebase, so if we run into issues during testing we may need to partially revert this and go a less user-friendly but more maintainable route. But for the time being I think this is very cool and I can't see a reason at the moment why it wouldn't work cross-platform. It's not that complicated, with some extended documentation it should be quite manageable, and it provides a seamless experience for the user! Cheers for this cool addition, it was very fun to explore.
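As a rough illustration of the pattern described above (class and attribute names here are simplified stand-ins, not datashuttle's actual code): the worker thread owns the blocking wait(), while the Popen handle stored on the interface lets another thread cancel the setup, with a flag recording that the kill was deliberate:

```python
import subprocess
import threading


class Interface:
    """Sketch of the thread -> process pattern: a worker thread launches
    the (rclone) process; the UI thread can later kill it via the stored
    Popen handle, setting a protection flag so the worker knows the
    termination was intentional."""

    def __init__(self):
        self.setup_process = None
        self.setup_process_killed = False

    def start_setup(self, cmd):
        # UI thread spawns a worker thread so the blocking wait()
        # does not freeze the UI.
        worker = threading.Thread(target=self._run, args=(cmd,), daemon=True)
        worker.start()
        return worker

    def _run(self, cmd):
        # Popen returns immediately; the handle is stored on the shared
        # object so other threads (same memory space) can reach it.
        self.setup_process = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        self.setup_process.wait()

    def cancel_setup(self):
        # Called from the UI thread: only kill if still running
        # (poll() is None), and flag the deliberate kill first.
        process = self.setup_process
        if process is not None and process.poll() is None:
            self.setup_process_killed = True
            process.kill()
```

The flag-before-kill ordering matters: when wait() returns in the worker, it can distinguish a user-requested cancel from an rclone failure.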
Great! Thanks @cs7-shrey, this all looks good to me. Once you've made the new branch, let me know and I'll set the target branch to your new branch and approve. Great work!
Hey @JoeZiminski, I've made the new branch and set it as the target for this PR.
Merged 0a6c69d into neuroinformatics-unit:add_gdrive_aws_remote
* google drive setup via python api first draft
* enable google drive config setup via tui
* minor compatibility and ui changes
* protected client secret input box
* google drive connection setup via TUI
* add aws as remote storage via python api first draft
* add: logging and connection check for aws s3
* update: type checking for aws regions
* add: save aws configs via TUI
* add: setup aws connection via TUI
* feat: setup google drive on machines with no browser
* fix: minor bug
* fix: logical error
* add: logging for google drive connections
* refactor: move google drive client secret to be entered at runtime while setting up connection
* refactor: aws_regions.py; provide aws secret access key at runtime
* add: docstrings to gdrive.py
* add: root_folder_id config to google drive; some refactor
* refactor: radiobuttons switch in configs.py
* edit: minor changes to SetupAwsScreen for setting up aws connection
* refactor: SetupGdriveScreen and handle errors
* add: some tooltips for google drive configs
* fix: vanishing central path, radio button order, minor refactor
* fix: minor bug
* refactor: single button for setup connection
* add: backwards compatibility to configs while load from config file
* edit: raise error on bucket not present
* rename: aws region config key
* rename: connection method from aws_s3 to aws
* add: utility function to remove duplicate code
* add: docstrings to setup gdrive dialog
* update: config dict inplace change for backward compatibility, use existing rclone function, moving type hint imports in the conditional block
* add: docstrings to setup connection functions; remove: aws region class
* add: docstrings to setup widgets function; use backwards compatibility
* add: docstrings to rclone function, change argument order
* minor changes
* refactor: centralize the get secret function
* extend centralized function for sensitive information input to ssh connection method
* convert stage from float to int
* add: function for getting aws bucket name
* refactor: connection methods list
* move: widgets not match saved configs
* Fix linting.
* fix: overly long container in configs tab
* add: first version of service account file setup method
* remove: utility functions and docstrings in python api for config token setup method
* fix: css
* update: setup gdrive screen to use service account file and refactor state transition
* rename: existing functions from config token -> service account
* edit: docstrings in accordance with new connection setup
* update: function for checking successful rclone connection
* edit: setup aws screen css
* Fix assert in CreateFolders.
* Allow no central_path for aws or gdrive and refactor configs_content.
* Fix tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Fix linting.
* Add tests for new base path behaviour.
* add: call rclone with popen; refactor: google drive connection to use popen
* fix: failing typehint on python 3.9
* fix: another failing typehint
* fix: failing monkeypatch
* add: docstrings to functions involved in killing rclone setup process
* fix: minor bug

Co-authored-by: shrey <shrey@kali>
Co-authored-by: JoeZiminski <joseph.j.ziminski@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
This pull request is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.
What is this PR
Why is this PR needed?
What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.
References
Issue #407
How has this PR been tested?
This PR was manually tested through datashuttle's Python API and TUI. Comprehensive automated tests are yet to be written.
Is this a breaking change?
No
Does this PR require an update to the documentation?
Yes
Checklist: