Add Google Drive and AWS S3 as a Remote Storage option #503
Conversation
Hey @adamltyson @JoeZiminski, I know reviewing this PR wouldn't be feasible at the moment. But it would be great if you could try it out locally whenever you have time and let me know if it works correctly, and any suggestions you'd have. No hurry at all, I understand if it takes a bit longer.
Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.

The only thought I had on the user end was whether we can let them provide a specific Google Drive folder (rather than giving access to all of their drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the Google Drive folder ID and this could be used for connection. We would have to make it clear that `central_path` now means relative to the root of that folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets work (but if it is possible, that would be great).
Please let me know if you have any questions!
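For reference, rclone's Google Drive backend does support scoping a remote to a single folder via its `root_folder_id` option, so something like the following could work (the remote name and folder ID below are placeholders, not datashuttle conventions):

```shell
# Create a Drive remote that can only see one folder.
# "my_gdrive" and the folder ID are made-up placeholders; the real ID is
# the trailing component of the folder's drive.google.com URL.
rclone config create my_gdrive drive root_folder_id 1AbCdEfGhIjKlMnOp
```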
datashuttle/utils/gdrive.py
```python
    pipe_std=True,
)

# TODO: make this more robust
```
TODO here
Actually, I put this TODO so as to be absolutely sure that rclone would behave in the same sequence of steps as we expect, as I hackishly put together some rclone commands before. But with more and more testing and reading of rclone's documentation, I think it will most likely behave as expected.

So the only concern here is error handling. For the JSON loading, it will raise an exception itself if the JSON doesn't have the structure we expect. To be sure that rclone's output message is what we suppose it should be, shall we check something like `if <some_string> not in message: raise Exception`? What do you suggest?
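A throwaway illustration of that kind of check (the names here are made up for the example, not datashuttle's actual API):

```python
def check_rclone_output(message: str, expected: str) -> None:
    """Raise if rclone's output lacks the substring we rely on.

    `expected` is whatever marker we assume indicates success;
    both names here are hypothetical.
    """
    if expected not in message:
        raise RuntimeError(
            f"Unexpected rclone output, expected to find {expected!r}:\n{message}"
        )

check_rclone_output("config created: OK", "OK")  # passes silently
```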
Hey @cs7-shrey okay great, I think when we do the full internal / integration tests we can test this case in full on an HPC system. Then any unexpected issues with how rclone is working will be caught, so it's good we have this note here to remind us when writing these tests. We can also add a direct test on the rclone output in the test suite rather than an assert, so any issues will be flagged directly on the weekly tests.
Hey Shrey, thanks for this, will respond today. Just a quick note before I forget: we should change the order of the radiobuttons that allow selection of the connection type, so 'Local only' is at the left, not the right.
Hey @JoeZiminski I refactored the google drive connection to the service account method. Also, do you remember the problem I mentioned? Long story short, on clicking "cancel", the worker is cancelled but the subprocess that was started by it (which in turn opened Google's OAuth page in a browser) is still running. The problem is, since the previous rclone command is still running, it is still using the port. The easiest fix I think is to kill whatever process is listening on that port:

```python
import psutil

def kill_process_on_port(port):
    for conn in psutil.net_connections(kind="inet"):
        if conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
            pid = conn.pid
            if pid:
                try:
                    proc = psutil.Process(pid)
                    print(f"Killing PID {pid} ({proc.name()}) listening on port {port}")
                    proc.kill()
                except Exception:
                    pass
```

Another way would be to change the port.
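Another option, if we control how the subprocess is started, is to put it in its own process group and kill the whole group; this avoids scanning ports entirely. A POSIX-only sketch (not datashuttle code; a sleeping Python process stands in for rclone):

```python
import os
import signal
import subprocess
import sys

# Start the command in its own session so that killing its process
# group also kills anything it spawned (e.g. an OAuth helper server).
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"],
    start_new_session=True,
)

def kill_with_children(p: subprocess.Popen) -> None:
    # POSIX-only: signal the entire process group
    os.killpg(os.getpgid(p.pid), signal.SIGTERM)

kill_with_children(proc)
proc.wait()
```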
Hey @cs7-shrey thanks for this! In answer to your questions:
Hey @cs7-shrey this looks great, I just have a comment on the runtime type unpacking from `str`, which unfortunately doesn't work for Python < 3.11. Otherwise I think this is good to go; we could squash-merge into an `aws_gdrive_dev` branch and then make a new branch for tests?
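For what it's worth, one version-portable pattern is to keep the `Literal` alias as an object and call `typing.get_args` on it directly, which works back to Python 3.8 (the alias name and values below are made up for illustration):

```python
from typing import Literal, get_args

# Hypothetical alias; datashuttle's real connection-method names may differ.
ConnectionMethod = Literal["local_filesystem", "ssh", "gdrive", "aws"]

# get_args on the alias object (not on a string annotation) is portable.
CONNECTION_METHODS = get_args(ConnectionMethod)
```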
I made a couple of changes, let me know what you think. The central_path
had to be provided for aws
and gdrive
due to how it used to be set up, but I don't think this made sense in case the folder or bucket is the project folder. I set it so it can be None
for these connection methods, and the TUI now reflects this. I changed the order of the boxes so the optional ones are together. Because of these changes, it led to an opportunity to refactor further the ConfigsContent
.
Once change made is in the most recent commits is to not reload the widgets from configs when the connection method radiobutton is changed. Now, the content of the widgets is held for the lifetime of the tab session (e.g. a user switching onto, then off, the tab). The reason for this is that it was reloading Local Path
from configs when switching the connection method, which was slightly confusing. A downside of this is that the Central Path
content is carried over when switching connection method, which is slightly confusing too. However, on balance I think it is less confusing than the local path switching. Let me know what you think!
Hey @JoeZiminski I introduced
Hey @cs7-shrey this looks great, some extremely minor suggestions and general discussion about thread / processes. Feel free to move onto the next stage after the minor suggestions are addressed, cheers for this!
```python
if not self.gdrive_setup_process_killed:
    if process.returncode != 0:
        utils.log_and_raise_error(
```
I just realized that logging is not handled properly when setting up connections through the TUI (this was an oversight from the SSH implementation). At present, all logging is handled (i.e. set up and torn down) through the datashuttle class public functions. Therefore if the TUI calls are not routed through these public functions, the logger is not set up. At this stage, I don't think any logger has been set up, so the logging will not do anything, and we can use `utils.raise_error(...)`. I'll make an issue to address this in a new PR! Luckily it's not critical, as the TUI gives direct feedback to the user if the setup fails, so logging is less important.
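A rough sketch of the distinction between the two helpers (the real datashuttle implementations may differ):

```python
import logging

def raise_error(message: str) -> None:
    # safe to call before any logger has been configured,
    # e.g. during TUI connection setup
    raise RuntimeError(message)

def log_and_raise_error(message: str) -> None:
    # intended for the public API functions, where a logger has
    # already been set up around the call
    logging.getLogger("datashuttle").error(message)
    raise RuntimeError(message)
```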
```python
process = self.google_drive_rclone_setup_process

if process.poll() is not None:
```
I think `poll()` returns `None` if the process is still running; in this case we want to check `poll() is None`? (i.e. the process is still running, so we can kill it?)
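A quick self-contained check of the `poll()` semantics (stdlib only, nothing datashuttle-specific; a sleeping Python process stands in for rclone):

```python
import subprocess
import sys

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

# While the process is running, poll() returns None...
assert proc.poll() is None

proc.kill()
proc.wait()

# ...and once it has exited, poll() returns the return code.
assert proc.poll() is not None
```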
```python
    event.button.id == "setup_gdrive_cancel_button"
    or event.button.id == "setup_gdrive_finish_button"
):
    if self.setup_worker and self.setup_worker.is_running:
```
Maybe a quick comment here: `# see setup_gdrive_connection_and_update_ui()`.
```python
    gdrive_client_secret, service_account_filepath
)
self.google_drive_rclone_setup_process = process
self.gdrive_setup_process_killed = False
```
This flag is to stop race conditions if the process is killed before the worker? That's great protection, I think the docstring can be expanded to explain this. In general this docstring and the one that sets up the worker can have a lot of documentation explaining the reasoning behind this logic and how it works. It's a really cool set up as it lets us do some powerful stuff, but working across processes and threads is unfortunately often buggy cross-platform and very difficult to debug, so any added explanations are a benefit.
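A minimal sketch of the flag pattern being discussed (class and attribute names mirror the snippets above, but the bodies are assumptions, not datashuttle's actual code; a sleeping Python process stands in for rclone):

```python
import subprocess
import sys

class Interface:
    """Hypothetical sketch of the killed-flag protection."""

    def __init__(self):
        self.gdrive_setup_process_killed = False
        self.google_drive_rclone_setup_process = None

    def start(self):
        self.gdrive_setup_process_killed = False
        self.google_drive_rclone_setup_process = subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(30)"]
        )

    def cancel(self):
        # set the flag *before* killing, so the worker sees it
        self.gdrive_setup_process_killed = True
        self.google_drive_rclone_setup_process.kill()

    def finish(self):
        process = self.google_drive_rclone_setup_process
        process.wait()
        # a nonzero return code from a deliberate kill is not an error
        if not self.gdrive_setup_process_killed and process.returncode != 0:
            raise RuntimeError("rclone setup failed")

iface = Interface()
iface.start()
iface.cancel()
iface.finish()  # no error: the kill was deliberate
```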
```python
    gdrive_client_secret: Optional[str] = None,
    service_account_filepath: Optional[str] = None,
) -> None:
    """Start the Google Drive connection setup in a separate thread.
```
Because the setup is quite cryptic (e.g. a thread -> a process, which is then cancelled from here) I think it is worth documenting the reasons for it and the logic in detail, maybe here and then more briefly in `interface`.

If I understand correctly, at present the logic is:

- the UI thread spawns a worker in a new thread
- this worker thread spawns a new process that calls `rclone` to check the connection, and stores the process on `Interface`
- the `Popen` object Python returns is just a Python-side object that represents the OS process. So I don't think it matters that this is passed across thread boundaries (which share memory). In the end, this process object is simply used to close the OS process because the owning thread does not. If the process is closed before the worker thread, you have the protection flag in the interface.
I think this is all good, but we can test thoroughly cross-platform in the next stage. This is really nice and powerful, and it's a great template for the future if we run into any similar issues that require process management. On the flip side, working across threads / processes can be buggy and adds complexity to the codebase, so if we run into issues during testing we may need to partially revert this and go a less user-friendly but more maintainable route. But for the time being I think this is very cool and I can't see a reason at the moment why it wouldn't work cross-platform. It's not that complicated, with some extended documentation it should be quite manageable, and it provides a seamless experience for the user! Cheers for this cool addition, it was very fun to explore.
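The thread -> process handoff described above can be sketched with the stdlib alone (all names here are stand-ins, not datashuttle's; a sleeping Python process plays the role of rclone):

```python
import subprocess
import sys
import threading

class SetupState:
    """Hypothetical stand-in for the interface holding the rclone process."""

    def __init__(self):
        self.process = None
        self.process_ready = threading.Event()
        self.killed = False

state = SetupState()

def worker():
    # The worker thread spawns the OS process and stores the Popen
    # handle, which the UI thread can later use to cancel it.
    state.process = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    state.process_ready.set()
    state.process.wait()

thread = threading.Thread(target=worker)
thread.start()

# The "UI" thread cancels the setup: set the flag first, then kill.
state.process_ready.wait()
state.killed = True
state.process.kill()
thread.join()
```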
```python
def call_rclone_with_popen(command: str) -> subprocess.Popen:
    """Call rclone using `subprocess.Popen` for control over process termination.

    It is not possible to kill a process while running it using `subprocess.run`.
```
Maybe add a quick addition here about the circumstances under which we may need to kill it (i.e. this is only relevant for running rclone commands from a TUI worker, where the worker will not clean up the process).
```python
    is done to have more control over the setup process in case the user wishes to
    cancel the setup. Since the rclone setup for google drive uses a local web server
    for authentication to google drive, the running process must be killed before the
    setup can be started again.
```
excellent
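For concreteness, the helper discussed above could look roughly like this (a sketch only; the actual datashuttle implementation may build and pipe the command differently):

```python
import subprocess

def call_rclone_with_popen(command: str) -> subprocess.Popen:
    """Run an rclone command via Popen so the caller can kill it later,
    e.g. if the user cancels the Google Drive setup from the TUI."""
    return subprocess.Popen(
        "rclone " + command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

# The caller keeps the handle and may call proc.kill() at any time.
proc = call_rclone_with_popen("version")
proc.wait()
```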
Description
This pull request is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.
What is this PR
Why is this PR needed?
What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.
References
Issue #407
How has this PR been tested?
This PR was manually tested in datashuttle's Python API and TUI. Comprehensive tests are yet to be written.
Is this a breaking change?
No
Does this PR require an update to the documentation?
Yes
Checklist: