
Add Google Drive and AWS S3 as a Remote Storage option #503


Open
wants to merge 65 commits into main

Conversation

@cs7-shrey (Collaborator) commented Mar 27, 2025

Description

This pull request is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.

References

Issue #407

How has this PR been tested?

This PR was manually tested in datashuttle's Python API and TUI. Comprehensive tests are yet to be written.

Is this a breaking change?

No

Does this PR require an update to the documentation?

Yes

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@cs7-shrey changed the title from "Add Google Drive as a Remote Storage option" to "Add Google Drive and AWS S3 as a Remote Storage option" on Mar 30, 2025
@cs7-shrey (Collaborator, Author):

Hey @adamltyson @JoeZiminski, I know reviewing this PR isn't feasible at the moment, but it would be great if you could try it out locally whenever you have time and let me know if it works correctly, plus any suggestions you'd have. No hurry at all, I understand if it takes a bit longer.

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.

The only thought I had on the user end was whether we can let them provide a specific Google Drive folder (rather than giving access to all of their drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the Google Drive folder ID and this could be used for the connection. We will have to make it clear that central_path now means relative to the root of that folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets work (but if it is possible, that would be great).
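For illustration, rclone's drive backend does support scoping a remote to a folder via its root_folder_id option, so the configuration could look something like this sketch (the remote name and folder ID are placeholders, not what datashuttle currently uses):

import subprocess

# Hypothetical: create a Google Drive remote rooted at a specific folder,
# so rclone only sees paths under that folder.
subprocess.run(
    [
        "rclone", "config", "create", "central_drive", "drive",
        "scope", "drive",
        "root_folder_id", "<FOLDER_ID>",
    ],
    check=True,
)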

Please let me know if you have any questions!

pipe_std=True,
)

# TODO: make this more robust
Member:
TODO here

Collaborator Author:

Actually, I put this TODO to be absolutely sure that rclone would behave in the sequence of steps we expect, since I had hackishly put together some rclone commands before. But with more testing and reading of rclone's documentation, I think it will most likely behave as expected.

So the only concern here is error handling. The JSON loading will raise an exception itself if the JSON doesn't have the structure we expect. To be sure that rclone's output message is what we suppose it should be, shall we check something like
if <some_string> not in message: raise Exception ?

What do you suggest?
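For example, a rough sketch of such a check (the function name, error type and expected substring are placeholders):

def check_rclone_output(message: str, expected: str) -> None:
    # Raise if rclone's output does not contain the substring we rely on.
    # `expected` stands in for whatever string we decide to check.
    if expected not in message:
        raise RuntimeError(
            f"Unexpected rclone output, expected '{expected}':\n{message}"
        )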

Member:

Hey @cs7-shrey, okay great. I think when we do the full internal / integration tests we can test this case in full on an HPC system; any unexpected issues with how rclone is working will then be caught, so it's good we have this note here to remind us when writing those tests. We can also add a direct test on the rclone output in the test suite rather than an assert, so any issues will be flagged directly in the weekly tests.

@JoeZiminski (Member):

Hey Shrey, thanks for this; I will respond today. Just a quick note before I forget: we should change the order of the radiobuttons that allow selection of the connection type, so 'Local only' is on the left, not the right.

@cs7-shrey requested a review from JoeZiminski on June 9, 2025
@cs7-shrey requested a review from JoeZiminski on June 28, 2025
@cs7-shrey (Collaborator, Author):

Hey @JoeZiminski I refactored the Google Drive connection to the service account method. Also, the rclone ls command was silently exiting even when it did not have access to the Google Drive folder ID, so I changed the approach used in check_successful_connection_and_raise_error_on_fail. It now checks the connection by creating a temp.txt file on the remote and deleting it, although the error message it throws is a bit longer. Let me know what you think of this.
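For reference, the check is conceptually something like this sketch (a hypothetical helper; the real logic lives in check_successful_connection_and_raise_error_on_fail and the exact rclone invocation may differ):

import subprocess

def check_connection_by_temp_file(central_path: str) -> None:
    # Try to create, then delete, a small file on the central storage.
    # If creation fails, we do not have a working (writable) connection.
    temp_file = f"{central_path}/temp.txt"
    created = subprocess.run(
        ["rclone", "touch", temp_file], capture_output=True, text=True
    )
    if created.returncode != 0:
        raise RuntimeError(f"Connection check failed:\n{created.stderr}")
    subprocess.run(["rclone", "deletefile", temp_file], capture_output=True)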

Also, do you remember the # fix comment I had in setup_gdrive.py, in the part of the code that executes when the "cancel" button is pressed during the Google Drive connection setup? The thing is, if the cancel button is pressed once the final rclone connection setup (via browser) has started (in a worker), self.setup_worker.cancel() will surely cancel the worker, but the call_rclone function that invoked subprocess.run is still running. Its terminal equivalent is calling rclone config create ... and closing the browser tab once it opens: the rclone config create ... still hasn't exited, and you'd need to force-stop it with something like ctrl + c.

Long story short, on clicking "cancel", the worker is cancelled, but the subprocess it started (which in turn opened Google's OAuth page in a browser) is still running.

The problem is that, since the previous rclone command is still running, it is still using port 53682 for the OAuth setup. If the user tries to set up the connection again, it fails with rclone saying something like port 53682 is already in use, and the user would have to quit the TUI and run it again.

The easiest fix, I think, is to kill whatever process is listening on port 53682. I don't know how good an approach this is, since we're hardcoding the port here.

import psutil

def kill_process_on_port(port: int) -> None:
    # Find any process listening on `port` and kill it (e.g. a stale
    # rclone OAuth server left behind on port 53682).
    for conn in psutil.net_connections(kind='inet'):
        if conn.laddr.port == port and conn.status == psutil.CONN_LISTEN:
            pid = conn.pid
            if pid:
                try:
                    proc = psutil.Process(pid)
                    print(f"Killing PID {pid} ({proc.name()}) listening on port {port}")
                    proc.kill()
                except (psutil.NoSuchProcess, psutil.AccessDenied):
                    # The process exited already, or we lack permission.
                    pass

Another way would be to change the call_rclone function itself and maybe use subprocess.Popen, which gives us more control over managing the process. Let me know what you think.

@JoeZiminski (Member):

Hey @cs7-shrey thanks for this! In answer to your questions:

  1. I think this is a good approach; I was thinking to add something similar to the SSH setup as well down the line. The file created could have a long random string for a name and/or a made-up suffix, e.g. fgdgergs.datashuttleTest, to avoid clashes with any existing file. Otherwise I think that's a good approach.
  2. Hmm, that's a tricky one! I think it is easiest to use the Popen approach as you suggest, and kill the process directly. This will probably be a little easier to manage across different operating systems and machines, as the OS can handle its own cleanup at the port. The only thing I'm unsure about is how to get the worker to trigger the killing of the process when it is itself closed; do you have any ideas? In terms of calling with Popen, for now we can just create a function called call_rclone_with_popen that returns the Popen object (see the sketch after this list). I think it is the best solution; otherwise we could:
  • rewrite call_rclone to have an optional use_popen argument and return the process if that is True, but it makes the signature quite confusing just for this one edge case
  • not use a centralised function and just use Popen directly. I was tempted, but I think centralising is better, just so all the different ways rclone is called in the codebase are clear by looking in rclone.py. We can think about tidying this up later (as we will now have call_rclone, call_rclone_with_popen and call_rclone_from_script), based on future use cases that appear. What do you think?
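For concreteness, a minimal sketch of such a helper (the exact Popen options here are an assumption; the real implementation may differ):

import subprocess

def call_rclone_with_popen(command: str) -> subprocess.Popen:
    # Start rclone without blocking. The returned Popen handle lets the
    # caller (e.g. a TUI worker) kill the OS process if the user cancels.
    return subprocess.Popen(
        "rclone " + command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )

The worker can then store the returned object (e.g. on Interface) so the UI can call .kill() on it when the user cancels.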

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey this looks great, I just have a comment on the runtime type unpacking from str, which unfortunately doesn't work for Python < 3.11. Otherwise I think this is good to go; we could squash-merge into an aws_gdrive_dev branch and then make a new branch for tests?

I made a couple of changes, let me know what you think. The central_path had to be provided for AWS and Google Drive due to how it used to be set up, but I don't think this made sense in the case where the folder or bucket is the project folder. I set it so it can be None for these connection methods, and the TUI now reflects this. I changed the order of the boxes so the optional ones are together. These changes also led to an opportunity to further refactor the ConfigsContent.

One change made in the most recent commits is to not reload the widgets from configs when the connection-method radiobutton is changed. Now, the content of the widgets is held for the lifetime of the tab session (e.g. a user switching onto, then off, the tab). The reason for this is that it was reloading the Local Path from configs when switching the connection method, which was slightly confusing. A downside of this is that the Central Path content is carried over when switching connection method, which is also slightly confusing. However, on balance I think it is less confusing than the local path switching. Let me know what you think!

@cs7-shrey marked this pull request as ready for review on July 5, 2025
@cs7-shrey requested a review from JoeZiminski on July 5, 2025
@cs7-shrey (Collaborator, Author) commented Jul 5, 2025

Hey @JoeZiminski I introduced call_rclone_with_popen and made related refactors in the Google Drive setup. Also fixed syntax issues that were causing the tests to fail. Please take a look and let me know what you think. If this looks good, I'll go ahead and create another pull request to a new dev branch.

@JoeZiminski (Member) left a comment:

Hey @cs7-shrey, this looks great; some extremely minor suggestions and general discussion about threads / processes. Feel free to move on to the next stage once the minor suggestions are addressed, cheers for this!


if not self.gdrive_setup_process_killed:
    if process.returncode != 0:
        utils.log_and_raise_error(
Member:
I just realized that logging is not handled properly when setting up connections through the TUI (this was an oversight from the SSH implementation). At present, all logging is handled (i.e. set up and torn down) through the datashuttle class public functions. Therefore, if the TUI calls are not routed through these public functions, the logger is not set up. At this stage, I don't think any logger has been set up, and so the logging will not do anything, so we can use utils.raise_error(...). I'll make an issue to address this in a new PR! Luckily it's not critical, as the TUI gives direct feedback to the user if the setup fails, so logging is less important.


process = self.google_drive_rclone_setup_process

if process.poll() is not None:
Member:
I think poll() returns None if the process is still running; in this case we want to check poll() is None? (i.e. the process is still running, so we can kill it?)
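i.e. something like this sketch (assuming process is the stored subprocess.Popen):

# poll() returns None while the process is still running, so None means
# it is safe (and necessary) to kill it.
if process.poll() is None:
    process.kill()
    process.wait()  # reap the killed process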

event.button.id == "setup_gdrive_cancel_button"
or event.button.id == "setup_gdrive_finish_button"
):
if self.setup_worker and self.setup_worker.is_running:
Member:
Maybe a quick comment here # see setup_gdrive_connection_and_update_ui().

gdrive_client_secret, service_account_filepath
)
self.google_drive_rclone_setup_process = process
self.gdrive_setup_process_killed = False
Member:
This flag is to stop race conditions if the process is killed before the worker? That's great protection; I think the docstring can be expanded to explain this. In general, this docstring and the one that sets up the worker can have a lot of documentation explaining the reasoning behind this logic and how it works. It's a really cool setup, as it lets us do some powerful stuff, but working across processes and threads is unfortunately often buggy cross-platform and very difficult to debug, so any added explanations are a benefit.

gdrive_client_secret: Optional[str] = None,
service_account_filepath: Optional[str] = None,
) -> None:
"""Start the Google Drive connection setup in a separate thread.
Member:
Because the setup is quite cryptic (e.g. a thread -> a process, which is then cancelled from here), I think it is worth documenting the reasons for it and its logic in detail, maybe here and then more briefly in Interface.

If I understand correctly at present the logic is:

  1. UI thread spawns a worker in a new thread
  2. this worker thread spawns a new process that calls rclone to check the connection, and stores the process on Interface
  3. The Popen object Python returns is just a Python-side object that represents the OS process, so I don't think it matters that it is passed across thread boundaries (threads share memory). In the end, this process object is simply used to close the OS process, because the thread that owns it does not. If the process is closed before the worker thread, you have the protection flag in the Interface.

I think this is all good, but we can test thoroughly in the next stage cross-platform. This is really nice and powerful, and it's a great template for the future if we run into any similar issues that require process management. On the flip side, working across threads / processes can be buggy and adds complexity to the codebase, so if we run into issues during testing we may need to partially revert this and go a less user-friendly but more maintainable route. But for the time being I think this is very cool, and I can't see a reason at the moment why it wouldn't work cross-platform. It's not that complicated; with some extended documentation it should be quite manageable, and it provides a seamless experience for the user! Cheers for this cool addition, it was very fun to explore.
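To make the pattern concrete, a self-contained sketch of the logic above (names are hypothetical, and the Textual worker machinery is replaced by a plain thread; the real code stores the process on Interface):

import subprocess
import threading
from typing import Optional

class GdriveSetupSketch:
    def __init__(self) -> None:
        self.setup_process: Optional[subprocess.Popen] = None
        self.process_killed = False  # guards the kill-vs-finish race

    def worker(self) -> None:
        # (2) The worker thread spawns the rclone process and stores the
        # handle so another thread can terminate it.
        self.setup_process = subprocess.Popen(
            ["rclone", "version"]  # stand-in for the real setup command
        )
        self.setup_process.wait()
        # Only treat a non-zero return code as a failure if we did not
        # kill the process ourselves.
        if not self.process_killed and self.setup_process.returncode != 0:
            raise RuntimeError("rclone setup failed")

    def start(self) -> threading.Thread:
        # (1) The UI thread spawns the worker in a new thread.
        thread = threading.Thread(target=self.worker, daemon=True)
        thread.start()
        return thread

    def cancel(self) -> None:
        # (3) On cancel, the UI thread kills the OS process via the shared
        # Popen handle; the flag stops the worker treating the resulting
        # non-zero return code as a genuine failure.
        self.process_killed = True
        if self.setup_process is not None and self.setup_process.poll() is None:
            self.setup_process.kill()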

def call_rclone_with_popen(command: str) -> subprocess.Popen:
"""Call rclone using `subprocess.Popen` for control over process termination.

It is not possible to kill a process while running it using `subprocess.run`.
Member:
Maybe add a quick addition here, under what circumstances we may need to kill it (i.e. this is only relevant for running rclone commands from a TUI worker, where the worker will not clean up the process)

is done to have more control over the setup process in case the user wishes to
cancel the setup. Since the rclone setup for google drive uses a local web server
for authentication to google drive, the running process must be killed before the
setup can be started again.
Member:
excellent
