Add Google Drive and AWS S3 as a Remote Storage option #503
Conversation
Hey @adamltyson @JoeZiminski, I know reviewing this PR wouldn't be feasible at the moment. But it would be great if you could try it out locally whenever you have time and let me know if it works correctly / any suggestions you'd have. No hurry at all, I understand if it takes a bit longer.
Hey @cs7-shrey thanks a lot for this, this is excellent. It works really well and the code is structured neatly and clearly. Please see comments for some suggestions / questions / general comments.
The only thought I had on the user-end was whether we can let them provide a specific Google Drive folder (rather than giving access to all of their Drive folders). This might be a little more secure and avoid us accidentally interfering with other folders on their drive. The user could pass the Google Drive folder ID and this could be used for connection. We will have to make it clear that `central_path` now means relative to the root of the folder. It would also be nice to do this for AWS, but I think it is less simple and not in keeping with the way that AWS buckets work (but if it is possible, that would be great).
Please let me know if you have any questions!
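
For what it's worth, rclone's `drive` backend documents a `root_folder_id` option that scopes a remote to a single folder, so the folder-ID idea should be feasible on the rclone side. Below is a minimal sketch, assuming a hypothetical `gdrive_root_folder_id` config key (the key name and helper function are illustrative, not part of this PR):

```python
# Minimal sketch: thread a user-supplied Drive folder ID through to rclone.
# "gdrive_root_folder_id" is a hypothetical config key; rclone's drive
# backend accepts root_folder_id to scope the remote to one folder, making
# central_path relative to that folder's root.
def build_gdrive_config_command(rclone_config_name: str, cfg: dict) -> str:
    command = f"config create {rclone_config_name} drive"
    root_folder_id = cfg.get("gdrive_root_folder_id")  # hypothetical key
    if root_folder_id is not None:
        command += f" root_folder_id={root_folder_id}"
    return command
```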
```diff
@@ -39,9 +40,16 @@ def get_canonical_configs() -> dict:
     canonical_configs = {
         "local_path": Union[str, Path],
         "central_path": Optional[Union[str, Path]],
-        "connection_method": Optional[Literal["ssh", "local_filesystem"]],
+        "connection_method": Optional[
+            Literal["ssh", "local_filesystem", "gdrive", "aws_s3"]
```
I think "aws"
as a name will be sufficient (?) as S3
will cover nearly all cases we will encounter (if we add other aws types, we can extend the name for those types only)
"central_host_id": Optional[str], | ||
"central_host_username": Optional[str], | ||
"gdrive_client_id": Optional[str], |
This is good and extends the existing approach, similar for the checks. Down the line we may want to refactor this; we can create an issue to think about it (maybe there is a way to simplify it as a user-facing API). For now, however, let's proceed with this structure as it is simple, and refactoring might mean changing a lot of internal things related to configs.
```diff
@@ -128,6 +136,32 @@ def check_dict_values_raise_on_fail(config_dict: Configs) -> None:
             ConfigError,
         )
 
+    # Check gdrive settings
+    elif config_dict["connection_method"] == "gdrive" and (
```
In this case, is it possible for `connection_method` to be set to `"gdrive"` but supply neither `gdrive_client_id` nor `gdrive_client_secret`?
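
If supplying neither is meant to be valid (e.g. falling back to rclone's default client), one way to make that explicit would be to treat the two credentials as a pair, along these lines (a sketch of one option, not the PR's actual check; the function name is illustrative):

```python
# Sketch: allow both gdrive credentials or neither, but reject a lone one.
# ConfigError is the exception type already used in the diff above.
def check_gdrive_credentials(config_dict: dict) -> None:
    has_id = config_dict.get("gdrive_client_id") is not None
    has_secret = config_dict.get("gdrive_client_secret") is not None
    if has_id != has_secret:
        raise ConfigError(
            "gdrive_client_id and gdrive_client_secret must be "
            "provided together, or both left unset."
        )
```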
```python
    US_GOV_WEST_1 = "us-gov-west-1"

    @classmethod
    def get_all_regions(cls):
```
This is cool and a nice use of a class and classmethod; it is cleaner than a dictionary. I think however there is a subtlety here that means it would be preferable to define this as a dictionary and return it from a function (there can be two functions, `get_canonical_regions()` and `get_all_region_names()`, where the latter is just `return list(get_canonical_regions().keys())`). One of the reasons is that the getter function is slightly cleaner as it does not have to work around `__`-prefixed attributes of the class, but this is negligible.

There is a more subtle issue around scope: for this class the scope becomes global across the entire application, meaning that if one of the regions is accidentally changed somewhere (not that it ever should be) it will propagate across the application. For example, if you create a file called `mod1.py`:
```python
class myMod:
    attr = 1
```

and in a second file `mod2.py`:

```python
from mod1 import myMod
import mod1

myMod.attr = 2
print(myMod.attr)
print(mod1.myMod.attr)
# prints 2, 2
```
you will see that the attribute value is changed to `2` in both modules. If we return a dict from a function, the returned value is instantiated at call time and so will not be global, i.e. in `mod1.py`:

```python
def get_my_mod():
    myMod = {"attr": 1}
    return myMod
```

and in `mod2.py`:

```python
import mod1
from mod1 import get_my_mod

myMod1 = get_my_mod()
myMod1["attr"] = 2
print(myMod1["attr"])
print(mod1.get_my_mod()["attr"])
# prints 2, 1
```
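
Applied to the regions module, the suggested pattern would look something like this (a sketch using the function names proposed above; only one region is shown):

```python
# Sketch of the dictionary-from-function pattern suggested above.
def get_canonical_regions() -> dict:
    return {
        "US_GOV_WEST_1": "us-gov-west-1",
        # ... the remaining AWS regions in the same form ...
    }


def get_all_region_names() -> list:
    return list(get_canonical_regions().keys())
```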
```diff
@@ -0,0 +1,35 @@
+class AWS_REGION:
```
I think this module can be called `aws_regions.py` just for added clarity.
```python
from datashuttle.utils import rclone, utils


def preliminary_for_setup_without_browser(
```
This is really great, giving the option to connect without a browser.
```python
def preliminary_for_setup_without_browser(
    cfg: Configs, rclone_config_name: str, log: bool = True
):
    client_id_key_value = (
```
Same note here as for rclone: let's handle this at a high level and assume that these keys are at least `None` in the config.
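
Concretely the suggestion amounts to something like the following, where the config checks guarantee the key is always present (possibly `None`), so the code can index directly rather than using `.get()` (the string being built here is illustrative):

```python
# Sketch: with the key guaranteed to exist in the config, index directly.
client_id_key_value = (
    f"client_id {cfg['gdrive_client_id']} "
    if cfg["gdrive_client_id"] is not None
    else ""
)
```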
```python
        if cfg.get("gdrive_client_secret", None)
        else ""
    )
    output = rclone.call_rclone(
```
Some of these arguments to the rclone call are a little mysterious; a short comment above would be useful just to explain to readers what is going on. Of course, it's always possible to check the rclone docs, but in general it is more efficient to leave a short comment giving the pertinent details.
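
For example, something along these lines (the argument values here are illustrative, not the PR's actual call):

```python
# "config create" registers a named rclone remote; "drive" selects the
# Google Drive backend; any remaining key=value pairs set backend options.
output = rclone.call_rclone(
    f"config create {rclone_config_name} drive {client_id_key_value}",
    pipe_std=True,
)
```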
```python
        pipe_std=True,
    )

    # TODO: make this more robust
```
TODO here
```python
        answer = False

    if log:
        utils.log(message)
```
Can the answer also be logged here?
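
For example (the message formatting is illustrative):

```python
if log:
    utils.log(message)
    utils.log(f"Connection successful: {answer}")  # also log the answer
```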
Description
This pull request is a first version of Google Drive and AWS S3 integration in the datashuttle API and TUI.
What is this PR
Why is this PR needed?
What does this PR do?
This PR is to discuss the approach for implementing Google Drive and AWS S3 as a remote storage option.
References
Issue #407
How has this PR been tested?
This PR was manually tested via datashuttle's Python API and TUI. Comprehensive tests are yet to be written.
Is this a breaking change?
No
Does this PR require an update to the documentation?
Yes
Checklist: