feat: top-level dlt.dataset() #2654
Conversation
I have a few notes on this:
- To instantiate a dataset you must provide a destination name (or factory) and a dataset name; the schema is optional, and we will pull it from the destination (by name, or the most recent one). The interface is pretty convenient for what it does and we can promote it to the top level (see the sketch after these notes).
- So I still ponder whether we should change the dataset interface, remove the schema from it, and always take all of the newest dlt schemas from the physical dataset to create a union of tables. FYI:
- We should promote many-to-one or one-to-one (many pipelines can write to a single location, but a specific pipeline writes to a specific location, and that never changes).
- A pipeline contains many schemas, so that is right. As mentioned above, 99% of our examples are one-to-one. This is true for
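A minimal sketch of the instantiation described in the first note, assuming the promoted top-level call keeps the destination (name or factory), dataset name, and optional schema parameters; the destination and dataset names here are illustrative:

```python
import dlt

# Destination given by name; schema omitted, so it is pulled from the
# destination (by name or the most recent one). Names are illustrative.
ds = dlt.dataset(destination="duckdb", dataset_name="my_dataset")

# A destination factory works in place of the name:
ds = dlt.dataset(
    destination=dlt.destinations.duckdb("my_pipeline.duckdb"),
    dataset_name="my_dataset",
)
```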
Ok, that's all useful context! You're right that adding user-facing APIs is a big deal. IMO, the
Purely on the topic of a top-level dataset, what I would simply do is:
Possible additions:
This adds the ability to load datasets via `dlt.dataset(dataset_name=...)`.
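For example (a hedged sketch of the proposed call; the dataset and table names are made up):

```python
import dlt

# Look up a locally known dataset purely by its name, without a pipeline object.
dataset = dlt.dataset(dataset_name="my_dataset")

# Datasets stay queryable as before, e.g. materializing a table as a DataFrame:
df = dataset["my_table"].df()
```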
Description
Previously, datasets were primarily accessed through a `dlt.Pipeline` instance. This is a bit odd for users who are only interested in the dataset. Also, the mapping between pipelines and datasets is not obvious and adds cognitive load.
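For contrast, a sketch of the pipeline-mediated access path (pipeline, destination, and table names are illustrative):

```python
import dlt

# Even for read-only access, the dataset is only reachable through a pipeline:
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_dataset",
)
dataset = pipeline.dataset()
df = dataset["my_table"].df()
```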
Some notes (AFAIK):
- Calling `dlt.pipeline(pipeline_name=..., dataset_name=...)` without a `dataset_name` raises a warning and falls back to the default `{pipeline_name}_dataset`.
- `dataset = pipeline.dataset()` suggests a one-to-one mapping, but it actually takes the `dataset_name` from the currently instantiated `dlt.Pipeline` (see the sketch after these notes).
- `pipeline name <-> dataset name` is many-to-many.
- `pipeline name -> schema name` is one-to-many.
- `(pipeline name, schema name) -> dataset name` is one-to-one.
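A small sketch of the mapping issue raised in these notes (all names illustrative): several pipelines can point at the same physical dataset, so `pipeline.dataset()` is a lookup via the pipeline's own `dataset_name`, not a one-to-one link.

```python
import dlt

# Two pipelines writing to one location: many-to-one, not one-to-one.
p1 = dlt.pipeline(pipeline_name="ingest_a", destination="duckdb", dataset_name="shared_dataset")
p2 = dlt.pipeline(pipeline_name="ingest_b", destination="duckdb", dataset_name="shared_dataset")

# Each .dataset() call resolves through the pipeline's configured dataset_name,
# so both address the same physical dataset.
assert p1.dataset_name == p2.dataset_name == "shared_dataset"
```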
Proposed solution
Automatically discover local datasets using `pipelines_dir` (default: `~/.dlt/pipelines`).

How it works:
1. Scan `~/.dlt/pipelines` for locally known pipelines.
2. Read each pipeline's `state.json` to retrieve the `dataset_name`, destination type, and the list of `schema_names`.
3. Load the stored schemas from `schemas/{schema_name}.json`.
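A hypothetical sketch of that discovery flow; the exact keys inside `state.json` (here `dataset_name` and `destination_type`) are assumptions based on the steps above, and `discover_local_datasets` is a made-up helper name:

```python
import json
from pathlib import Path

def discover_local_datasets(pipelines_dir: Path = Path.home() / ".dlt" / "pipelines") -> dict:
    """Map each locally known pipeline to its dataset metadata (hypothetical helper)."""
    found = {}
    if not pipelines_dir.is_dir():
        return found
    for pipeline_dir in pipelines_dir.iterdir():
        state_file = pipeline_dir / "state.json"
        if not state_file.is_file():
            continue
        state = json.loads(state_file.read_text())
        found[pipeline_dir.name] = {
            # state.json only keeps the latest loaded dataset_name (see Limitations)
            "dataset_name": state.get("dataset_name"),
            "destination_type": state.get("destination_type"),
            # stored schemas live next to the state as schemas/{schema_name}.json
            "schema_names": [p.stem for p in (pipeline_dir / "schemas").glob("*.json")],
        }
    return found
```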
Limitations
- Credentials are not discovered: a pipeline that ran from `/path/to/foo` with credentials in `/path/to/foo/.dlt/` can't be accessed as a dataset while working from `/path/to/bar`.
- `state.json` seems to only store the latest loaded `dataset_name`. Couldn't figure out how to retrieve multiple dataset names for a given pipeline.

In short, overcoming these limitations requires dlt+ mechanisms. However, this interface could still be useful for devs.