
feat: top-level dlt.dataset() #2654


Draft · wants to merge 2 commits into devel

Conversation

@zilto (Collaborator) commented May 16, 2025

This adds the ability to load datasets via dlt.dataset(dataset_name=...).

This is a WIP and the PR is meant to ground the discussion.

Description

Previously, datasets were primarily accessed through a dlt.Pipeline instance:

import dlt

pipeline = dlt.pipeline(...)
# or for existing pipelines 
pipeline = dlt.attach(pipeline_name=...)

dataset = pipeline.dataset()

This is a bit odd for users who are only interested in the dataset. Also, the mapping between pipelines and datasets is not obvious and adds cognitive load.
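For comparison, a minimal sketch of what the proposed top-level access could look like (only the dataset_name argument is confirmed by this description; the value is illustrative):

import dlt

# proposed: access the dataset directly, without constructing a pipeline first
dataset = dlt.dataset(dataset_name="my_dataset")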

Some notes (AFAIK):

  • Setting the same value for pipeline_name and dataset_name in dlt.pipeline(pipeline_name=..., dataset_name=...) raises a warning
  • By default the dataset takes the name {pipeline_name}_dataset
  • The instantiation dataset = pipeline.dataset() suggests a one-to-one mapping, but it actually takes the dataset_name from the currently instantiated dlt.Pipeline.
  • the relationship pipeline name <-> dataset name is many-to-many.
  • the relationship pipeline name -> schema name is one-to-many.
  • the relationship (pipeline name, schema name) - dataset name is one-to-one.

Proposed solution

Automatically discover local datasets using pipelines_dir (default: ~/.dlt/pipelines).

How it works:

  • Look at the pipeline folders in ~/.dlt/pipelines
  • For each pipeline, load state.json to retrieve the dataset_name, destination type, and the list of schema names
  • For each schema name, load schemas/{schema_name}.json
  • Instantiate the dataset from (dataset_name, destination type, schema)
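A rough sketch of this discovery loop in plain Python; the state.json key names and the exact file layout are assumptions based on the steps above, not confirmed dlt internals:

import json
from pathlib import Path

PIPELINES_DIR = Path.home() / ".dlt" / "pipelines"

def discover_local_datasets(pipelines_dir: Path = PIPELINES_DIR):
    """Yield (pipeline_name, dataset_name, destination_type, schemas) for each local pipeline."""
    for pipeline_dir in pipelines_dir.iterdir():
        state_file = pipeline_dir / "state.json"
        if not state_file.is_file():
            continue
        state = json.loads(state_file.read_text())
        # key names are assumptions; adjust to the actual state.json layout
        dataset_name = state.get("dataset_name")
        destination_type = state.get("destination_type")
        schemas = {
            schema_file.stem: json.loads(schema_file.read_text())
            for schema_file in (pipeline_dir / "schemas").glob("*.json")
        }
        yield pipeline_dir.name, dataset_name, destination_type, schemas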

Limitations

  • You might not have the credentials to access the dataset. This is common if you ran the pipeline in /path/to/foo with credentials in /path/to/foo/.dlt/ and are now trying to access the dataset from /path/to/bar
  • The state.json seems to store only the most recently loaded dataset_name; I couldn't figure out how to retrieve multiple dataset names for a given pipeline
  • You can't discover and access datasets that you didn't load yourself

In short, overcoming these limitations requires dlt+ mechanisms. However, this interface could still be useful for developers.


@rudolfix (Collaborator) commented May 20, 2025

I have a few notes on this:

  1. Our dataset implementation can already be instantiated without a pipeline. It is just hidden because exposing such an interface is a big decision (it has been like that for 6 months already, and this is about to change). Here's the interface:
from dlt.destinations.dataset.factory import dataset

To instantiate a dataset you must provide a destination name (or factory) and a dataset name. The schema is optional; we will pull it from the destination (by name, or the most recent one).

The interface is pretty convenient for what it does, and we can promote it to the top level.
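A rough usage sketch based on this description (the destination and dataset names are illustrative, and the exact argument spelling is an assumption):

from dlt.destinations.dataset.factory import dataset

# destination name (or factory) and dataset name are required; schema is optional
# and is pulled from the destination if omitted
ds = dataset("duckdb", dataset_name="my_dataset")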

  2. The biggest dilemma in the interface above: dataset() is a materialized dlt schema, and the location where it is materialized (i.e. a database schema) may host several dlt schemas. This is not immediately obvious when you create a pipeline: dataset_name indicates the physical location (which is true by default; in dlt+ we basically assume that, and 99% of our examples are single schemas, except a few multi-source verified sources like zendesk).

So I still ponder whether we should change the dataset interface, remove schema from it, and always take all the newest dlt schemas from the physical dataset to create a union of tables.

FYI: dlt started with a strict dlt schema == dataset_name policy, and for each schema (other than the default one) a separate physical location was created. This mode is still there, but we do not show it in our docs; our users could not understand where those additional datasets were coming from, so we switched to the current mode.

  3. One more note: all relations in dlt are many-to-many ;> and we should promote certain constraints as good practice.

the relationship pipeline name <-> dataset name is many-to-many.

We should promote many-to-one or one-to-one (many pipelines can write to a single location, but a specific pipeline writes to a specific location, and that never changes).

the relationship pipeline name -> schema name is one-to-many.

A pipeline contains many schemas, so that is right. As mentioned above, 99% of our examples are one-to-one.
There are some examples where a single source (thus a dlt schema) is loaded by many pipelines; then we have many-to-many...

the relationship (pipeline name, schema name) - dataset name is one-to-one.

This is true for dataset() (our object), not a dataset name, thanks to the pipeline being bound to a destination. You can translate that to:
(destination, dataset name, schema name) - dataset is 1:1

  4. To me your PR is a call for a catalog where we centrally store all schemas and dataset locations :) It sounds like an Iceberg catalog; it is one of the "big things" on our roadmap for this year. Your code IMO will be very useful if we implement a dlt dataset CLI - in that case we can inspect any dataset from known pipelines and also list them.

@zilto (Collaborator, Author) commented May 20, 2025

Ok, that's all useful context! You're right that adding user-facing APIs is a big deal. IMO, the dlt.dataset() entity helps understand ELT / ETL: it's typically data --pipeline--> dataset --transform--> dataset. Intuitively, working on the transform should be independent of the pipeline.

@sh-rp (Collaborator) commented May 26, 2025

Purely on the topic of a top-level dataset, what I would simply do is:

  1. Release the interface as it is now
  2. Make it so that we get a union of all present schemas if no schema or schema name is given. @molkazhani2001 was already confused once when using the Streamlit app to check on transformed tables in the same dataset, and she was not able to see both the original tables and the transformed tables in the same viewer.

Possible additions:

  1. Allow a pipeline to take a dataset instance as a destination. The pipeline would take the destination, the dataset name, and, if present, the schema name from the dataset instance. I am not 100% sure about this interface though; maybe the schema name should always come from the source. (A hypothetical sketch follows this list.)
  2. Have a method to discover all dlt datasets present on a given destination. This would be useful when connecting our app to a destination to see what is going on there. Right now this always happens in a pipeline context, which does not always make sense, as @zilto points out.
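A purely hypothetical sketch of addition 1; none of this exists yet, and both the top-level dataset factory and the dataset-as-destination behaviour are only proposals discussed in this thread:

import dlt

# hypothetical: the dataset instance would supply the destination, the dataset
# name and, if present, the schema name for the pipeline
ds = dlt.dataset(destination="duckdb", dataset_name="my_dataset")
pipeline = dlt.pipeline(pipeline_name="transform_pipeline", destination=ds)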
