
feat: top-level dlt.dataset() #2654


Draft · wants to merge 2 commits into devel

Conversation

@zilto (Collaborator) commented May 16, 2025

This adds the ability to load datasets via dlt.dataset(dataset_name=...).

This is a WIP and the PR is meant to ground the discussion.

Description

Previously, datasets were primarily accessed through a dlt.Pipeline instance:

import dlt

pipeline = dlt.pipeline(...)
# or for existing pipelines 
pipeline = dlt.attach(pipeline_name=...)

dataset = pipeline.dataset()

This is a bit odd for users who are only interested in the dataset. Also, the mapping between pipelines and datasets is not obvious and adds cognitive load.
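For comparison, a minimal sketch of what the proposed top-level access could look like (only the dataset_name argument is confirmed by this description; the value is illustrative):

import dlt

# proposed: access the dataset directly, without constructing a pipeline first
dataset = dlt.dataset(dataset_name="my_dataset")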

Some notes (AFAIK):

  • Setting the same value for pipeline_name and dataset_name in dlt.pipeline(pipeline_name=..., dataset_name=...) raises a warning
  • By default the dataset takes the name {pipeline_name}_dataset
  • The instantiation dataset = pipeline.dataset() suggests a one-to-one mapping, but it actually takes the dataset_name from the currently instantiated dlt.Pipeline.
  • the relationship pipeline name <-> dataset name is many-to-many.
  • the relationship pipeline name -> schema name is one-to-many.
  • the relationship (pipeline name, schema name) - dataset name is one-to-one.

Proposed solution

Automatically discover local datasets using pipelines_dir (default: ~/.dlt/pipelines).

How it works:

  • Look at the pipeline folders in ~/.dlt/pipelines
  • For each pipeline, load state.json to retrieve the dataset_name, destination type, and the list of schema names
  • For each schema name, load schemas/{schema_name}.json
  • Instantiate the dataset from (dataset_name, destination type, schema)
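A rough sketch of this discovery loop in plain Python; the state.json key names and the exact file layout are assumptions based on the steps above, not confirmed dlt internals:

import json
from pathlib import Path

PIPELINES_DIR = Path.home() / ".dlt" / "pipelines"

def discover_local_datasets(pipelines_dir: Path = PIPELINES_DIR):
    """Yield (pipeline_name, dataset_name, destination_type, schemas) for each local pipeline."""
    for pipeline_dir in pipelines_dir.iterdir():
        state_file = pipeline_dir / "state.json"
        if not state_file.is_file():
            continue
        state = json.loads(state_file.read_text())
        # key names are assumptions; adjust to the actual state.json layout
        dataset_name = state.get("dataset_name")
        destination_type = state.get("destination_type")
        schemas = {
            schema_file.stem: json.loads(schema_file.read_text())
            for schema_file in (pipeline_dir / "schemas").glob("*.json")
        }
        yield pipeline_dir.name, dataset_name, destination_type, schemas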

Limitations

  • You might not have the credentials to access the dataset. This is common if you ran the pipeline in /path/to/foo with credentials in /path/to/foo/.dlt/ and are now trying to access the dataset from /path/to/bar
  • The state.json seems to store only the most recently loaded dataset_name; I couldn't figure out how to retrieve multiple dataset names for a given pipeline
  • You can't discover and access datasets that you didn't load yourself

In short, overcoming these limitations requires dlt+ mechanisms. However, this interface could still be useful for developers.


@rudolfix (Collaborator) commented May 20, 2025

I have a few notes on this:

  1. Our dataset implementation can already be instantiated without a pipeline. It is just hidden because exposing such an interface is a big decision (it has been like that for 6 months already, and this is about to change). Here's the interface:
from dlt.destinations.dataset.factory import dataset

To instantiate a dataset you must provide a destination name (or factory) and a dataset name. The schema is optional; we will pull it from the destination (by name, or the most recent one).

The interface is pretty convenient for what it does, and we can promote it to the top level.
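A rough usage sketch based on this description (the destination and dataset names are illustrative, and the exact argument spelling is an assumption):

from dlt.destinations.dataset.factory import dataset

# destination name (or factory) and dataset name are required; schema is optional
# and is pulled from the destination if omitted
ds = dataset("duckdb", dataset_name="my_dataset")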

  2. The biggest dilemma in the interface above: dataset() is a materialized dlt schema, and the location where it is materialized (i.e. a database schema) may host several dlt schemas. This is not immediately obvious when you create a pipeline: dataset_name indicates the physical location (which is true by default; in dlt+ we basically assume that, and 99% of our examples are single schemas, except a few multi-source verified sources like zendesk).

So I still ponder whether we should change the dataset interface, remove schema from it, and always take all the newest dlt schemas from the physical dataset to create a union of tables.

FYI: dlt started with a strict dlt schema == dataset_name policy, and for each schema (other than the default one) a separate physical location was created. This mode is still there, but we do not show it in our docs; our users could not understand where those additional datasets were coming from, so we switched to the current mode.

  3. One more note: all relations in dlt are many-to-many ;> and we should promote certain constraints as good practice.

the relationship pipeline name <-> dataset name is many-to-many.

We should promote many-to-one or one-to-one (many pipelines can write to a single location, but a specific pipeline writes to a specific location, and that never changes).

the relationship pipeline name -> schema name is one-to-many.

A pipeline contains many schemas, so that is right. As mentioned above, 99% of our examples are one-to-one.
There are some examples where a single source (thus a dlt schema) is loaded by many pipelines; then we have many-to-many...

the relationship (pipeline name, schema name) - dataset name is one-to-one.

This is true for dataset() (our object), not a dataset name, thanks to the pipeline being bound to a destination. You can translate that to:
(destination, dataset name, schema name) - dataset is 1:1

  4. To me your PR is a call for a catalog where we centrally store all schemas and dataset locations :) It sounds like an Iceberg catalog; it is one of the "big things" on our roadmap for this year. Your code IMO will be very useful if we implement a dlt dataset CLI - in that case we can inspect any dataset from known pipelines and also list them.

@zilto (Collaborator, Author) commented May 20, 2025

Ok, that's all useful context! You're right that adding user-facing APIs is a big deal. IMO, the dlt.dataset() entity helps understand ELT / ETL: it's typically data --pipeline--> dataset --transform--> dataset. Intuitively, working on the transform should be independent of the pipeline.

@sh-rp (Collaborator) commented May 26, 2025

Purely on the topic of a top-level dataset, what I would simply do is:

  1. Release the interface as it is now
  2. Make it so that we get a union of all present schemas if no schema or schema name is given. @molkazhani2001 was already confused once when using the Streamlit app to check on transformed tables in the same dataset, and she was not able to see both the original tables and the transformed tables in the same viewer.

Possible additions:

  1. Allow a pipeline to take a dataset instance as a destination. The pipeline would take the destination, the dataset name, and, if present, the schema name from the dataset instance. I am not 100% sure about this interface though; maybe the schema name should always come from the source. (A hypothetical sketch follows this list.)
  2. Have a method to discover all dlt datasets present on a given destination. This would be useful when connecting our app to a destination to see what is going on there. Right now this always happens in a pipeline context, which does not always make sense, as @zilto points out.
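A purely hypothetical sketch of addition 1; none of this exists yet, and both the top-level dataset factory and the dataset-as-destination behaviour are only proposals discussed in this thread:

import dlt

# hypothetical: the dataset instance would supply the destination, the dataset
# name and, if present, the schema name for the pipeline
ds = dlt.dataset(destination="duckdb", dataset_name="my_dataset")
pipeline = dlt.pipeline(pipeline_name="transform_pipeline", destination=ds)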
