
Conversation

@zilto zilto (Collaborator) commented Sep 16, 2025

Users of dlt.Dataset want a simple way to write data back to the dataset.

Use cases:

  • manually review data and push corrected records
  • add records without access to the original dlt.Pipeline used to create the dataset

Other motivations

This interface will simplify data-centric operations involved in:

  • storing data quality check results on the destination
  • creating a graph of datasets where the dataset's "internal pipeline" is used
  • integrating with orchestration frameworks

Specs

  • Look at WritableDataset.save() from dlt-plus
  • Add Dataset.write() in dlt (this aligns with the pipeline.run() operation)
    • Alternatives: .write_to(), .load_into(), .load_table()
  • create an internal dlt.Pipeline named _dlt_dataset_{dataset_name}
  • find a way for the internal pipeline to use the dlt.Schema from the dlt.Dataset instance, so that this schema evolves when Dataset.write() is used
  • potential API
    def write(
      self: dlt.Dataset,
      data: TDataItems,
      *,
      table_name: str,
      write_disposition: TWriteDisposition = "append",
      normalize: bool = False,
    ) -> LoadInfo: ...
    • write_disposition determines whether we append to or modify existing records
    • normalize lets the user opt into normalization (which might create additional tables)
  • can accept a dlt.Relation as input (see the usage sketch after this list)
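
To make the proposed interface concrete, here is a minimal usage sketch. Dataset.write() is the method proposed above and does not exist in dlt today; the pipeline, table name, and records are illustrative assumptions.

import dlt

# An existing pipeline gives access to its dataset (existing dlt API).
pipeline = dlt.pipeline("reviews", destination="duckdb", dataset_name="reviews_data")
dataset = pipeline.dataset()

# Push a few manually corrected records into an existing table using the
# proposed Dataset.write(); names and values are made up for illustration.
corrected = [
    {"record_id": 7, "status": "approved"},
    {"record_id": 9, "status": "rejected"},
]
load_info = dataset.write(
    corrected,
    table_name="review_results",
    write_disposition="append",  # default; "replace" or "merge" would modify existing data
    normalize=False,             # data must already fit the table layout
)
print(load_info)

The returned LoadInfo mirrors what pipeline.run() returns, as in the signature above.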

Out of scope

  • Dataset.write() doesn't have to support the dlt.Pipeline.run() method 1-to-1; if a user needs the full range of configuration, they should create a pipeline

netlify bot commented Sep 16, 2025

Deploy Preview for dlt-hub-docs ready!

  • Latest commit: 2c08e88
  • Latest deploy log: https://app.netlify.com/projects/dlt-hub-docs/deploys/68cdfa69f43aca0008136715
  • Deploy Preview: https://deploy-preview-3092--dlt-hub-docs.netlify.app

@zilto zilto requested review from rudolfix and sh-rp September 16, 2025 21:02
@zilto zilto self-assigned this Sep 16, 2025
@zilto zilto added the enhancement New feature or request label Sep 16, 2025
@rudolfix rudolfix (Collaborator) left a comment

My take on the interface.

  1. write is OK!
  2. make it a clearly specialized method for writing a data chunk: the current simple interface is good.
  3. allowing Relation as data is a good idea, so there's a certain symmetry (querying works with relations). In that case you can make table_name optional (a relation has a table name AFAIK?)

Implementation details:
my take would be to make the internal pipeline used in write as invisible as possible (a rough sketch follows this comment):

  1. disable destination sync, state sync and schema evolution (a total freeze on the table via a schema contract)
  2. possibly use pipelines-dir to hide it from the command line and the dashboard.

In essence, we pretend that this pipeline does not exist.

WDYT?

A helper class to get the write pipeline is cool. I'd make it a public helper method. Another helper method would be to convert a Dataset into a Source, but that's another problem.

I'd keep the internal pipeline that just loads data super simple.
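
A rough sketch of what such an invisible internal pipeline could look like, assuming a throwaway pipelines_dir and a schema contract freeze. The helper name _write_via_hidden_pipeline and passing the dataset's schema into run() are assumptions for illustration, not the agreed implementation.

import tempfile

import dlt
from dlt.common.schema import Schema
from dlt.common.typing import TDataItems


def _write_via_hidden_pipeline(
    dataset_name: str,
    destination: str,
    schema: Schema,
    data: TDataItems,
    table_name: str,
    write_disposition: str = "append",
):
    # Keep the helper pipeline out of the CLI and dashboard by storing its
    # working state in a temporary pipelines_dir.
    with tempfile.TemporaryDirectory() as tmp_dir:
        pipeline = dlt.pipeline(
            pipeline_name=f"_dlt_dataset_{dataset_name}",
            destination=destination,
            dataset_name=dataset_name,
            pipelines_dir=tmp_dir,
        )
        # "freeze" forbids new tables, new columns and variant columns,
        # so the write cannot evolve the dataset's schema.
        return pipeline.run(
            data,
            table_name=table_name,
            write_disposition=write_disposition,
            schema=schema,
            schema_contract="freeze",
        )

Destination sync and state sync would still need separate handling; the contract above only freezes schema evolution.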

cloudflare-workers-and-pages bot commented Sep 20, 2025

Deploying with Cloudflare Workers

  • Status: ❌ Deployment failed (docs, commit 2c08e88, updated Sep 20 2025, 12:55 AM UTC)

@zilto zilto (Collaborator, Author) commented Sep 20, 2025

> my take would be to make internal pipeline used in write as invisible as possible
> possibly use pipelines-dir to hide it from command line and dashboard.
> in essence we pretend that this pipeline does not exist

I changed the internal pipeline to be a context manager that uses a temporary directory as pipelines_dir

> disable destination sync, state sync and schema evolution (a total freeze on a table via contract)

I don't know exactly what I need to change / configure for destination and state sync (it doesn't seem to be in the kwargs for dlt.pipeline() and pipeline.run()).

For schema evolution, users should be able to modify the schema, for example to add a column or cast types. Still, I would make a frozen schema the default and require users to explicitly change it (see the sketch below).
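
For illustration, given a dataset obtained as in the earlier sketch, a frozen-by-default contract with explicit opt-in could look like this; the schema_contract parameter on write() is an assumption, not part of the current proposal.

items = [{"id": 2, "value": "bango", "note": "a brand new column"}]

# Default: frozen schema, so the unknown "note" column would be rejected.
dataset.write(items, table_name="bar")

# Hypothetical explicit opt-in to schema evolution.
dataset.write(items, table_name="bar", schema_contract="evolve")

With a freeze default, adding the "note" column would raise instead of silently creating it.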

table_name = "bar"
items = [{"id": 0, "value": "bingo"}, {"id": 1, "value": "bongo"}]

# TODO this is currently odd because the tables exist on the `Schema`
Review comment (Collaborator):

Schemas create the default tables on initialization. We could consider changing this, but that is the reason why they will always exist on any Schema instance regardless of whether anything was materialized.

Reply (@zilto, Author):

I think the current behavior is ok and I wouldn't change it. Just wanted to leave a note in the test because the assertion could be surprising.

Passing a `pipelines_dir` allows you to set a
"""
with tempfile.TemporaryDirectory() as tmp_dir:
Review comment (Collaborator):

I think we should not run this in a temporary directory but give the pipeline a predictable name and store it with the other pipeline metadata; this way the user can debug the run like any other pipeline run. This is up for debate though.

data: TDataItems,
*,
table_name: str,
write_disposition: TWriteDisposition = "append",
Review comment (Collaborator):

I think the columns and loader_file_format args from the run method would also be good candidates here. You can also consider a pipeline_kwargs argument that gets forwarded to the internal pipeline instantiation. But maybe we do not need this and can add it if requested.
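
A possible extended signature folding in these suggestions; everything beyond the original proposal (columns, loader_file_format, pipeline_kwargs) is a candidate under discussion, not a settled API. The dlt type names mirror those used in the proposal above.

from __future__ import annotations  # keep the dlt type names as forward references

from typing import Any, Optional


def write(
    self,
    data: TDataItems,
    *,
    table_name: str,
    write_disposition: TWriteDisposition = "append",
    columns: Optional[TAnySchemaColumns] = None,             # as in Pipeline.run()
    loader_file_format: Optional[TLoaderFileFormat] = None,  # as in Pipeline.run()
    normalize: bool = False,
    pipeline_kwargs: Optional[dict[str, Any]] = None,        # forwarded to the internal dlt.pipeline()
) -> LoadInfo: ...

pipeline_kwargs is the most speculative of these and could be left out until requested, as noted above.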

del state["data_item_normalizer"]
return state

def __eq__(self, other: Any) -> bool:
Review comment (Collaborator):
Cool!

@@ -0,0 +1,93 @@
import pathlib
Review comment (Collaborator):
We need tests for writing into tables that already exist, and for reading back from those tables with our database reader methods to see whether the schema was updated properly and works.

We should also make sure that all the args provided to the write method are forwarded properly.
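
A sketch of such a test, assuming the proposed Dataset.write(), a duckdb destination, and pytest's tmp_path fixture; the pipeline name, table, and assertions are illustrative only.

import dlt


def test_write_into_existing_table_and_read_back(tmp_path):
    # Materialize a table through a regular pipeline run first.
    pipeline = dlt.pipeline(
        pipeline_name="dataset_write_test",
        destination="duckdb",
        dataset_name="ds",
        pipelines_dir=str(tmp_path),
    )
    pipeline.run([{"id": 0, "value": "bingo"}], table_name="bar")

    # Write an additional row through the proposed Dataset.write().
    dataset = pipeline.dataset()
    dataset.write([{"id": 1, "value": "bongo"}], table_name="bar")

    # Read back with the database reader methods and check the schema was reused.
    rows = dataset["bar"].fetchall()
    assert len(rows) == 2
    assert "value" in dataset.schema.tables["bar"]["columns"]

The same test could be parametrized over write_disposition and normalize to confirm the arguments are forwarded properly.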

Passing a `pipelines_dir` allows you to set a
"""
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline = _get_internal_pipeline(
Review comment (Collaborator):

You need to forward the staging destination here too, which probably requires knowing the staging destination on the dataset already. Alternatively, one would have to provide the staging_destination via the run_kwargs. For working within notebooks, where you often get the dataset from a pipeline instance, it seems to me it would be good to always have it set on the dataset when you get it from the pipeline.

@sh-rp sh-rp (Collaborator) left a comment

I have added a few thoughts to consider :)

