Merged
67 commits
de9e0de
added logic to skip applying chain if checkpoint exist
ilongin Sep 24, 2025
4f5f304
removed not needed code
ilongin Sep 24, 2025
a0ad1b8
adding prints
ilongin Sep 25, 2025
2aae374
removed prints
ilongin Sep 25, 2025
b1df328
added job manager
ilongin Sep 26, 2025
b9c2020
added tests for util function to get user code
ilongin Sep 26, 2025
d0a4129
using job manager and adding unit tests for it
ilongin Sep 29, 2025
f66dc53
refactor
ilongin Sep 29, 2025
0921aea
made reset checkpoints as default for now
ilongin Sep 30, 2025
ea34286
added job manager reset and refactoring test_checkpoints to use new j…
ilongin Oct 1, 2025
f19a055
refactoring, fixing tests
ilongin Oct 1, 2025
33a2bcd
adding job e2e tests
ilongin Oct 1, 2025
9988d8c
refactoring tests
ilongin Oct 1, 2025
fecdcae
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 1, 2025
4ea1169
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 2, 2025
b9d748a
fix
ilongin Oct 2, 2025
2f806ca
fixig tests
ilongin Oct 2, 2025
01d0711
fixing job manager tests for keyboard interruption
ilongin Oct 2, 2025
105b03b
fixing windows test
ilongin Oct 2, 2025
0efc106
added more elements to hash
ilongin Oct 2, 2025
5db18e5
fixing test
ilongin Oct 2, 2025
cbdcac2
removed JobManager and moved its logic to Session
ilongin Oct 3, 2025
dd2487d
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 3, 2025
b26b150
mergint with main
ilongin Oct 3, 2025
09af752
removed saving query string in new job locally
ilongin Oct 3, 2025
f38fc70
merged with main
ilongin Oct 3, 2025
99afaeb
moved reset_job_state to test from Session
ilongin Oct 3, 2025
b896d3c
moved get_last_job_by_name to sqlite metastore
ilongin Oct 3, 2025
7d27879
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 5, 2025
18aa7c9
moved job to Session class attributes to ensure same job per process,…
ilongin Oct 5, 2025
428e10d
making job name random in interactive runs
ilongin Oct 5, 2025
84eebc7
refactoring except_hook
ilongin Oct 5, 2025
0e09ce8
moved tests from test_datachain to test_job_management
ilongin Oct 6, 2025
a7ff56b
fixing issue with updating job state because of hook after db is cleaned
ilongin Oct 6, 2025
8c006c8
fixing typing
ilongin Oct 6, 2025
c4f1b34
merging with main
ilongin Oct 7, 2025
74a1106
fixing windows tests
ilongin Oct 7, 2025
b009a4a
more robust check if is script run
ilongin Oct 8, 2025
d3acdb0
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 8, 2025
2e924f2
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 9, 2025
fb9306b
using util function to clean hooks
ilongin Oct 9, 2025
57d4845
removed session arg from reset_session_job_state
ilongin Oct 9, 2025
813cbd5
added checkpoint test with parallel and fixed deregistering job hooks
ilongin Oct 9, 2025
6ecf6b0
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 13, 2025
026996e
more granular exception check
ilongin Oct 13, 2025
b368d93
using better fixture
ilongin Oct 13, 2025
cedbabd
moved test_checkpoints_parallel to func tests
ilongin Oct 13, 2025
faee8ac
increasing number of rows in test
ilongin Oct 13, 2025
ee6d880
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 14, 2025
354185b
removing logic of removing datasets on job failure
ilongin Oct 14, 2025
07d881e
moved function to abstract
ilongin Oct 15, 2025
ecf102f
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 15, 2025
44e738c
fix test
ilongin Oct 15, 2025
76aba6c
adding docs and removing not needed abstract method
ilongin Oct 15, 2025
1fabb7a
adding checkpoint docs link
ilongin Oct 15, 2025
737a05a
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 16, 2025
ef80ca1
fixing docs
ilongin Oct 16, 2025
38e2bfc
adding missing abstract method
ilongin Oct 16, 2025
1167983
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 16, 2025
4237b88
fixing parsing parent job id
ilongin Oct 17, 2025
f374500
skipping hf tests
ilongin Oct 17, 2025
fdd533d
returning tests
ilongin Oct 17, 2025
c6f7266
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 17, 2025
24c6b36
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 18, 2025
77eafbc
fix test
ilongin Oct 18, 2025
380f36f
merged with main
ilongin Oct 19, 2025
855c841
Merge branch 'main' into ilongin/1361-local-checkpoints-usage
ilongin Oct 19, 2025
207 changes: 207 additions & 0 deletions docs/guide/checkpoints.md
Member

looks good! how much of this is AI generated?

Contributor Author

All of it. I just did maybe 5-6 prompts to make some things better, but even the first version would have been good enough, I think. So far I've found AI most useful for generating docs and for helping understand the cause of non-trivial bugs.

@@ -0,0 +1,207 @@
# Checkpoints

Checkpoints allow DataChain to automatically skip re-creating datasets that were successfully saved in previous script runs. When a script fails or is interrupted, you can re-run it and DataChain will resume from where it left off, reusing datasets that were already created.

**Note:** Checkpoints are currently available only for local script runs. Support for Studio is planned for future releases.

## How Checkpoints Work

When you run a Python script locally (e.g., `python my_script.py`), DataChain automatically:

1. **Creates a job** for the script execution, using the script's absolute path as the job name
2. **Tracks parent jobs** by finding the last job with the same script name
3. **Calculates hashes** for each dataset save operation based on the DataChain operations chain
4. **Creates checkpoints** after each successful `.save()` call, storing the hash
5. **Checks for existing checkpoints** on subsequent runs - if a matching checkpoint exists in the parent job, DataChain skips the save and reuses the existing dataset

This means that if your script creates multiple datasets and fails partway through, the next run will skip recreating the datasets that were already successfully saved.
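
Conceptually, each `.save()` boils down to a skip-or-recompute decision keyed by the chain hash. The snippet below is a minimal, self-contained model of that decision; the names and data structures are illustrative, not DataChain's internal API:

```python
# Minimal model of the skip-or-recompute decision; not DataChain's internal API.
parent_checkpoints = {"h1", "h2"}     # hashes recorded by the previous (parent) job
current_checkpoints: list[str] = []   # hashes recorded so far in this run

def save(chain_hash: str, compute):
    if chain_hash in parent_checkpoints:
        result = "reused existing dataset"   # matching checkpoint: skip the work
    else:
        result = compute()                   # no match: recompute and save
    current_checkpoints.append(chain_hash)   # record progress for the next run
    return result

save("h1", lambda: "computed stage 1")  # reused
save("h3", lambda: "computed stage 3")  # recomputed, because the hash changed
```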

## Example

Consider this script that processes data in multiple stages:

```python
import datachain as dc

# Stage 1: Load and filter data
filtered = (
    dc.read_csv("s3://mybucket/data.csv")
    .filter(dc.C("score") > 0.5)
    .save("filtered_data")
)

# Stage 2: Transform data
transformed = (
    filtered
    .map(value=lambda x: x * 2, output=float)
    .save("transformed_data")
)

# Stage 3: Aggregate results
result = (
    transformed
    .agg(
        total=lambda values: sum(values),
        partition_by="category",
    )
    .save("final_results")
)
```

**First run:** The script executes all three stages and creates three datasets: `filtered_data`, `transformed_data`, and `final_results`. If the script fails during Stage 3, only `filtered_data` and `transformed_data` are saved.

**Second run:** DataChain detects that `filtered_data` and `transformed_data` were already created in the parent job with matching hashes. It skips recreating them and proceeds directly to Stage 3, creating only `final_results`.

## When Checkpoints Are Used

Checkpoints are automatically used when:

- Running a Python script locally (e.g., `python my_script.py`)
- The script has been run before
- A dataset with the same name is being saved
- The chain hash matches a checkpoint from the parent job

Checkpoints are **not** used when (see the sketch after this list):

- Running code interactively (Python REPL, Jupyter notebooks)
- Running code as a module (e.g., `python -m mymodule`)
- The `DATACHAIN_CHECKPOINTS_RESET` environment variable is set (see below)
- Running on Studio (checkpoints support planned for future releases)
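
Taken together, the enablement check can be pictured as a small predicate. The helper below is a hypothetical sketch; names and exact rules are illustrative, not DataChain's code:

```python
# Hypothetical sketch of the enablement check; names and rules are illustrative.
import os
import sys

def checkpoints_enabled() -> bool:
    main = sys.modules.get("__main__")
    is_script_run = (
        getattr(main, "__file__", None) is not None  # not a REPL or notebook
        and getattr(main, "__spec__", None) is None   # not `python -m ...`
    )
    reset_requested = bool(os.environ.get("DATACHAIN_CHECKPOINTS_RESET"))
    return is_script_run and not reset_requested
```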

## Resetting Checkpoints

To ignore existing checkpoints and run your script from scratch, set the `DATACHAIN_CHECKPOINTS_RESET` environment variable:

```bash
export DATACHAIN_CHECKPOINTS_RESET=1
python my_script.py
```

Or set it inline:

```bash
DATACHAIN_CHECKPOINTS_RESET=1 python my_script.py
```

This forces DataChain to recreate all datasets, regardless of existing checkpoints.

## How Job Names Are Determined

DataChain uses different strategies for naming jobs depending on how the code is executed:

### Script Execution (Checkpoints Enabled)

When running `python my_script.py`, DataChain uses the **absolute path** to the script as the job name:

```
/home/user/projects/my_script.py
```

This allows DataChain to link runs of the same script together as parent-child jobs, enabling checkpoint lookup.

### Interactive or Module Execution (Checkpoints Disabled)

When running code interactively or as a module, DataChain uses a **unique UUID** as the job name:

```
a1b2c3d4-e5f6-7890-abcd-ef1234567890
```

This prevents unrelated executions from being linked together, but also means checkpoints cannot be used.
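
A rough sketch of this naming strategy (hypothetical helper, not DataChain's actual implementation):

```python
# Hypothetical sketch of job naming; not DataChain's actual implementation.
import os
import sys
import uuid

def job_name() -> str:
    main = sys.modules.get("__main__")
    script = getattr(main, "__file__", None)
    if script and getattr(main, "__spec__", None) is None:
        return os.path.abspath(script)  # plain script run: stable, repeatable name
    return str(uuid.uuid4())            # REPL, notebook, or `python -m ...`: unique name
```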

## How Checkpoint Hashes Are Calculated

For each `.save()` operation, DataChain calculates a hash based on:

1. The hash of the previous checkpoint in the current job (if any)
2. The hash of the current DataChain operations chain

This creates a chain of hashes that uniquely identifies each stage of data processing. On subsequent runs, DataChain matches these hashes against the parent job's checkpoints and skips recreating datasets where the hashes match.
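
The exact hash function is an internal detail, but the chaining can be pictured like this (SHA-256 and the string inputs are assumed purely for illustration):

```python
# Illustration of hash chaining; the real hash function and inputs are internal to DataChain.
import hashlib

def checkpoint_hash(prev_hash: str | None, chain_hash: str) -> str:
    return hashlib.sha256(((prev_hash or "") + chain_hash).encode()).hexdigest()

h1 = checkpoint_hash(None, "read_csv -> save('stage1')")
h2 = checkpoint_hash(h1, "read_dataset('stage1') -> filter -> save('stage2')")  # folds in h1
```

Because `h2` folds in `h1`, changing the first stage changes every hash downstream of it.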

### Hash Invalidation

**Checkpoints are automatically invalidated when you modify the chain.** Any change to the DataChain operations will result in a different hash, causing DataChain to skip the checkpoint and recompute the dataset.

Changes that invalidate checkpoints include:

- **Modifying filter conditions:** `.filter(dc.C("score") > 0.5)` → `.filter(dc.C("score") > 0.8)`
- **Changing map/gen/agg functions:** Any modification to UDF logic
- **Altering function parameters:** Changes to column names, output types, or other parameters
- **Adding or removing operations:** Inserting new `.filter()`, `.map()`, or other steps
- **Reordering operations:** Changing the sequence of transformations

### Example

```python
# First run - creates three checkpoints
dc.read_csv("data.csv").save("stage1") # Hash = H1

dc.read_dataset("stage1").filter(dc.C("x") > 5).save("stage2") # Hash = H2 = hash(H1 + pipeline_hash)

dc.read_dataset("stage2").select("name", "value").save("stage3") # Hash = H3 = hash(H2 + pipeline_hash)
```

**Second run (no changes):**
- All three hashes match → all three datasets are reused → no computation

**Second run (modified filter):**
```python
dc.read_csv("data.csv").save("stage1") # Hash = H1 matches ✓ → reused

dc.read_dataset("stage1").filter(dc.C("x") > 10).save("stage2") # Hash ≠ H2 ✗ → recomputed

dc.read_dataset("stage2").select("name", "value").save("stage3") # Hash ≠ H3 ✗ → recomputed
```

Because the filter changed, `stage2` has a different hash and must be recomputed. Since `stage3` depends on `stage2`, its hash also changes (because it includes H2 in the calculation), so it must be recomputed as well.

**Key insight:** Modifying any step in the chain invalidates that checkpoint and all subsequent checkpoints, because the hash chain is broken.

## Dataset Persistence

Starting with the checkpoints feature, datasets created during script execution persist even if the script fails or is interrupted. This is essential for checkpoint functionality, as it allows subsequent runs to reuse successfully created datasets.

If you need to clean up datasets from failed runs, you can use:

```python
import datachain as dc

# Remove a specific dataset
dc.delete_dataset("dataset_name")

# List all datasets to see what's available
for ds in dc.datasets():
    print(ds.name)
```

## Limitations

- **Local only:** Checkpoints currently work only for local script runs. Studio support is planned.
- **Script-based:** Code must be run as a script (not interactively or as a module).
- **Hash-based matching:** Any change to the chain will create a different hash, preventing checkpoint reuse.
- **Same script path:** The script must be run from the same absolute path for parent job linking to work.

## Future Plans

### Studio Support

Support for checkpoints on Studio is planned for future releases, which will enable checkpoint functionality for collaborative workflows and cloud-based data processing.

### UDF-Level Checkpoints

Currently, checkpoints are created only when datasets are saved using `.save()`. This means that if a script fails during a long-running UDF operation (like `.map()`, `.gen()`, or `.agg()`), the entire UDF computation must be rerun on the next execution.

Future versions will support **UDF-level checkpoints**, creating checkpoints after each UDF step in the chain. This will provide much more granular recovery:

```python
# Future behavior with UDF-level checkpoints
result = (
    dc.read_csv("data.csv")
    .map(heavy_computation_1)  # Checkpoint created after this UDF
    .map(heavy_computation_2)  # Checkpoint created after this UDF
    .map(heavy_computation_3)  # Checkpoint created after this UDF
    .save("result")
)
```

If the script fails during `heavy_computation_3`, the next run will skip re-executing `heavy_computation_1` and `heavy_computation_2`, resuming only the work that wasn't completed.
1 change: 1 addition & 0 deletions docs/guide/index.md
@@ -10,6 +10,7 @@ Welcome to the DataChain User Guide! This section provides comprehensive documen
- [Data Processing Overview](./processing.md) - Discover DataChain's specialized data processing features.
- [Delta Processing](./delta.md) - Incremental data processing to efficiently handle large datasets that change over time.
- [Error Handling and Retries](./retry.md) - Learn how to handle processing errors and selectively reprocess problematic records.
- [Checkpoints](./checkpoints.md) - Automatically resume script execution from where it left off after failures.
- [Environment Variables](./env.md) - Configure DataChain's behavior using environment variables.
- [Namespaces](./namespaces.md) - Learn more about namespaces and projects.
- [Local DB Migrations](./namespaces.md) - Learn how to handle local DB migrations after upgrading datachain.
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -114,6 +114,7 @@ nav:
- Overview: guide/processing.md
- Delta Processing: guide/delta.md
- Errors Handling and Retries: guide/retry.md
- Checkpoints: guide/checkpoints.md
- Environment Variables: guide/env.md
- Namespaces: guide/namespaces.md
- Local DB Migrations: guide/db_migrations.md
2 changes: 2 additions & 0 deletions src/datachain/catalog/catalog.py
@@ -793,6 +793,7 @@ def create_dataset(
description: str | None = None,
attrs: list[str] | None = None,
update_version: str | None = "patch",
job_id: str | None = None,
) -> "DatasetRecord":
"""
Creates new dataset of a specific version.
@@ -866,6 +867,7 @@ create_rows_table=create_rows,
create_rows_table=create_rows,
columns=columns,
uuid=uuid,
job_id=job_id,
)

def create_new_dataset_version(
16 changes: 16 additions & 0 deletions src/datachain/data_storage/metastore.py
@@ -448,6 +448,10 @@ def set_job_status(
def get_job_status(self, job_id: str) -> JobStatus | None:
"""Returns the status of the given job."""

@abstractmethod
def get_last_job_by_name(self, name: str, conn=None) -> "Job | None":
"""Returns the last job with the given name, ordered by created_at."""

#
# Checkpoints
#
@@ -1685,6 +1689,18 @@ def list_jobs_by_ids(self, ids: list[str], conn=None) -> Iterator["Job"]:
query = self._jobs_query().where(self._jobs.c.id.in_(ids))
yield from self._parse_jobs(self.db.execute(query, conn=conn))

def get_last_job_by_name(self, name: str, conn=None) -> "Job | None":
query = (
self._jobs_query()
.where(self._jobs.c.name == name)
.order_by(self._jobs.c.created_at.desc())
.limit(1)
)
results = list(self.db.execute(query, conn=conn))
if not results:
return None
return self._parse_job(results[0])

def create_job(
self,
name: str,
2 changes: 1 addition & 1 deletion src/datachain/job.py
@@ -56,5 +56,5 @@ def parse(
python_version,
error_message,
error_stack,
parent_job_id,
str(parent_job_id) if parent_job_id else None,
)
27 changes: 10 additions & 17 deletions src/datachain/lib/dc/datachain.py
@@ -27,7 +27,6 @@
from datachain.dataset import DatasetRecord
from datachain.delta import delta_disabled
from datachain.error import (
JobNotFoundError,
ProjectCreateNotAllowedError,
ProjectNotFoundError,
)
@@ -627,6 +626,9 @@ def save( # type: ignore[override]
self._validate_version(version)
self._validate_update_version(update_version)

# get existing job if running in SaaS, or create a new one if running locally
job = self.session.get_or_create_job()

namespace_name, project_name, name = catalog.get_full_dataset_name(
name,
namespace_name=self._settings.namespace,
@@ -635,7 +637,7 @@ def save( # type: ignore[override]
project = self._get_or_create_project(namespace_name, project_name)

# Checkpoint handling
job, _hash, result = self._resolve_checkpoint(name, project, kwargs)
_hash, result = self._resolve_checkpoint(name, project, job, kwargs)

# Schema preparation
schema = self.signals_schema.clone_without_sys_signals().serialize()
@@ -655,13 +657,12 @@ def save( # type: ignore[override]
attrs=attrs,
feature_schema=schema,
update_version=update_version,
job_id=job.id,
**kwargs,
)
)

if job:
catalog.metastore.create_checkpoint(job.id, _hash) # type: ignore[arg-type]

catalog.metastore.create_checkpoint(job.id, _hash) # type: ignore[arg-type]
return result

def _validate_version(self, version: str | None) -> None:
@@ -690,23 +691,15 @@ def _resolve_checkpoint(
self,
name: str,
project: Project,
job: Job,
kwargs: dict,
) -> tuple[Job | None, str | None, "DataChain | None"]:
) -> tuple[str, "DataChain | None"]:
"""Check if checkpoint exists and return cached dataset if possible."""
from .datasets import read_dataset

metastore = self.session.catalog.metastore

job_id = os.getenv("DATACHAIN_JOB_ID")
checkpoints_reset = env2bool("DATACHAIN_CHECKPOINTS_RESET", undefined=True)

if not job_id:
return None, None, None

job = metastore.get_job(job_id)
if not job:
raise JobNotFoundError(f"Job with id {job_id} not found")

_hash = self._calculate_job_hash(job.id)

if (
Expand All @@ -718,9 +711,9 @@ def _resolve_checkpoint(
chain = read_dataset(
name, namespace=project.namespace.name, project=project.name, **kwargs
)
return job, _hash, chain
return _hash, chain

return job, _hash, None
return _hash, None

def _handle_delta(
self,
2 changes: 0 additions & 2 deletions src/datachain/lib/dc/records.py
@@ -78,8 +78,6 @@ def read_records(
),
)

session.add_dataset_version(dsr, dsr.latest_version)

if isinstance(to_insert, dict):
to_insert = [to_insert]
elif not to_insert:
4 changes: 0 additions & 4 deletions src/datachain/query/dataset.py
@@ -1927,10 +1927,6 @@ def save(
)
version = version or dataset.latest_version

self.session.add_dataset_version(
dataset=dataset, version=version, listing=kwargs.get("listing", False)
)

dr = self.catalog.warehouse.dataset_rows(dataset)

self.catalog.warehouse.copy_table(dr.get_table(), query.select())