We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:

```{code-cell}
vcf2zarr inspect sample_noHQ.vcz
```

:::{tip}
Use the ``max-variants-chunks`` option to encode the first few chunks of your
dataset while doing these kinds of schema tuning operations!
:::

## Large dataset

The {ref}`explode <cmd-vcf2zarr-explode>`
and {ref}`encode <cmd-vcf2zarr-encode>` commands have powerful features for
conversion on a single machine, and can take full advantage of large servers
with many cores. Current biobank-scale datasets, however, are so large that
we must go a step further and *distribute* computations over a cluster.
Vcf2zarr provides some low-level utilities that allow you to do this, and that
should be compatible with any cluster scheduler.
193
+
194
+ The distributed commands are split into three phases:
195
+
196
+ - ** init <num_partitions>** : Initialise the computation, setting up the data structures needed
197
+ for the bulk computation to be split into `` num_partitions `` independent partitions
198
+ - ** partition <j >** : perform the computation of partition `` j ``
199
+ - ** finalise** : Complete the full process.
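
The three phases above compose into a single orchestration loop. A minimal
sketch of that shape (``echo`` stands in for the real ``vcf2zarr`` subcommands
here, since it is the init/partition/finalise structure that matters; in
practice each partition would be submitted as a separate cluster job):

```shell
# Three-phase pattern: one init, N independent partitions, one finalise.
# echo is a stand-in for the corresponding vcf2zarr calls.
num_partitions=3   # in practice, taken from the init command's output

echo "init: set up $num_partitions partitions"
for j in $(seq 0 $((num_partitions - 1))); do
    # Each iteration is independent, so these can run as separate cluster jobs.
    echo "partition: processing partition $j"
done
echo "finalise: merging partition results"
```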

When performing large-scale computations like this on a cluster, errors and job
failures are essentially inevitable, and the commands are resilient to various
failure modes.

Let's go through the example above using the distributed commands. First, we
use {ref}`dexplode-init <cmd-vcf2zarr-dexplode-init>` to create an ICF directory:

```{code-cell}
:tags: [remove-cell]
rm -fR sample-dist.icf
```
```{code-cell}
vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5
```

Here we asked ``dexplode-init`` to set up an ICF store in which the data
is split into 5 partitions. The number of partitions determines the level
of parallelism, so we would usually set this to the number of
parallel jobs we would like to use. The output of ``dexplode-init`` is
important though, as it tells us the **actual** number of partitions that
we have (partitioning is based on the VCF indexes, which have a limited
granularity). You should be careful to use this value in your scripts
(the format is designed to be machine readable using e.g. ``cut`` and
``grep``). In this case there are only 3 possible partitions.
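
As a sketch of that kind of scripting (the ``num_partitions`` line below is
illustrative, not the literal ``dexplode-init`` output format, and the field
position passed to ``cut`` is an assumption):

```shell
# Hypothetical machine-readable init output; the real format may differ.
init_output="num_partitions	3"

# Pull out the partition count: grep selects the line, cut takes the
# (assumed) second tab-separated field.
num_partitions=$(printf '%s\n' "$init_output" | grep num_partitions | cut -f 2)
echo "$num_partitions"
```

Scripts should then loop from ``0`` to ``num_partitions - 1`` rather than
hard-coding the value requested at init time.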


Once ``dexplode-init`` is done and we know how many partitions we have,
we need to call ``dexplode-partition`` this number of times.

<!-- ```{code-cell} -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 0 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 1 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 2 -->
<!-- ``` -->
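
On a real cluster, these per-partition calls usually map onto one job each,
for example via a scheduler's array-job mechanism. A hypothetical SLURM-style
sketch (scheduler syntax varies; the ``0-2`` range must come from the
``dexplode-init`` output, and the ``echo`` wrapper is a stand-in so the sketch
can be dry-run without ``vcf2zarr`` installed):

```shell
#!/bin/bash
# Hypothetical SLURM array job: one task per partition. The 0-2 range
# must match the partition count reported by dexplode-init.
#SBATCH --array=0-2

# SLURM sets SLURM_ARRAY_TASK_ID per task; default to 0 for local dry-runs.
j="${SLURM_ARRAY_TASK_ID:-0}"
echo "would run: vcf2zarr dexplode-partition sample-dist.icf $j"
```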