
Commit 3fba9cf

Add some docs for distributed commands
1 parent 27e43a3 commit 3fba9cf


docs/vcf2zarr/tutorial.md

Lines changed: 52 additions & 1 deletion
@@ -176,9 +176,60 @@ We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:
vcf2zarr inspect sample_noHQ.vcz
```

:::{tip}
Use the ``max-variant-chunks`` option to encode the first few chunks of your
dataset while doing these kinds of schema tuning operations!
:::

## Large dataset

The {ref}`explode<cmd-vcf2zarr-explode>`
and {ref}`encode<cmd-vcf2zarr-encode>` commands have powerful features for
conversion on a single machine, and can take full advantage of large servers
with many cores. Current biobank-scale datasets, however, are so large that
we must go a step further and *distribute* computations over a cluster.
Vcf2zarr provides some low-level utilities that allow you to do this, and they
should be compatible with any cluster scheduler.

The distributed commands are split into three phases (see the sketch after the list):

- **init <num_partitions>**: Initialise the computation, setting up the data structures needed
  for the bulk computation to be split into ``num_partitions`` independent partitions
- **partition <j>**: Perform the computation of partition ``j``
- **finalise**: Complete the full process.
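
Taken together, a serial run of the explode pipeline follows the pattern below. This is only a rough sketch: the partition count and file names are placeholders, and the finalise command name (``dexplode-finalise``) is assumed here from the phase name, mirroring ``dexplode-init`` and ``dexplode-partition``.

```bash
# Sketch of the three-phase pattern, independent of any particular scheduler.

# init: set up the ICF store, requesting (up to) 8 partitions
vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 8

# partition: run each partition independently -- on a cluster, each of these
# would typically be a separate job rather than a loop iteration
N=8   # use the actual partition count reported by dexplode-init
for j in $(seq 0 $((N - 1))); do
    vcf2zarr dexplode-partition sample-dist.icf "$j"
done

# finalise: assemble the completed ICF store once every partition has succeeded
# (command name assumed from the phase name)
vcf2zarr dexplode-finalise sample-dist.icf
```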

When performing large-scale computations like this on a cluster, errors and job
failures are essentially inevitable, and the commands are resilient to various
failure modes.

Let's go through the example above using the distributed commands. First, we
use {ref}`dexplode-init<cmd-vcf2zarr-dexplode-init>` to create an ICF directory:

```{code-cell}
:tags: [remove-cell]
rm -fR sample-dist.icf
```
```{code-cell}
vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5
```

Here we asked ``dexplode-init`` to set up an ICF store in which the data
is split into 5 partitions. The number of partitions determines the level
of parallelism, so we would usually set this to the number of
parallel jobs we would like to use. The output of ``dexplode-init`` is
important, though, as it tells us the **actual** number of partitions that
we have (partitioning is based on the VCF indexes, which have a limited
granularity). You should be careful to use this value in your scripts
(the format is designed to be machine-readable using e.g. ``cut`` and
``grep``). In this case there are only 3 possible partitions.
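
In a submission script you would typically capture that reported value rather than hard-coding it. A minimal sketch, assuming the summary printed by ``dexplode-init`` contains a line whose first field is ``num_partitions`` (check the exact output of your version and adjust the pattern accordingly):

```bash
# Capture the actual number of partitions reported by dexplode-init.
# Assumes a tab-separated "num_partitions<TAB>3"-style line in the output;
# the field name and separator here are assumptions, not a documented contract.
NUM_PARTITIONS=$(vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5 \
    | grep num_partitions | cut -f 2)
echo "dexplode-init created ${NUM_PARTITIONS} partitions"
```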

Once ``dexplode-init`` is done and we know how many partitions we have,
we need to call ``dexplode-partition`` this number of times.

<!-- ```{code-cell} -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 0 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 1 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 2 -->
<!-- ``` -->
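
Because the partition jobs are independent, how they run in parallel is up to you. As one illustration (not from the original docs), on a SLURM cluster each partition could be one task in a job array; the script name and array bounds below are placeholders, sized from the ``dexplode-init`` output:

```bash
#!/bin/bash
# dexplode_partition.sh -- hypothetical SLURM array-job script, one task per partition.
# Submit with an array matching the partition count reported by dexplode-init,
# e.g.: sbatch --array=0-2 dexplode_partition.sh
#SBATCH --array=0-2
#SBATCH --job-name=dexplode-partition

vcf2zarr dexplode-partition sample-dist.icf "${SLURM_ARRAY_TASK_ID}"
```

On a single large machine, a plain shell loop or ``xargs -P`` over the partition indexes achieves the same thing; either way, every partition must complete successfully before moving on to the finalise phase.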
