We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:

```{code-cell}
vcf2zarr inspect sample_noHQ.vcz
```

:::{tip}
Use the ``max-variants-chunks`` option to encode the first few chunks of your
dataset while doing these kinds of schema tuning operations!
:::

## Large dataset

The {ref}`explode <cmd-vcf2zarr-explode>`
and {ref}`encode <cmd-vcf2zarr-encode>` commands have powerful features for
conversion on a single machine, and can take full advantage of large servers
with many cores. Current biobank-scale datasets, however, are so large that
we must go a step further and *distribute* computations over a cluster.
Vcf2zarr provides some low-level utilities that allow you to do this, and that
should be compatible with any cluster scheduler.
193
+
194
+ The distributed commands are split into three phases:
195
+
196
+ - ** init <num_partitions>** : Initialise the computation, setting up the data structures needed
197
+ for the bulk computation to be split into `` num_partitions `` independent partitions
198
+ - ** partition <j >** : perform the computation of partition `` j ``
199
+ - ** finalise** : Complete the full process.
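
The three phases above compose into a single orchestration loop. A minimal
sketch of that shape (``echo`` stands in for the real ``vcf2zarr`` subcommands
here, since it is the init/partition/finalise structure that matters; in
practice each partition would be submitted as a separate cluster job):

```shell
# Three-phase pattern: one init, N independent partitions, one finalise.
# echo is a stand-in for the corresponding vcf2zarr calls.
num_partitions=3   # in practice, taken from the init command's output

echo "init: set up $num_partitions partitions"
for j in $(seq 0 $((num_partitions - 1))); do
    # Each iteration is independent, so these can run as separate cluster jobs.
    echo "partition: processing partition $j"
done
echo "finalise: merging partition results"
```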

When performing large-scale computations like this on a cluster, errors and job
failures are essentially inevitable, and the commands are resilient to various
failure modes.

Let's go through the example above using the distributed commands. First, we
use {ref}`dexplode-init <cmd-vcf2zarr-dexplode-init>` to create an ICF directory:

```{code-cell}
:tags: [remove-cell]
rm -fR sample-dist.icf
```
```{code-cell}
vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5
```

Here we asked ``dexplode-init`` to set up an ICF store in which the data
is split into 5 partitions. The number of partitions determines the level
of parallelism, so we would usually set this to the number of
parallel jobs we would like to use. The output of ``dexplode-init`` is
important though, as it tells us the **actual** number of partitions that
we have (partitioning is based on the VCF indexes, which have a limited
granularity). You should be careful to use this value in your scripts
(the format is designed to be machine readable using e.g. ``cut`` and
``grep``). In this case there are only 3 possible partitions.
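
As a sketch of that kind of scripting (the ``num_partitions`` line below is
illustrative, not the literal ``dexplode-init`` output format, and the field
position passed to ``cut`` is an assumption):

```shell
# Hypothetical machine-readable init output; the real format may differ.
init_output="num_partitions	3"

# Pull out the partition count: grep selects the line, cut takes the
# (assumed) second tab-separated field.
num_partitions=$(printf '%s\n' "$init_output" | grep num_partitions | cut -f 2)
echo "$num_partitions"
```

Scripts should then loop from ``0`` to ``num_partitions - 1`` rather than
hard-coding the value requested at init time.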


Once ``dexplode-init`` is done and we know how many partitions we have,
we need to call ``dexplode-partition`` this number of times.

<!-- ```{code-cell} -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 0 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 1 -->
<!-- vcf2zarr dexplode-partition sample-dist.icf 2 -->
<!-- ``` -->
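
On a real cluster, these per-partition calls usually map onto one job each,
for example via a scheduler's array-job mechanism. A hypothetical SLURM-style
sketch (scheduler syntax varies; the ``0-2`` range must come from the
``dexplode-init`` output, and the ``echo`` wrapper is a stand-in so the sketch
can be dry-run without ``vcf2zarr`` installed):

```shell
#!/bin/bash
# Hypothetical SLURM array job: one task per partition. The 0-2 range
# must match the partition count reported by dexplode-init.
#SBATCH --array=0-2

# SLURM sets SLURM_ARRAY_TASK_ID per task; default to 0 for local dry-runs.
j="${SLURM_ARRAY_TASK_ID:-0}"
echo "would run: vcf2zarr dexplode-partition sample-dist.icf $j"
```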