Commit 6e87b27

Merge pull request #242 from jeromekelleher/some-more-docs3
Some more docs3
2 parents 5b63087 + 3fba9cf commit 6e87b27

2 files changed: +56 -79 lines changed

docs/Makefile

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@ PYPATH=$(shell pwd)/../
 B2Z_VERSION:=$(shell PYTHONPATH=${PYPATH} \
 	python3 -c 'import bio2zarr; print(bio2zarr.__version__.split("+")[0])')
 
+
+# FIXME this is all very fragile and needs to be rewritten.
+# https://github.com/sgkit-dev/bio2zarr/issues/238
 CASTS=_static/vcf2zarr_convert.cast\
 	_static/vcf2zarr_explode.cast
 

docs/vcf2zarr/tutorial.md

Lines changed: 53 additions & 79 deletions
@@ -121,7 +121,7 @@ head -n 20 sample.schema.json
 ```
 
 We've displayed the first 20 lines here so you can get a feel for the JSON format.
-The [jq](https://jqlang.github.io/jq/) provides a useful way of manipulating
+The [jq](https://jqlang.github.io/jq/) tool provides a useful way of manipulating
 these schemas. Let's look at the schema for just the ``call_genotype``
 field, for example:
 
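The jq command for this step falls outside the hunk shown above. As a rough sketch only, assuming the schema keeps its per-field definitions in a top-level ``fields`` array keyed by ``name``, the kind of filter being described would look like:

```
jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
```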
@@ -158,6 +158,15 @@ vcf2zarr mkschema sample.icf \
 ```
 Then we can use the updated schema as input to ``encode``:
 
+
+<!-- FIXME shouldn't need to do this, but currently the execution model is very -->
+<!-- fragile. -->
+<!-- https://github.com/sgkit-dev/bio2zarr/issues/238 -->
+```{code-cell}
+:tags: [remove-cell]
+rm -fR sample_noHQ.vcz
+```
+
 ```{code-cell}
 vcf2zarr encode sample.icf -s sample_noHQ.schema.json sample_noHQ.vcz
 ```
@@ -167,95 +176,60 @@ We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:
 vcf2zarr inspect sample_noHQ.vcz
 ```
 
+:::{tip}
+Use the ``max-variant-chunks`` option to encode the first few chunks of your
+dataset while doing these kinds of schema tuning operations!
+:::
 
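As a sketch of how the tip might be applied (``--max-variant-chunks`` is the encode option referred to elsewhere in this file; the value of 2 chunks is purely illustrative):

```
vcf2zarr encode sample.icf -s sample_noHQ.schema.json \
    --max-variant-chunks 2 sample_noHQ.vcz
```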
-## Large
-
-
-
-## Parallel encode/explode
-
-
-## Common options
-
-```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
-```
-
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```
-vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-## To be merged with above
+## Large dataset
 
-The simplest usage is:
+The {ref}`explode<cmd-vcf2zarr-explode>`
+and {ref}`encode<cmd-vcf2zarr-encode>` commands have powerful features for
+conversion on a single machine, and can take full advantage of large servers
+with many cores. Current biobank scale datasets, however, are so large that
+we must go a step further and *distribute* computations over a cluster.
+Vcf2zarr provides some low-level utilities that allow you to do this, which
+should be compatible with any cluster scheduler.
 
-```
-$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-```
+The distributed commands are split into three phases:
 
+- **init <num_partitions>**: Initialise the computation, setting up the data structures needed
+for the bulk computation to be split into ``num_partitions`` independent partitions
+- **partition <j>**: Perform the computation of partition ``j``
+- **finalise**: Complete the full process.
 
-This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-step. As this writes the intermediate columnar format to a temporary directory,
-we only recommend this approach for small files (< 1GB, say).
+When performing large-scale computations like this on a cluster, errors and job
+failures are essentially inevitable, and the commands are resilient to various
+failure modes.
 
-The recommended approach is to run the conversion in two passes, and
-to keep the intermediate columnar format ("exploded") around to facilitate
-experimentation with chunk sizes and compression settings:
+Let's go through the example above using the distributed commands. First, we
+use {ref}`dexplode-init<cmd-vcf2zarr-dexplode-init>` to create an ICF directory:
 
+```{code-cell}
+:tags: [remove-cell]
+rm -fR sample-dist.icf
 ```
-$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
+```{code-cell}
+vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5
 ```
 
-The inspect command provides a way to view contents of an exploded ICF
-or Zarr:
+Here we asked ``dexplode-init`` to set up an ICF store in which the data
+is split into 5 partitions. The number of partitions determines the level
+of parallelism, so we would usually set this to the number of
+parallel jobs we would like to use. The output of ``dexplode-init`` is
+important though, as it tells us the **actual** number of partitions that
+we have (partitioning is based on the VCF indexes, which have a limited
+granularity). You should be careful to use this value in your scripts
+(the format is designed to be machine readable using e.g. ``cut`` and
+``grep``). In this case there are only 3 possible partitions.
 
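A minimal sketch of that kind of scripting, assuming (hypothetically) that the ``dexplode-init`` output is tab-separated and includes a ``num_partitions`` field; the real field name and layout should be checked against the actual output:

```
# Hypothetical: capture the partition count reported at init time, so that
# later job scripts loop over exactly that many partitions. The field name
# and tab-separated layout are assumptions about the dexplode-init output.
NUM_PARTITIONS=$(vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5 \
    | grep num_partitions | cut -f 2)
echo $NUM_PARTITIONS
```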
-```
-$ vcf2zarr inspect [PATH]
-```
-
-This is useful when tweaking chunk sizes and compression settings to suit
-your dataset, using the mkschema command and --schema option to encode:
 
-```
-$ vcf2zarr mkschema [ICF_PATH] > schema.json
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-```
+Once ``dexplode-init`` is done and we know how many partitions we have,
+we need to call ``dexplode-partition`` this number of times.
 
-By editing the schema.json file you can drop columns that are not of interest
-and edit column specific compression settings. The --max-variant-chunks option
-to encode allows you to try out these options on small subsets, hopefully
-arriving at settings with the desired balance of compression and query
-performance.
+<!-- ```{code-cell} -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 0 -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 1 -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 2 -->
+<!-- ``` -->
 
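The commented-out cell runs each of the three partitions by hand; an equivalent sketch that loops over them (on a cluster, each iteration would typically be submitted as an independent job) is:

```
# Run each of the 3 partitions reported by dexplode-init.
for j in 0 1 2; do
    vcf2zarr dexplode-partition sample-dist.icf $j
done
# The finalise phase described above then completes the process.
```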