Skip to content

Commit 5b63087

Browse files
Merge pull request #237 from jeromekelleher/some-more-docs2
Some more docs2
2 parents 164c923 + 03f5d1a commit 5b63087

File tree

4 files changed

+109
-5
lines changed

4 files changed

+109
-5
lines changed

.github/workflows/docs.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -33,7 +33,7 @@ jobs:
3333
3434
- name: Install bcftools
3535
run: |
36-
sudo apt-get install bcftools
36+
sudo apt-get install bcftools jq
3737
3838
- name: Install package
3939
run: |

docs/Makefile

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -35,8 +35,9 @@ _static/vcf2zarr_convert.cast: sample.vcf.gz
3535
asciinema-automation cast_scripts/vcf2zarr_convert.sh $@
3636
cp -R sample.vcz vcf2zarr
3737

38+
# TODO rename this cast
3839
_static/vcf2zarr_explode.cast: sample.vcf.gz
39-
rm -Rf sample.icf
40+
rm -Rf sample.icf sample.vcz
4041
asciinema-automation cast_scripts/vcf2zarr_explode.sh $@
41-
cp -R sample.icf vcf2zarr
42+
cp -R sample.icf sample.vcz vcf2zarr
4243

docs/cast_scripts/vcf2zarr_explode.sh

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
11
#$ delay 5
22
vcf2zarr explode sample.vcf.gz sample.icf
33
#$ expect \$
4+
vcf2zarr encode sample.icf sample.vcz
5+
#$ expect \$

docs/vcf2zarr/tutorial.md

Lines changed: 103 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ convert your data, basically providing different levels of
1818
convenience and flexibility corresponding to what you might
1919
need for small, intermediate and large datasets.
2020

21-
## Small
21+
## Small dataset
2222

2323
The simplest way to convert VCF data to Zarr is to use the
2424
{ref}`vcf2zarr convert<cmd-vcf2zarr-convert>` command:
@@ -54,9 +54,35 @@ these directories:
5454
tree sample.vcz
5555
```
5656

57+
You can get a better idea of what's being stored and the sizes
58+
of the different fields using the
59+
{ref}`vcf2zarr inspect<cmd-vcf2zarr-inspect>` command:
5760

58-
## Intermediate
61+
```{code-cell}
62+
vcf2zarr inspect sample.vcz
63+
```
64+
65+
The ``stored`` and ``size`` columns here are important, and tell you
66+
how much storage space is being used by the compressed chunks,
67+
and the full size of the uncompressed array. The ``ratio``
68+
column is the ratio of these two values, with large numbers
69+
indicating a high compression ration. In this example
70+
the compression ratios are less than one because the compressed
71+
chunks are *larger* than the actual arrays. This is because it's
72+
a tiny example, with only 9 variants and 3 samples (see the ``shape``
73+
column), so, for example ``call_genotype`` is only 54 bytes.
74+
75+
## Medium dataset
76+
77+
Conversion in ``vcf2zarr`` is a two step process. First we convert the VCF(s) to
78+
an "intermediate columnar format" (ICF), and then use this as the basis of
79+
generating the final Zarr.
5980

81+
:::{important}
82+
We would recommend using this two-step process for all but the smallest datasets.
83+
:::
84+
85+
In the simplest case, we just call two commands instead of one:
6086
<div id="vcf2zarr_explode"></div>
6187
<script>
6288
AsciinemaPlayer.create('../_static/vcf2zarr_explode.cast',
@@ -66,6 +92,81 @@ AsciinemaPlayer.create('../_static/vcf2zarr_explode.cast',
6692
});
6793
</script>
6894

95+
:::{tip}
96+
Both the {ref}`explode<cmd-vcf2zarr-explode>`
97+
and
98+
{ref}`encode<cmd-vcf2zarr-encode>`
99+
commands allow you to perform the
100+
conversion in parallel using the ``-p/--worker-processes`` option.
101+
:::
102+
103+
The ICF form gives us useful information about the VCF data, and can help us to
104+
decide some of the finer details of the final Zarr encoding. We can also run
105+
{ref}`vcf2zarr inspect<cmd-vcf2zarr-inspect>` command on an ICF:
106+
107+
```{code-cell}
108+
vcf2zarr inspect sample.icf
109+
```
110+
111+
This tells us about the fields found in the input VCF, and provides details
112+
about their data type, raw and compressed size, the maximum dimension of a value
113+
and the maximum and minimum values observed in the field. These extreme values
114+
are use to determine the width of integer types in the final Zarr encoding.
115+
The defaults can be overridden by generating a "storage schema"
116+
using the {ref}`vcf2zarr mkschema<cmd-vcf2zarr-mkschema>` command on an ICF:
117+
118+
```{code-cell}
119+
vcf2zarr mkschema sample.icf > sample.schema.json
120+
head -n 20 sample.schema.json
121+
```
122+
123+
We've displayed the first 20 lines here so you can get a feel for the JSON format.
124+
The [jq](https://jqlang.github.io/jq/) provides a useful way of manipulating
125+
these schemas. Let's look at the schema for just the ``call_genotype``
126+
field, for example:
127+
128+
```{code-cell}
129+
jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
130+
```
131+
132+
If we wanted to trade-off file size for better decoding performance
133+
we could turn off the shuffle option here (which is set to 2, for
134+
Blosc BitShuffle). This can be done by loading the ``sample.schema.json``
135+
file into your favourite editor, and changing the value "2" to "0".
136+
137+
:::{todo}
138+
There is a lot going on here. It would be good to give some specific
139+
instructions for how to do this using jq. We would also want to
140+
explain the numbers provided for shuffle, as well as links to the
141+
Blosc documentation.
142+
Better mechanisms for updating schemas via a Python API would be
143+
a useful addition here.
144+
:::
145+
146+
147+
A common thing we might want to do is to drop some fields entirely
148+
from the final Zarr, either because they are too big and unwieldy
149+
in the standard encoding (e.g. ``call_PL``) or we just don't
150+
want them.
151+
152+
Suppose here wanted to drop the ``FORMAT/HQ``/``call_HQ`` field. This
153+
can be done using ``jq`` like:
154+
155+
```{code-cell}
156+
vcf2zarr mkschema sample.icf \
157+
| jq 'del(.fields[] | select(.name == "call_HQ"))' > sample_noHQ.schema.json
158+
```
159+
Then we can use the updated schema as input to ``encode``:
160+
161+
```{code-cell}
162+
vcf2zarr encode sample.icf -s sample_noHQ.schema.json sample_noHQ.vcz
163+
```
164+
We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:
165+
166+
```{code-cell}
167+
vcf2zarr inspect sample_noHQ.vcz
168+
```
169+
69170

70171
## Large
71172

0 commit comments

Comments
 (0)