convert your data, basically providing different levels of
convenience and flexibility corresponding to what you might
need for small, intermediate and large datasets.

## Small dataset

The simplest way to convert VCF data to Zarr is to use the
{ref}`vcf2zarr convert<cmd-vcf2zarr-convert>` command:
these directories:

```{code-cell}
tree sample.vcz
```

You can get a better idea of what's being stored and the sizes
of the different fields using the
{ref}`vcf2zarr inspect<cmd-vcf2zarr-inspect>` command:

```{code-cell}
vcf2zarr inspect sample.vcz
```

The `stored` and `size` columns here are important, and tell you
how much storage space is being used by the compressed chunks,
and the full size of the uncompressed array. The `ratio`
column is the ratio of these two values, with large numbers
indicating a high compression ratio. In this example
the compression ratios are less than one because the compressed
chunks are *larger* than the actual arrays. This is because it's
a tiny example, with only 9 variants and 3 samples (see the `shape`
column), so, for example, `call_genotype` is only 54 bytes.
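To make the relationship between these columns concrete, here is a small Python sketch computing the `ratio` column as uncompressed size divided by stored size. The `call_genotype` figure of 54 bytes comes from the text above (9 variants × 3 samples × ploidy 2, one byte each); the stored byte counts are made-up illustrations, not taken from a real run:

```python
# ratio = uncompressed size / compressed (stored) size.
# call_genotype's 54 bytes matches the text; the stored values are
# hypothetical, chosen to illustrate per-chunk overhead on tiny data.
fields = {
    # name: (uncompressed_bytes, stored_bytes)
    "call_genotype": (54, 2100),
    "variant_position": (72, 1800),
}

for name, (size, stored) in fields.items():
    ratio = size / stored
    # A ratio below 1 means the compressed chunks are *larger* than the
    # raw data, as happens for tiny datasets due to per-chunk overhead.
    print(f"{name}: ratio = {ratio:.3f}")
```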

## Medium dataset

Conversion in `vcf2zarr` is a two-step process. First we convert the VCF(s) to
an "intermediate columnar format" (ICF), and then use this as the basis of
generating the final Zarr.

:::{important}
We recommend using this two-step process for all but the smallest datasets.
:::

In the simplest case, we just call two commands instead of one:

<div id="vcf2zarr_explode"></div>
<script>
AsciinemaPlayer.create('../_static/vcf2zarr_explode.cast',
});
</script>

:::{tip}
Both the {ref}`explode<cmd-vcf2zarr-explode>` and
{ref}`encode<cmd-vcf2zarr-encode>` commands allow you to perform the
conversion in parallel using the `-p/--worker-processes` option.
:::

The ICF form gives us useful information about the VCF data, and can help us to
decide some of the finer details of the final Zarr encoding. We can also run the
{ref}`vcf2zarr inspect<cmd-vcf2zarr-inspect>` command on an ICF:

```{code-cell}
vcf2zarr inspect sample.icf
```

This tells us about the fields found in the input VCF, and provides details
about their data type, raw and compressed size, the maximum dimension of a value,
and the maximum and minimum values observed in the field. These extreme values
are used to determine the width of integer types in the final Zarr encoding.
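The exact rules vcf2zarr applies are internal to the tool, but the idea of choosing an integer width from observed extremes can be sketched in a few lines of Python:

```python
def pick_int_width(min_value: int, max_value: int) -> int:
    """Return the smallest signed integer width (in bits) that can
    represent every value between min_value and max_value.

    An illustration of the idea only; vcf2zarr's actual dtype-selection
    logic may differ in its details.
    """
    for bits in (8, 16, 32, 64):
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if lo <= min_value and max_value <= hi:
            return bits
    raise ValueError("values do not fit in a 64-bit integer")

# A field with values in [-1, 3] fits in int8, while one peaking
# at 100000 needs int32.
print(pick_int_width(-1, 3))       # 8
print(pick_int_width(0, 100000))   # 32
```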
The defaults can be overridden by generating a "storage schema"
using the {ref}`vcf2zarr mkschema<cmd-vcf2zarr-mkschema>` command on an ICF:

```{code-cell}
vcf2zarr mkschema sample.icf > sample.schema.json
head -n 20 sample.schema.json
```

We've displayed the first 20 lines here so you can get a feel for the JSON format.
The [jq](https://jqlang.github.io/jq/) utility provides a useful way of manipulating
these schemas. Let's look at the schema for just the `call_genotype`
field, for example:

```{code-cell}
jq '.fields[] | select(.name == "call_genotype")' < sample.schema.json
```

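If you'd rather work from Python than jq, the standard `json` module can do the same selection. This sketch assumes only what the jq query shows: the schema is a JSON object with a `fields` list whose entries carry a `name` (the miniature schema below is hypothetical; the real file produced by `mkschema` would be loaded with `json.load`):

```python
import json

# Hypothetical miniature schema with the same top-level shape the jq
# query relies on: an object with a "fields" list of named entries.
schema = json.loads(
    '{"fields": [{"name": "variant_position"}, {"name": "call_genotype"}]}'
)

# Equivalent of: jq '.fields[] | select(.name == "call_genotype")'
matches = [f for f in schema["fields"] if f["name"] == "call_genotype"]
print(json.dumps(matches[0], indent=2))
```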
If we wanted to trade off file size for better decoding performance
we could turn off the shuffle option here (which is set to 2, for
Blosc BitShuffle). This can be done by loading the `sample.schema.json`
file into your favourite editor and changing the value "2" to "0".
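In place of hand-editing, the same change can be scripted in Python. The schema layout below is a guessed miniature: the only assumption is that each field entry carries a compressor configuration with a `shuffle` setting, as the text above describes; the real `mkschema` output would be loaded with `json.load`:

```python
import json

# Guessed miniature of the schema structure (hypothetical "cname" value);
# only the "fields"/"name"/"shuffle" shape matters for this sketch.
schema = json.loads("""
{"fields": [
  {"name": "call_genotype", "compressor": {"cname": "zstd", "shuffle": 2}}
]}
""")

for field in schema["fields"]:
    if field["name"] == "call_genotype":
        # 2 = Blosc BitShuffle; 0 = no shuffle
        field["compressor"]["shuffle"] = 0

print(json.dumps(schema, indent=2))
```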

:::{todo}
There is a lot going on here. It would be good to give some specific
instructions for how to do this using jq. We would also want to
explain the numbers provided for shuffle, as well as links to the
Blosc documentation.
Better mechanisms for updating schemas via a Python API would be
a useful addition here.
:::

147
A common thing we might want to do is to drop some fields entirely
from the final Zarr, either because they are too big and unwieldy
in the standard encoding (e.g. `call_PL`) or we just don't
want them.

Suppose we wanted to drop the `FORMAT/HQ`/`call_HQ` field. This
can be done using `jq` like:

```{code-cell}
vcf2zarr mkschema sample.icf \
| jq 'del(.fields[] | select(.name == "call_HQ"))' > sample_noHQ.schema.json
```

Then we can use the updated schema as input to `encode`:
+
161
+ ``` {code-cell}
162
+ vcf2zarr encode sample.icf -s sample_noHQ.schema.json sample_noHQ.vcz
163
+ ```
164
+ We can then `` inspect `` to see that there is no `` call_HQ `` array in the output:
165
+
166
+ ``` {code-cell}
167
+ vcf2zarr inspect sample_noHQ.vcz
168
+ ```
169
+
69
170
70
171
## Large dataset