@@ -4,15 +4,37 @@ Convert bioinformatics file formats to Zarr
Initially supports converting VCF to the
[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)

- **This is early alpha-status code: everything is subject to change, a
+ **This is early alpha-status code: everything is subject to change,
and it has not been thoroughly tested**

- ## Usage
+ ## Install
+
+ ```
+ $ python3 -m pip install bio2zarr
+ ```
+
+ This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+ into your local Python path. You may need to update your $PATH to call the
+ executables directly.
+
+ Alternatively, calling
+ ```
+ $ python3 -m bio2zarr vcf2zarr <args>
+ ```
+ is equivalent to
+
+ ```
+ $ vcf2zarr <args>
+ ```
+ and will always work.
+
+
+ ## vcf2zarr

Convert a VCF to zarr format:

```
- python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>
+ $ vcf2zarr convert <VCF1> <VCF2> <zarr>
```

Converts the VCF to zarr format.
@@ -21,33 +43,64 @@ Converts the VCF to zarr format.

The recommended approach is to use a multi-stage conversion

- First, convert the VCF into an intermediate columnar format:
+ First, convert the VCF into the intermediate format:

```
- python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
```

Then, (optionally) inspect this representation to get a feel for your dataset
```
- python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded
+ vcf2zarr inspect tmp/sample.exploded
```

Then, (optionally) generate a conversion schema to describe the corresponding
Zarr arrays:

```
- python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
```

- View and edit the schema, deleting any columns you don't want.
-
- Finally, convert to Zarr
+ View and edit the schema, deleting any columns you don't want, or tweaking
+ dtypes and compression settings to your taste.

+ Finally, encode to Zarr:
```
- python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
```

Use the ``-p, --worker-processes`` argument to control the number of workers used
- to do zarr encoding.
+ in the ``explode`` and ``encode`` phases.
+
+ ## plink2zarr
+
+ Convert a plink ``.bed`` file to zarr format. **This is incomplete**
+
+ ## vcf_partition
+
+ Partition a given VCF file into (approximately) a given number of regions:
+
+ ```
+ vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
+ ```
+ gives
+ ```
+ chr20:1-6799360
+ chr20:6799361-14319616
+ chr20:14319617-21790720
+ chr20:21790721-28770304
+ chr20:28770305-31096832
+ chr20:31096833-38043648
+ chr20:38043649-45580288
+ chr20:45580289-52117504
+ chr20:52117505-58834944
+ chr20:58834945-
+ ```
+
+ These region strings can then be used to split computation of the VCF
+ into chunks for parallelisation.

+ **TODO give a nice example here using xargs**

+ **WARNING that this does not take into account that indels may overlap
+ partitions and you may count variants twice or more if they do**
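
The TODO in the diff asks for an xargs example. A minimal sketch follows, under stated assumptions: the three region strings are a subset copied from the sample `vcf_partition` output (in real use you would pipe the output of `vcf_partition <VCF> -n <N>` instead), and `echo` is only a placeholder for whatever per-region command you want to run. The `-P` flag (number of parallel jobs) is supported by GNU and BSD xargs but is not part of POSIX.

```shell
# Stand-in for `vcf_partition <VCF> -n <N>`: a subset of the region strings
# from the sample output. In real use, pipe vcf_partition's output instead.
regions='chr20:1-6799360
chr20:6799361-14319616
chr20:58834945-'

# Run one job per region, up to 4 at a time. `echo` is a placeholder for the
# real per-region command. Parallel output order is not guaranteed, so sort it.
out=$(printf '%s\n' "$regions" | xargs -P 4 -I {} echo "processing region {}" | sort)
printf '%s\n' "$out"
```

Given the warning about indels overlapping partitions, results combined across regions this way may count some variants more than once.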