Commit 35c62b0

Merge pull request #69 from jeromekelleher/pre-release-stuff
Pre release stuff
2 parents 6695d35 + 81f59b3 commit 35c62b0

3 files changed: +72 -17 lines changed

README.md

Lines changed: 65 additions & 12 deletions
@@ -4,15 +4,37 @@ Convert bioinformatics file formats to Zarr
 Initially supports converting VCF to the
 [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
 
-**This is early alpha-status code: everything is subject to change, a
+**This is early alpha-status code: everything is subject to change,
 and it has not been thoroughly tested**
 
-## Usage
+## Install
+
+```
+$ python3 -m pip install bio2zarr
+```
+
+This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+into your local Python path. You may need to update your $PATH to call the
+executables directly.
+
+Alternatively, calling
+```
+$ python3 -m bio2zarr vcf2zarr <args>
+```
+is equivalent to
+
+```
+$ vcf2zarr <args>
+```
+and will always work.
+
+
+## vcf2zarr
 
 Convert a VCF to zarr format:
 
 ```
-python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>
+$ vcf2zarr convert <VCF1> <VCF2> <zarr>
 ```
 
 Converts the VCF to zarr format.
@@ -21,33 +43,64 @@ Converts the VCF to zarr format.
 
 The recommended approach is to use a multi-stage conversion
 
-First, convert the VCF into an intermediate columnar format:
+First, convert the VCF into the intermediate format:
 
 ```
-python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
 ```
 
 Then, (optionally) inspect this representation to get a feel for your dataset
 ```
-python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded
+vcf2zarr inspect tmp/sample.exploded
 ```
 
 Then, (optionally) generate a conversion schema to describe the corresponding
 Zarr arrays:
 
 ```
-python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
 ```
 
-View and edit the schema, deleting any columns you don't want.
-
-Finally, convert to Zarr
+View and edit the schema, deleting any columns you don't want, or tweaking
+dtypes and compression settings to your taste.
 
+Finally, encode to Zarr:
 ```
-python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
 ```
 
 Use the ``-p, --worker-processes`` argument to control the number of workers used
-to do zarr encoding.
+in the ``explode`` and ``encode`` phases.
+
+## plink2zarr
+
+Convert a plink ``.bed`` file to zarr format. **This is incomplete**
+
+## vcf_partition
+
+Partition a given VCF file into (approximately) a give number of regions:
+
+```
+vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
+```
+gives
+```
+chr20:1-6799360
+chr20:6799361-14319616
+chr20:14319617-21790720
+chr20:21790721-28770304
+chr20:28770305-31096832
+chr20:31096833-38043648
+chr20:38043649-45580288
+chr20:45580289-52117504
+chr20:52117505-58834944
+chr20:58834945-
+```
+
+These reqion strings can then be used to split computation of the VCF
+into chunks for parallelisation.
 
+**TODO give a nice example here using xargs**
 
+**WARNING that this does not take into account that indels may overlap
+partitions and you may count variants twice or more if they do**
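
Note on the `vcf_partition` section: the README leaves the parallelisation example as a TODO and warns that indels may overlap partitions. The following is a minimal Python sketch, not part of this commit, of how the region strings might be parsed and how a variant spanning a partition boundary can be detected. The names `Region`, `parse_region` and `overlaps` are hypothetical, and the sketch assumes the `contig:start-end` form shown in the README output, with 1-based inclusive coordinates and an optional open end.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: str
    start: int            # 1-based, inclusive (assumed)
    end: Optional[int]    # None means "to the end of the contig"


def parse_region(text: str) -> Region:
    """Parse a string such as 'chr20:1-6799360' or 'chr20:58834945-'."""
    contig, _, interval = text.rpartition(":")
    start, _, end = interval.partition("-")
    return Region(contig, int(start), int(end) if end else None)


def overlaps(region: Region, pos: int, ref_len: int) -> bool:
    """True if a variant at 1-based `pos` whose REF allele covers
    `ref_len` bases overlaps `region`."""
    last = pos + ref_len - 1
    if last < region.start:
        return False
    return region.end is None or pos <= region.end


regions = [
    parse_region(r)
    for r in ["chr20:1-6799360", "chr20:6799361-14319616", "chr20:58834945-"]
]

# A hypothetical 5-base deletion starting two bases before the first boundary
# overlaps both the first and the second partition.
hits = [r for r in regions if overlaps(r, 6799359, 5)]
```

A per-region job that counts every overlapping record would count this variant once per partition it touches. One way to avoid that, under the same assumptions, is to attribute a variant only to the partition that contains its start position.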

bio2zarr/cli.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 help="Chunk size in the samples dimension",
 )
 
-version = click.version_option(version=provenance.__version__)
+version = click.version_option(version=f"bio2zarr {provenance.__version__}")
 
 
 # Note: logging hasn't been implemented in the code at all, this is just

setup.cfg

Lines changed: 6 additions & 4 deletions
@@ -3,10 +3,11 @@ name = bio2zarr
 author = sgkit Developers
 author_email = project@pystatgen.org
 license = Apache
-description = FIXME
+description = Convert bioinformatics data to Zarr
 long_description_content_type=text/x-rst
 long_description =
-FIXME
+This is an early alpha release for testing and development.
+**Do not use in production**
 url = https://github.com/pystatgen/bio2zarr
 classifiers =
 Development Status :: 3 - Alpha
@@ -15,7 +16,6 @@ classifiers =
 Intended Audience :: Science/Research
 Programming Language :: Python
 Programming Language :: Python :: 3
-Programming Language :: Python :: 3.8
 Programming Language :: Python :: 3.9
 Programming Language :: Python :: 3.10
 Programming Language :: Python :: 3.11
@@ -25,7 +25,7 @@ classifiers =
 packages = bio2zarr
 zip_safe = False # https://mypy.readthedocs.io/en/latest/installed_packages.html
 include_package_data = True
-python_requires = >=3.8
+python_requires = >=3.9
 install_requires =
 numpy
 zarr >= 2.10.0, != 2.11.0, != 2.11.1, != 2.11.2
@@ -45,6 +45,8 @@ setup_requires =
 console_scripts =
 vcf2zarr = bio2zarr.cli:vcf2zarr
 plink2zarr = bio2zarr.cli:plink2zarr
+# TODO I don't like this name, anything better?
+vcf_partition = bio2zarr.cli:vcf_partition
 
 [flake8]
 ignore =
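
Note on the `console_scripts` hunk: each entry maps an installed command name to a `module:function` target, which is how the README's three programs end up on the path. The sketch below, which is not part of this commit, reads such entries with the standard-library `configparser`. The section name `[options.entry_points]` and the indentation are assumptions (the diff shows neither), and the snippet is a hand-written excerpt, not the full `setup.cfg`.

```python
import configparser

# Hand-written excerpt; section name and indentation are assumed.
SETUP_CFG = """\
[options.entry_points]
console_scripts =
    vcf2zarr = bio2zarr.cli:vcf2zarr
    plink2zarr = bio2zarr.cli:plink2zarr
    vcf_partition = bio2zarr.cli:vcf_partition
"""

parser = configparser.ConfigParser()
parser.read_string(SETUP_CFG)

scripts = {}
for line in parser["options.entry_points"]["console_scripts"].splitlines():
    if not line.strip():
        continue  # the multi-line value starts with an empty first line
    name, target = (part.strip() for part in line.split("=", 1))
    module, func = target.split(":")
    scripts[name] = (module, func)

for name, (module, func) in scripts.items():
    print(f"{name} -> {module}.{func}")
```

After this commit, `vcf_partition` resolves to the `vcf_partition` function in `bio2zarr.cli`, alongside the two existing commands.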
