Closed
Description
I'm using bio2zarr v0.0.6
and trying to explode 1000G genotype data (NOTE: not recent NYGC WGS data; but older genotype data from ~2013):
vcf2zarr explode data/genotypes/chr22.vcf.gz data/genotypes/chr22.icf -p8
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 14.1files/s]
Explode: 1.10Mvars [02:45, 6.65kvars/s]
Traceback (most recent call last):
File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
sys.exit(vcf2zarr())
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
vcf.explode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1173, in explode
writer.finalise()
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1139, in finalise
assert total_records == self.metadata.num_records
AssertionError
It seems there's a mismatch between the metadata and the number of records in each of the partitions? This error still comes up when I use a single worker -p1
, so it's not due to multiprocessing. I printed total_records
and self.metadata.num_records
and got the following:
total_records: 1103547
metadata.num_records: 0
Any ideas what might be going on? It seems like it could be an issue in scan
due to corrupted or outdated VCF format?
UPDATE:
This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .