I'm using bio2zarr v0.0.6 and trying to explode 1000G genotype data (note: this is the older genotype data from ~2013, not the recent NYGC WGS data):
vcf2zarr explode data/genotypes/chr22.vcf.gz data/genotypes/chr22.icf -p8
Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 14.1files/s]
Explode: 1.10Mvars [02:45, 6.65kvars/s]
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1173, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1139, in finalise
    assert total_records == self.metadata.num_records
AssertionError
It seems there's a mismatch between the record count stored in the metadata and the total number of records across the partitions. The error still comes up when I use a single worker (-p1), so it's not due to multiprocessing. I printed total_records and self.metadata.num_records and got the following:
total_records: 1103547
metadata.num_records: 0
Any ideas what might be going on? Since metadata.num_records is 0, it looks like it could be an issue in the scan step, perhaps due to a corrupted or outdated VCF format?
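To cross-check the counts independently of bio2zarr, here is a minimal sketch that counts the data lines in the compressed VCF using only the Python standard library. The function name count_vcf_records is just illustrative and is not part of bio2zarr. I would expect its result to agree with total_records if explode is reading every record.

```python
import gzip


def count_vcf_records(path):
    """Count data lines (non-header, non-empty) in a gzip/bgzip-compressed VCF.

    Hypothetical helper for debugging; not a bio2zarr function.
    """
    n = 0
    # bgzip files are concatenated gzip members, which gzip.open reads transparently.
    with gzip.open(path, "rt") as f:
        for line in f:
            # Header and meta lines start with '#'; skip them and any blank lines.
            if line.strip() and not line.startswith("#"):
                n += 1
    return n
```

For example, print(count_vcf_records("data/genotypes/chr22.vcf.gz")) gives a record count to compare against the two numbers above.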
UPDATE:
This issue also affects the newer NYGC WGS VCF files, so it could be a bug introduced in a recent update. It may be related to #144.