
VCF indexes without record counts lead to assertion #150

Closed
@shz9

I'm using bio2zarr v0.0.6 and trying to explode 1000G genotype data (note: not the recent NYGC WGS data, but the older genotype data from ~2013):

vcf2zarr explode data/genotypes/chr22.vcf.gz data/genotypes/chr22.icf -p8
    Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 14.1files/s]
 Explode: 1.10Mvars [02:45, 6.65kvars/s]
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1173, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1139, in finalise
    assert total_records == self.metadata.num_records
AssertionError

It seems there's a mismatch between the record count stored in the metadata and the total number of records across the partitions. The error still comes up when I use a single worker (-p1), so it's not due to multiprocessing. I printed total_records and self.metadata.num_records and got the following:

total_records: 1103547
metadata.num_records: 0
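
To make the failure mode concrete, here is a minimal sketch of the comparison that trips the assertion. This is not bio2zarr's actual code: the function name `diagnose_record_counts` and its parameters are hypothetical, and the whole file is treated as a single partition purely for illustration.

```python
def diagnose_record_counts(partition_counts, indexed_count):
    """Compare the records actually written with the count taken from the metadata.

    partition_counts: records written per partition (hypothetical input).
    indexed_count: the record count stored in the metadata (hypothetical input).
    """
    total_records = sum(partition_counts)
    if total_records == indexed_count:
        return "ok"
    if indexed_count == 0 and total_records > 0:
        # Records were written, but the metadata claims there are none.
        return "metadata reports no records"
    return "mismatch"


# Values printed above, assuming one partition for illustration.
print(diagnose_record_counts([1103547], 0))
```

A metadata count of exactly 0 alongside a non-zero number of written records suggests the count was never populated during the scan, rather than records being lost while exploding.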

Any ideas what might be going on? Could this be an issue in the scan step caused by a corrupted or outdated VCF format?

UPDATE:

This issue also affects the newer NYGC WGS VCF files. Could this be a bug introduced in a recent update? It may be related to #144.

Labels: bug
