Skip to content

VCF indexes without record counts lead to assertion #150

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shz9 opened this issue Apr 28, 2024 · 4 comments
Closed

VCF indexes without record counts lead to assertion #150

shz9 opened this issue Apr 28, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@shz9
Copy link

shz9 commented Apr 28, 2024

I'm using bio2zarr v0.0.6 and trying to explode 1000G genotype data (NOTE: not recent NYGC WGS data; but older genotype data from ~2013):

vcf2zarr explode data/genotypes/chr22.vcf.gz data/genotypes/chr22.icf -p8
    Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 14.1files/s]
 Explode: 1.10Mvars [02:45, 6.65kvars/s]
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1173, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1139, in finalise
    assert total_records == self.metadata.num_records
AssertionError

It seems there's a mismatch between the metadata and the number of records in each of the partitions? This error still comes up when I use a single worker -p1, so it's not due to multiprocessing. I printed total_records and self.metadata.num_records and got the following:

total_records: 1103547
metadata.num_records: 0

Any ideas what might be going on? It seems like it could be an issue in scan due to corrupted or outdated VCF format?

UPDATE:

This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .

@jeromekelleher
Copy link
Contributor

Great catch, thanks @shz9. The problem with old phase 1 files is that the ancient version of tabix here doesn't store the record counts in the index that we use. It works fine if you re-index.

I'll add an update that either checks for these missing bits of information in the index and raises and informative error, or just prints a warning and handles the fact we can't show progress.

@jeromekelleher
Copy link
Contributor

This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .

I've been working with these without problems - can you show the backtrace please?

@jeromekelleher jeromekelleher added the bug Something isn't working label Apr 29, 2024
@jeromekelleher jeromekelleher changed the title AssertionError when exploding 1000G genotype data VCF indexes without record counts lead to assertion Apr 29, 2024
@shz9
Copy link
Author

shz9 commented Apr 29, 2024

It's the same as above:

 Explode: 1.87Mvars [57:06, 547vars/s]  
total_records: 1872741
metadata.num_records: 0
0
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1176, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1142, in finalise
    assert total_records == self.metadata.num_records
AssertionError

The VCF files are here. I think this is probably the same issue you identified for the older 1000G data, because during conversion, the hts library was throwing the following warnings:

[W::hts_idx_load3] The index file is older than the data file: data/WGS/chr22.vcf.gz.tbi

And indeed, on the website linked to above, it seems that the .tbi files are older than the files they are indexing. So, I will try to re-index everything before proceeding, but as you said, it'd be good to detect those issues at the scan stage so that they can be corrected before exploding the files.

@jeromekelleher
Copy link
Contributor

Closed in #164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants