[WIP] Add single variant to existing sample #828


Draft: wants to merge 7 commits into main

alancleary
Member

@simonjwhite would like the ability to add variants one at a time to support liftover. This PR adds that functionality.

Currently the feature is only available via the add-variant command in the CLI, which accepts a single variant as JSON. For example:

tiledbvcf add-variant -u ./1000-genomes-dragen-v3.7.6-dataset --json '{"chrom":"chr1","pos":10,"ref":"A","alt":"T","sample":"HG03914","id":"rs123","qual":30.0,"info":{"DP":45,"AF":0.3},"format":{"GT":"0/1","DP":15,"GQ":25},"filter":["PASS"]}'
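
Hand-quoting the JSON payload on the command line is easy to get wrong, so a small script can assemble it instead. The following is a minimal sketch, not part of this PR: `build_add_variant_argv` is a hypothetical helper, and it only uses the `-u` and `--json` flags shown in the example above.

```python
import json
import shlex

# The variant from the example above, as a plain dictionary.
variant = {
    "chrom": "chr1",
    "pos": 10,
    "ref": "A",
    "alt": "T",
    "sample": "HG03914",
    "id": "rs123",
    "qual": 30.0,
    "info": {"DP": 45, "AF": 0.3},
    "format": {"GT": "0/1", "DP": 15, "GQ": 25},
    "filter": ["PASS"],
}


def build_add_variant_argv(dataset_uri, variant):
    """Return the argv list for add-variant (hypothetical helper)."""
    # Compact separators keep the payload on one line, as in the example.
    payload = json.dumps(variant, separators=(",", ":"))
    return ["tiledbvcf", "add-variant", "-u", dataset_uri, "--json", payload]


argv = build_add_variant_argv("./1000-genomes-dragen-v3.7.6-dataset", variant)
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` would avoid shell quoting entirely; the printed form is only for copying into a terminal.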

Notes:

  • Only one variant can be added at a time
  • The sample a variant is being added to must already exist in the dataset
  • This functionality is currently implemented as a wrapper for ingestion, i.e. it generates temporary .vcf.gz and .tbi files and then ingests them
    • The ingestion can be configured through the add-variant command via pass-through flags
  • Temporary files are written to /dev/shm/ for performance
    • This can be configured via the --tmp-dir flag
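
The temporary-file handling described in the notes can be sketched as follows. This is an illustration in Python rather than the PR's C++ code: `choose_tmp_root` and `variant_workspace` are hypothetical names, the `tmp_dir` argument stands in for `--tmp-dir`, and the fallback to the system temp directory when `/dev/shm` is unavailable is an assumption of the sketch, not documented behavior.

```python
import os
import tempfile
from contextlib import contextmanager


def choose_tmp_root(tmp_dir=None, preferred="/dev/shm"):
    """Pick the directory that will hold the temporary files.

    An explicit tmp_dir (standing in for --tmp-dir) wins. Otherwise the
    in-memory location is used when it is a writable directory. The final
    fallback to the system default is an assumption of this sketch.
    """
    if tmp_dir is not None:
        return tmp_dir
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return tempfile.gettempdir()


@contextmanager
def variant_workspace(tmp_dir=None):
    """Yield paths for the temporary VCF and its index, then clean up."""
    root = choose_tmp_root(tmp_dir)
    with tempfile.TemporaryDirectory(dir=root, prefix="add-variant-") as work:
        vcf_path = os.path.join(work, "variant.vcf.gz")
        # The index path follows the usual <file>.tbi naming.
        yield vcf_path, vcf_path + ".tbi"
```

Using a context manager means the scratch directory is removed even if ingestion raises, which matters when the files live in shared memory.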

I will add tests and proceed with the Python API after getting feedback on the current implementation.

The JSON library is by nlohmann and consists of a single json.hpp header file.
The function joins a vector of strings into a single string, using a delimiter string as the separator.
The function writes a VCF file for the given HTSlib header and records. It can write a .vcf or .vcf.gz file and can optionally generate a corresponding .tbi index file.
This method returns the sample id (row index) for the given sample name.
These are utilities for working with individual variants that aren't necessarily associated with a VCF file and haven't necessarily been ingested. Currently they consist of a Variant class that is instantiated from JSON representing a variant. The class currently supports converting the variant to an HTSlib record.
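
To make the JSON-to-record mapping concrete, here is a rough Python analogue of such a Variant class. It is a sketch under stated assumptions, not the PR's C++ implementation: it emits a plain-text VCF data line instead of an HTSlib record, it assumes the JSON keys match the example in the PR description, it treats `id`, `qual`, `info`, `format`, and `filter` as optional, and it copies `pos` through unchanged.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    sample: str
    id: str = "."
    qual: Optional[float] = None
    info: dict = field(default_factory=dict)
    format: dict = field(default_factory=dict)
    filter: list = field(default_factory=list)

    @classmethod
    def from_json(cls, text):
        """Build a Variant from a JSON object string."""
        return cls(**json.loads(text))

    def to_vcf_line(self):
        """Render the variant as one tab-separated VCF data line."""
        # Missing values are written as "." per VCF convention.
        info = ";".join(f"{k}={v}" for k, v in self.info.items()) or "."
        filt = ";".join(self.filter) or "."
        qual = "." if self.qual is None else f"{self.qual:g}"
        fields = [self.chrom, str(self.pos), self.id, self.ref, self.alt,
                  qual, filt, info]
        if self.format:
            # FORMAT keys, then this sample's values in the same order.
            fields.append(":".join(self.format))
            fields.append(":".join(str(v) for v in self.format.values()))
        return "\t".join(fields)


text = ('{"chrom":"chr1","pos":10,"ref":"A","alt":"T","sample":"HG03914",'
        '"id":"rs123","qual":30.0,"info":{"DP":45,"AF":0.3},'
        '"format":{"GT":"0/1","DP":15,"GQ":25},"filter":["PASS"]}')
line = Variant.from_json(text).to_vcf_line()
print(line)
```

A complete VCF would also need header lines declaring the contig, the INFO and FORMAT fields, and the sample column; those are omitted here.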
This allows a single variant (input as JSON) to be added to a sample already in the dataset. The current implementation simply generates temporary VCF and TBI files for the variant then ingests them.
This allows a single variant (input as JSON) to be added to a sample already in the dataset.
@alancleary added the enhancement (New feature or request) label on Aug 4, 2025
Contributor

@spencerseale left a comment

Are you adding the python implementation to this PR? Or are we holding off for now?

@alancleary
Member Author

> Are you adding the python implementation to this PR? Or are we holding off for now?

Holding off for now. We're considering making the implementation/API more transactional to reduce fragmentation.
