WeavePop is a Snakemake workflow that maps short sequencing reads of multiple haploid eukaryotic genomes to selected reference genomes and analyzes the genomic variants between them. The core analysis is done by Snippy, which produces alignment files (BAM), variant call files (VCF), and reference-based assemblies. From these results, WeavePop analyzes the mapping quality and depth, annotates the assemblies with Liftoff using the corresponding reference genome annotation, extracts the DNA and amino acid sequences of all transcripts using AGAT, annotates the effects of the small variants with SnpEff, calls copy-number variants (CNVs), generates a variety of diagnostic plots, and integrates all the results into an SQL database. The database allows users to explore the results using WeavePop-Shiny, an interactive web app, or the command-line interface WeavePop-CLI.
Check the Wiki for detailed information on how to use WeavePop.
Overview
Installation
Input files
Configuration
Testing
Execution
Output
WeavePop-Shiny
Requires a Linux operating system.
Install Mamba or Miniconda following the instructions on their webpages:
- Mamba: https://github.com/conda-forge/miniforge (recommended)
- Miniconda: https://docs.anaconda.com/miniconda/
After successfully installing conda, add the necessary channels and set strict channel priority by running:
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --set channel_priority strict
Download or clone this GitHub repository. To download, use the green button `<> Code` and click `Download ZIP`, then extract the `.zip` file to a directory called `WeavePop/`.
In your terminal, go to the directory you downloaded and run:
mamba env create --file workflow/envs/snakemake.yaml # use conda instead of mamba if you installed Miniconda
The environments for particular software used by the pipeline will be installed by Snakemake when you run it, so you don't need to install them. The programs in each environment are described in the table below.
Software in the environments used in the pipeline
| Environment file | Software |
|---|---|
| `workflow/envs/snakemake.yaml` | Snakemake, Python, Pandas |
| `workflow/envs/snakemake-apptainer.yaml` | Snakemake, Python, Pandas, Apptainer |
| `workflow/envs/snippy.yaml` | Snippy, Samtools |
| `workflow/envs/liftoff.yaml` | Liftoff, Minimap2 |
| `workflow/envs/agat.yaml` | AGAT, Seqkit |
| `workflow/envs/samtools.yaml` | Samtools, Bedtools, Bcftools, Xonsh, Pandas, Click, SciPy, NumPy |
| `workflow/envs/depth.yaml` | Mosdepth |
| `workflow/envs/repeatmasker.yaml` | RepeatMasker, RepeatModeler, Bedtools, Seqkit |
| `workflow/envs/r.yaml` | R, tidyverse, svglite, scales, RColorBrewer |
| `workflow/envs/variants.yaml` | SnpEff, DuckDB, PyVCF, Xonsh, Pandas, Click, Biopython, Bedtools, Bcftools |
| `workflow/envs/pandas.yaml` | Pandas |
| `workflow/envs/shell.yaml` | Coreutils |
To see a full description of the input files and their format, go to the Input Wiki.
- FASTQ files: Paired-end short-read FASTQ files for all samples in the same directory.
- Reference genomes: FASTA and GFF files for each lineage, or a FASTA file for each lineage plus the FASTA and GFF files of a main reference that is used to annotate the other references.
- `metadata.csv`: A comma-separated table with one sample per row, with the columns `sample`, `lineage`, and `strain`. Example.
- `chromosomes.csv`: A comma-separated table with one row per chromosome per lineage, with the columns `lineage`, `accession`, and `chromosome`. Example.
- `RepBase.fasta`: Database of repetitive sequences in FASTA format to use for RepeatMasker. We recommend the RepBase database. You need to download it, extract the files, and concatenate them all into one FASTA file. The database is needed if the CNV, plotting, or database modules are activated. If you don't provide a database, you can choose to run the workflow with a fake database, which will identify repetitive sequences inaccurately.
- `loci.csv`: If you want genetic features to be plotted in the depth and MAPQ plots, provide a comma-separated table with one row per gene, with the columns `gene_id` and `feature`. A maximum of 8 features is allowed. Example.
- `exclude.txt`: If you want to exclude some of the samples in your metadata file from all analyses, you can provide a file with a list of the sample IDs to exclude.
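Before launching a run, it can help to confirm that the two required tables have the columns listed above. The following is a minimal sketch, not part of WeavePop: the function names are hypothetical, the sample data in the demo is made up, and only the column names come from the descriptions above.

```python
import csv
import io

# Column names as described in the input list above.
REQUIRED_COLUMNS = {
    "metadata.csv": ["sample", "lineage", "strain"],
    "chromosomes.csv": ["lineage", "accession", "chromosome"],
}


def missing_columns(handle, required):
    """Return the required column names that are absent from the CSV header."""
    header = csv.DictReader(handle).fieldnames or []
    return [name for name in required if name not in header]


def lineages_without_chromosomes(metadata_handle, chromosomes_handle):
    """Return lineages used in the metadata table that have no chromosome rows."""
    metadata_lineages = {row["lineage"] for row in csv.DictReader(metadata_handle)}
    chromosome_lineages = {row["lineage"] for row in csv.DictReader(chromosomes_handle)}
    return sorted(metadata_lineages - chromosome_lineages)


if __name__ == "__main__":
    # Made-up example tables, for illustration only.
    metadata = "sample,lineage,strain\nsample_1,lineage_A,strain_1\n"
    chromosomes = "lineage,accession,chromosome\nlineage_A,acc_1,chr1\n"
    print(missing_columns(io.StringIO(metadata), REQUIRED_COLUMNS["metadata.csv"]))
    print(lineages_without_chromosomes(io.StringIO(metadata), io.StringIO(chromosomes)))
```

With real files, pass `open("metadata.csv", newline="")` instead of the `io.StringIO` objects. The full format requirements are in the Input Wiki; this check covers only the column names and the lineage cross-reference.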
To execute the workflow, you need to edit the configuration file located in `config/config.yaml` to:
- Select the workflow to run: The `analysis` workflow will run the analysis for one dataset. If you have the complete results (database module activated) of the `analysis` workflow for multiple datasets, you can use the `join_datasets` workflow to create a database with all of them.
- Provide the paths to the input files and project directory. The working directory should be `WeavePop/`, which contains `config/` and `workflow/`.
- Activate modules: When running the `analysis` workflow, you can select which of its modules to activate. Activating the `database` module automatically activates the modules `cnv`, `genes_mapq_depth`, and `snpeff`.
- Specify parameters. The output description in the Output Wiki explains which files are created by each module.
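The module activation rule described above can be pictured with a short sketch. This is an illustration of the rule only, not WeavePop's implementation: the `IMPLIED_MODULES` mapping and the `resolve_modules` function are hypothetical, and only the module names come from the text above.

```python
# Hypothetical illustration: modules that are switched on automatically
# when another module is activated (taken from the description above).
IMPLIED_MODULES = {
    "database": {"cnv", "genes_mapq_depth", "snpeff"},
}


def resolve_modules(requested):
    """Return the sorted set of modules that end up active for a request."""
    active = set(requested)
    for module in requested:
        # Add the modules implied by each requested module, if any.
        active |= IMPLIED_MODULES.get(module, set())
    return sorted(active)


if __name__ == "__main__":
    print(resolve_modules(["database"]))
    print(resolve_modules(["snpeff"]))
```

In practice this means that requesting only `database` still runs the CNV, MAPQ/depth, and SnpEff steps, so their input requirements (for example the repetitive sequence database) apply as well.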
To see a full description of the configuration, go to the Configuration Wiki.
cd /<path-to>/WeavePop/
conda activate snakemake
snakemake --profile test/config/default
See more details in the Testing Wiki.
- In a terminal, go to `WeavePop/` to use as your working directory.
- Activate the Snakemake environment: `conda activate snakemake`.
- Specify the command-line parameters in the execution profile `config/default/config.yaml`.
- Run: `snakemake --profile config/default`
Learn more about the execution options in the Wiki pages Basic usage, Execution profiles and Working with multiple projects and runs.
The output will be generated in the `results/` directory by default. Check the Wiki page Working with multiple projects and runs for more information.
Here is a list of the most relevant outputs. To see the full list and know which module produces each file, go to the Output Wiki.
Output files
| File | Description |
|---|---|
| `01.Samples/snippy/{sample}/snps.bam` | BAM file of the alignment between the short reads of the sample and the corresponding reference genome. |
| `01.Samples/snippy/{sample}/snps.consensus.fa` | FASTA file of the reference genome with all variants instantiated. |
| `01.Samples/snippy/{sample}/snps.vcf` | Called variants in VCF format. Positions are 1-based. |
| `01.Samples/annotation/{sample}/annotation.gff` | Standardized GFF file of the annotation by Liftoff. Positions are 1-based. |
| `01.Samples/annotation/{sample}/cds.fa` | Nucleotide sequences of all transcripts of the sample. |
| `01.Samples/annotation/{sample}/proteins.fa` | Protein sequences of all isoforms of the sample. |
| `01.Samples/plots/{sample}/depth_by_windows.png` | Plot of normalized depth of windows along each chromosome, with specified genetic features, called CNVs, and repetitive sequences of the corresponding reference. |
| `02.Dataset/plots/dataset_depth_by_chrom.png` | Normalized mean depth of each chromosome in the samples that survived the quality filter. |
| `02.Dataset/plots/dataset_summary.png` | Genome-wide depth and mapping quality metrics of the samples that survived the quality filter. |
| `02.Dataset/depth_quality/mapq_depth_by_feature.tsv` | MAPQ and mean depth of each feature in all the samples. |
| `02.Dataset/cnv/cnv_calls.tsv` | Table of deleted and duplicated regions in all samples and their overlap with repetitive sequences. Positions are 1-based. |
| `02.Dataset/snpeff/effects.tsv` | Table with the effects of the possible variants in all lineages. |
| `02.Dataset/snpeff/presence.tsv` | Table with the variant IDs of all lineages and the samples they are present in. |
| `02.Dataset/snpeff/variants.tsv` | Table with the description of all variants of all lineages. Positions are 1-based. |
| `02.Dataset/database.db` | SQL database with the main results. |
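After a run, one quick sanity check is to confirm that the per-sample files listed above exist for a given sample. The sketch below is not part of WeavePop; the function name is hypothetical and the path templates are copied from the table. Which files are actually produced depends on the modules you activated, so a missing file is not necessarily an error.

```python
from pathlib import Path

# Per-sample output paths, relative to the results directory (from the table above).
SAMPLE_OUTPUTS = [
    "01.Samples/snippy/{sample}/snps.bam",
    "01.Samples/snippy/{sample}/snps.consensus.fa",
    "01.Samples/snippy/{sample}/snps.vcf",
    "01.Samples/annotation/{sample}/annotation.gff",
    "01.Samples/annotation/{sample}/cds.fa",
    "01.Samples/annotation/{sample}/proteins.fa",
    "01.Samples/plots/{sample}/depth_by_windows.png",
]


def missing_sample_outputs(results_dir, sample):
    """Return the listed per-sample outputs that do not exist under results_dir."""
    results = Path(results_dir)
    missing = []
    for template in SAMPLE_OUTPUTS:
        relative_path = template.format(sample=sample)
        if not (results / relative_path).exists():
            missing.append(relative_path)
    return missing


if __name__ == "__main__":
    # "results" is the default output directory; "sample_1" is a placeholder ID.
    for path in missing_sample_outputs("results", "sample_1"):
        print("missing:", path)
```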
WeavePop-Shiny is an interactive web app that allows you to query the database generated by WeavePop. It is useful to explore the results of the analysis. To use the command-line interface instead, see WeavePop-CLI Wiki.
Install the environment with
mamba env create --file query_database/shiny.yaml
Go to the `02.Dataset` directory, where the `database.db` file is located.
Activate the environment.
Run the Shiny App.
cd /<path-to>/WeavePop/<my-project>/results/02.Dataset
conda activate shiny
shiny run /<path-to>/WeavePop/query_database/app.py
Use the app in a browser: Copy the link that appears in the output (e.g. http://127.0.0.1:8000) and paste it into your web browser. Don't close the terminal while you are using the app.
These steps assume that you are using a local machine. If WeavePop and your results are on a remote machine, you can either download `query_database/` and the `results/02.Dataset/database.db` file and install and run WeavePop-Shiny locally, or use VSCode with the Remote extension and the Shiny extension to run WeavePop-Shiny remotely.