Skip to content

stefanpeidli/scperturb_snakemake_example

Repository files navigation

An example snakemake pipeline for downloading and using datasets from scperturb for method development

This repository demonstrates how to use snakemake to download and run methods on the datasets in scperturb.

Example:

Let's say we are developing a new method for single-cell perturbation analysis. We want to evaluate our method on some of the datasets in scperturb. For the sake of this example, we call our method "your_method". Let us realistically assume that your_method is horribly slow. We want to run your_method on the datasets in scperturb, ideally in parallel. This is where snakemake comes in. Snakemake is a workflow management system that allows you to define a pipeline in a human-readable way and automatically runs code in parallel jobs.

In this example, your_method is a simple script that reads in a dataset, pseudo-bulks each perturbation and saves the results. You can just replace the script with your actual method. After installing snakemake and configuring the example to your needs, we run the example pipeline with snakemake --cores 8 --use-conda. This will download the datasets and run your_method on each of them in parallel. The results are saved in the OUT_DIR directory specified in the config.yaml file, and downloaded datasets are saved in the DATA_DIR directory. Snakemake is intelligent enough to not re-download or re-run your_method if the results are already there (except you force it to).

DAG of the example snakemake workflow

Installation

  • If you don't have conda installed, you can install it by following the instructions here. I recommend using Mamba instead of conda as it is much faster.
  • Install snakemake with conda install pandas numpy pyyaml conda-forge::openpyxl "bioconda::snakemake==7.32.4"
  • Clone this repository with git clone URL_OF_THIS_REPOSITORY
  • Configure the example to your needs by editing the config.yaml file:
    • You can change the datasets variable to include the datasets you want to download. The datasets are specified by their names in scperturb. You can find the names of the datasets in the scperturb repository. Be aware that if you put "all" in there, it will download over 40 datasets with a total size of over 27GB. However, these files are compressed, so actually reading them in will take up much more space. I recommend starting with a few datasets first.
    • You should change the directories in the config.yaml file to match your system.

Usage

  • You can run the pipeline on your local machine e.g. via snakemake --cores 8 --use-conda
  • If you want to run the pipeline on SLURM, please configure snakemake correctly. See here.

Useful Links

About

Example repository for analyzing many datasets at once from scperturb.org using snakemake

Topics

Resources

Stars

Watchers

Forks