This Mutation Signature Analysis Automation Pipeline is designed to streamline the preparation, execution, and analysis of mutation signatures using the Palimpsest R package. It is tailored for cancer genomics research, automating tasks from data preprocessing to de novo mutation signature extraction and comparison with COSMIC (Catalogue Of Somatic Mutations In Cancer) signatures.
The pipeline processes mutation and clinical data, aligning and cleaning the inputs before performing a detailed analysis. It enables visualization and interpretation of mutation signatures with a focus on specific genes and cancer subtypes. By automating repetitive tasks, this pipeline accelerates research into the mutational processes underlying cancer.
- Processes mutation data to ensure compatibility with Palimpsest.
- Filters clinical data for specific cancer subtypes.
- Filters patients based on cancer subtype or specific gene mutations.
- Extracts mutation signatures using Non-Negative Matrix Factorization (NMF).
- Determines proportions of de novo signatures.
- Compares extracted signatures to known COSMIC SBS signatures.
- Visualizes similarity using cosine similarity heatmaps.
- Generates pie charts to display signature contributions.
- Produces signature content plots for enhanced interpretation.
- Summarizes mutations in DNA repair genes for selected patients.
- Generates detailed reports for individual patients, including:
  - Mutation counts.
  - Key gene mutations.
  - Signature contributions.
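The comparison step listed above scores each de novo signature against COSMIC signatures with cosine similarity. A minimal base-R sketch of that measure follows. The helper name `cosine_similarity` and the toy vectors are assumptions made for illustration; this is not the pipeline's or Palimpsest's own code.

```r
# Cosine similarity between two numeric vectors of equal length.
# Hypothetical helper for illustration; not part of Palimpsest.
cosine_similarity <- function(a, b) {
  stopifnot(is.numeric(a), is.numeric(b), length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 4-channel "signatures" (real SBS signatures have 96 channels).
denovo_sig <- c(0.10, 0.40, 0.30, 0.20)
cosmic_sig <- c(0.20, 0.80, 0.60, 0.40)  # same shape, different scale
other_sig  <- c(0.70, 0.10, 0.10, 0.10)  # different shape

sim_same  <- cosine_similarity(denovo_sig, cosmic_sig)
sim_other <- cosine_similarity(denovo_sig, other_sig)
```

Because the measure depends only on the shape of the profile and not on its scale, two proportional vectors score as identical, which is why it suits comparing signatures extracted from cohorts of different sizes.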
- R version 4.0 or higher
- Palimpsest
- BSgenome.Hsapiens.UCSC.hg19 (or hg38 if applicable)
- `dplyr`, `readr`, `purrr`, `ggplot2`, `tidyr`
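Before running the pipeline, it can help to confirm that the requirements above are met. The sketch below is one possible check using only base R; the variable names are illustrative and the check is not part of the pipeline script.

```r
# Packages named in the requirements list above.
required_pkgs <- c("Palimpsest", "BSgenome.Hsapiens.UCSC.hg19",
                   "dplyr", "readr", "purrr", "ggplot2", "tidyr")

# The requirements ask for R 4.0 or higher.
r_version_ok <- getRversion() >= "4.0.0"

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
is_installed <- vapply(required_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If you work with hg38, replace the hg19 entry with `BSgenome.Hsapiens.UCSC.hg38`.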
- Mutation Data: CSV/TSV file with columns such as `Hugo_Symbol`, `Variant_Classification`, `Variant_Type`, and `PATIENT_ID` (if `PATIENT_ID` is not present, add it during the data cleaning step).
- Clinical Data: CSV/TSV file with columns such as `PATIENT_ID` and `SUBTYPE`.
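A quick way to confirm that both inputs carry the columns named above is to compare the expected names against the column names of each table. The following is a sketch only: `missing_columns` is a hypothetical helper, and the small data frames stand in for the real CSV/TSV files.

```r
# Hypothetical helper: list required columns that are absent from a data frame.
missing_columns <- function(df, required) {
  setdiff(required, colnames(df))
}

mutation_cols <- c("Hugo_Symbol", "Variant_Classification",
                   "Variant_Type", "PATIENT_ID")
clinical_cols <- c("PATIENT_ID", "SUBTYPE")

# Toy inputs standing in for the real files (values are made up).
mutations <- data.frame(
  Hugo_Symbol            = c("TP53", "BRCA1"),
  Variant_Classification = c("Missense_Mutation", "Nonsense_Mutation"),
  Variant_Type           = c("SNP", "SNP")
)
clinical <- data.frame(PATIENT_ID = c("P1", "P2"),
                       SUBTYPE    = c("TNBC", "TNBC"))

# PATIENT_ID is deliberately absent from the toy mutation table.
missing_mut  <- missing_columns(mutations, mutation_cols)
missing_clin <- missing_columns(clinical, clinical_cols)
```

A non-empty result for the mutation table signals that `PATIENT_ID` still has to be added during data cleaning.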
git clone https://github.com/RoySoumik23/AutoMutSig_Pipeline.git
cd AutoMutSig_Pipeline
Update the following variables in the script:
- `print_statement`: Message describing the unique patient count.
- `gene_name`: Name or pattern of the target gene (e.g., `"ALKBH"`).
- `subtype`: Cancer subtype for analysis (e.g., `"TNBC"`).
- `mut_location`: File path to the mutation data.
- `clinical_location`: File path to the clinical data.
- `num_of_denovo_sings`: Number of de novo signatures to extract.
- `parent_subfile_name`: Name of the subfile for this analysis.
- `Parent_file_name`: Location of the parent file.
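As an illustration, the variables above might be set as follows. Every value is a placeholder chosen for this example (the paths, the message text, and the signature count in particular); substitute your own. The last two lines assume that a gene pattern is matched with a regular expression, which may differ from how the script actually uses `gene_name`.

```r
# Placeholder configuration; all values below are hypothetical examples.
print_statement     <- "Unique patients with ALKBH mutations in TNBC:"
gene_name           <- "ALKBH"                          # name or pattern
subtype             <- "TNBC"
mut_location        <- "/path/to/mutation_data.tsv"     # placeholder path
clinical_location   <- "/path/to/clinical_data.tsv"     # placeholder path
num_of_denovo_sings <- 3                                # example value only
parent_subfile_name <- "ALKBH_TNBC"
Parent_file_name    <- "/path/to/results"               # placeholder path

# Assumption: a pattern such as "ALKBH" is meant to match a gene family.
example_genes <- c("ALKBH1", "ALKBH3", "TP53")
matches <- grepl(gene_name, example_genes)
```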
Run the script in R:
source("palimpsest_pipeline.R")
- Summary File: `details.txt`, summarizing patient counts at each stage.
- VCF Files: Mutation data formatted for Palimpsest analysis.
- Comparison Table: Cosine similarity scores between de novo and COSMIC signatures.
- Visualizations:
  - Heatmaps of cosine similarities.
  - Pie charts showing SBS signature contributions (`all_known_signature_piecharts.pdf`).
- Final Merged Table: Contains:
  - Patient IDs.
  - SBS signatures (de novo and known COSMIC).
  - Correlation and gene mutation types.
- De Novo SBS Proportions: `denovo_SBS_proportion.csv`, detailing the percentage contribution of each of the 96 SBS conversions.
- Ensure that column names in mutation and clinical data files match the script’s requirements.
- Modify filtering logic for datasets with unique requirements (e.g., additional subtypes or clinical conditions).
- Compatibility with hg38:
  - If using `BSgenome.Hsapiens.UCSC.hg38`, ensure the mutation data aligns with genome boundaries.
- Pie Chart Colors:
  - Colors in pie charts may not render as intended. Adjust the color mapping manually if needed.
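One way to adjust the color mapping manually is to keep a named palette and look colors up by signature name, so each signature keeps the same color in every chart. The sketch below uses base R graphics rather than the pipeline's own plotting code; the signature names, colors, and contributions are made-up examples.

```r
# Hypothetical palette: one fixed colour per signature name.
signature_cols <- c(SBS1 = "#1b9e77", SBS3 = "#d95f02", SBS13 = "#7570b3")

# Toy contributions for one patient (proportions summing to 1).
contrib <- c(SBS3 = 0.5, SBS1 = 0.3, SBS13 = 0.2)

# Index the palette by name so colours follow the signatures,
# not the order in which the contributions happen to appear.
slice_cols <- unname(signature_cols[names(contrib)])

pdf(NULL)  # discard the plot here; use pdf("some_file.pdf") to save it
pie(contrib, labels = names(contrib), col = slice_cols)
invisible(dev.off())
```

If the charts are drawn with ggplot2, the same named vector can be supplied through `scale_fill_manual(values = signature_cols)`.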
- A Microsoft Power BI dashboard for the Palimpsest analysis of ALKBH gene mutations in Triple Negative Breast Cancer (TNBC). The dashboard is not produced by the pipeline and needs to be built separately.

- Data Format Requirements:
  - Mutation data must include `Hugo_Symbol`, `Variant_Classification`, `Variant_Type`, and `PATIENT_ID`.
  - Clinical data must include `PATIENT_ID` and `SUBTYPE`.
- File Paths:
  - Use absolute file paths for `mut_location` and `clinical_location` to avoid path-related issues.
- Handling Missing Data:
  - Clean datasets containing `NA` values using `na.omit()` or similar methods.
- Runtime Considerations:
  - Use a machine with at least 8 GB of RAM for optimal performance.
- Testing with Sample Data:
  - Validate the workflow using a small, curated dataset before running it on large datasets.
- Version Control:
  - Track changes to the pipeline using Git for reproducibility.
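For the missing-data recommendation above, the following sketch contrasts `na.omit()` with a more selective alternative based on `complete.cases()`. The toy table and the choice of `PATIENT_ID` as the key column are assumptions for illustration.

```r
# Toy mutation table with missing values (made-up data).
mutations <- data.frame(
  PATIENT_ID  = c("P1", "P2", NA, "P4"),
  Hugo_Symbol = c("TP53", NA, "BRCA1", "ALKBH3"),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that contains at least one NA.
clean_all <- na.omit(mutations)

# Alternative: drop a row only when a key column is missing,
# keeping rows whose other fields are incomplete.
key_cols  <- c("PATIENT_ID")
keep_rows <- complete.cases(mutations[, key_cols, drop = FALSE])
clean_key <- mutations[keep_rows, ]
```

`na.omit()` is the stricter choice: on wide tables with many optional columns it can discard far more rows than intended, so it is worth comparing row counts before and after cleaning.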
- Automating input file validation.
- Adding advanced visualization options.
- Enhancing compatibility with hg38.
Contributions are welcome! Please fork the repository, make changes, and submit a pull request. Bug reports and feature suggestions are encouraged.
Special thanks to the developers of:
- The Palimpsest R package for mutation signature analysis.
- The BSgenome project for genomic reference data integration.