CSI-Cancer/MM

Multiple Myeloma (MM) Sample Collection Analysis Pipeline

This repository contains data analysis and visualization tools for a Multiple Myeloma research study focusing on sample collection timelines and patient diagnosis patterns.

Project Overview

The project analyzes sample collection data from patients with Multiple Myeloma, its precursor conditions, and related plasma cell and lymphoplasmacytic disorders. The diagnosis groups are:

  • MGUS (Monoclonal Gammopathy of Undetermined Significance)
  • SMM (Smoldering Multiple Myeloma)
  • NDMM (Newly Diagnosed Multiple Myeloma)
  • RRMM (Relapsed/Refractory Multiple Myeloma)
  • PCL (Plasma Cell Leukemia)
  • LPL (Lymphoplasmacytic Lymphoma)
  • LPL (Smoldering) (Smoldering Lymphoplasmacytic Lymphoma)

The analysis focuses on temporal patterns of sample collection from blood and bone marrow (BM) specimens over patient follow-up periods.

Repository Structure

MM/
├── data/
│   └── raw/
│       ├── paired_mm.xlsx          # Paired sample data
│       ├── tube_table.csv          # Tube collection metadata
│       ├── unpaired_mm.xlsx        # Unpaired sample data
│       └── unpaired_tube_table.csv # Unpaired tube metadata
├── notebooks/
│   ├── 1.0-rmn-understanding-slides.ipynb  # Main analysis notebook
│   └── plots/                      # Generated visualization outputs
│       ├── *.png                   # High-resolution plot images
│       └── *.pdf                   # Publication-ready PDFs
├── plots/                          # Additional plot outputs
├── requirements.txt                # Python dependencies
└── README.md                       # This file

Data Description

Sample Types

  • Blood samples: Represented by red triangles (▲) in visualizations
  • Bone Marrow (BM) samples: Represented by light blue circles (●) in visualizations

Key Datasets

  1. Paired Samples: 95 patients, 250 total samples, max follow-up 52.6 months
  2. Unpaired Samples: 80 patients, 171 total samples, max follow-up 57.5 months
  3. Combined Dataset: 105 patients, 421 total samples, max follow-up 67.5 months
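
The sample totals add up (250 + 171 = 421), but the patient totals do not (95 + 80 = 175 versus 105). If the combined dataset is the union of the two tables, this implies that 70 patients appear in both. A minimal sketch of that kind of merge with pandas follows; the column names `patient_id` and `sample_type` and the toy rows are assumptions for illustration, not the actual schema of the raw files.

```python
import pandas as pd

# Toy stand-ins for the paired and unpaired tables (hypothetical schema).
paired = pd.DataFrame({
    "patient_id": ["MM02-0001", "MM02-0001", "MM02-0002"],
    "sample_type": ["Blood", "BM", "Blood"],
})
unpaired = pd.DataFrame({
    "patient_id": ["MM02-0002", "MM02-0003"],
    "sample_type": ["BM", "Blood"],
})

# Tag each row with its origin, then stack the two tables.
combined = pd.concat(
    [paired.assign(source="paired"), unpaired.assign(source="unpaired")],
    ignore_index=True,
)

n_samples = len(combined)                      # samples simply add up
n_patients = combined["patient_id"].nunique()  # patients are de-duplicated
n_shared = len(set(paired["patient_id"]) & set(unpaired["patient_id"]))

print(n_samples, n_patients, n_shared)
```

Counting patients with `nunique()` rather than summing per-table counts avoids double-counting anyone who contributed to both tables.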

Patient Distribution by Diagnosis

  • MGUS: 23 patients (follow-up: 0.0 - 49.1 months)
  • SMM: 29 patients (follow-up: 0.0 - 64.5 months)
  • NDMM: 21 patients (follow-up: 0.0 - 67.5 months)
  • RRMM: 22 patients (follow-up: 0.0 - 64.6 months)
  • PCL: 2 patients (follow-up: 0.0 - 15.9 months)
  • LPL: 4 patients (follow-up: 0.0 - 34.7 months)
  • LPL (Smoldering): 4 patients (follow-up: 0.0 - 18.1 months)

Generated Visualizations

The pipeline produces several key visualizations:

1. Swimmer Charts (Timeline Plots)

  • Paired samples: swimmer_chart_paired_normalized_ordered.png/pdf
  • Unpaired samples: swimmer_chart_unpaired_normalized_ordered.png/pdf
  • Combined data: swimmer_chart_combined_normalized_ordered.png/pdf

Features:

  • Patients ordered by diagnosis (MGUS → LPL (Smoldering), in the order listed under Project Overview), then by follow-up duration
  • Individual patient baseline normalization
  • Color-coded background bars by diagnosis
  • Sample type markers (blood vs. bone marrow)

2. Stacked Area Plot

  • File: stacked_area_tubes_by_diagnosis.png/pdf
  • Shows tube collection frequency over time by diagnosis
  • 2-month time bins for temporal analysis
  • Diagnosis-specific color coding
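
The 2-month binning behind this plot can be sketched as below. The `tubes` table and its `diagnosis` and `months` columns are assumptions for illustration; the notebook may bin differently (for example with `pd.cut`).

```python
import pandas as pd

# Hypothetical tube table: one row per tube, time in months from baseline.
tubes = pd.DataFrame({
    "diagnosis": ["MGUS", "MGUS", "SMM", "SMM", "SMM", "NDMM"],
    "months": [0.0, 1.5, 0.5, 2.0, 3.9, 2.1],
})

BIN_WIDTH = 2  # months

# Left edge of each tube's bin: [0, 2) -> 0, [2, 4) -> 2, and so on.
tubes["bin_start"] = (tubes["months"] // BIN_WIDTH * BIN_WIDTH).astype(int)

# Rows = time bins, columns = diagnoses, values = tube counts.
counts = (
    tubes.groupby(["bin_start", "diagnosis"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
print(counts)
# counts.plot.area(stacked=True) would draw this table as a stacked area chart.
```

Filling empty bin/diagnosis combinations with 0 keeps every diagnosis present in every bin, which a stacked area plot needs to draw continuous bands.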

Installation & Setup

Prerequisites

  • Python 3.7+
  • Jupyter Notebook/Lab

Dependencies

Install required packages:

pip install -r requirements.txt

Required packages:

  • pandas - Data manipulation and analysis
  • numpy - Numerical computing
  • matplotlib - Basic plotting
  • seaborn - Statistical visualization
  • plotly - Interactive plots
  • scikit-learn - Machine learning utilities
  • scipy - Scientific computing
  • openpyxl - Excel file handling

Usage

Running the Analysis

  1. Start Jupyter:

    jupyter notebook

  2. Open the main notebook: notebooks/1.0-rmn-understanding-slides.ipynb

  3. Execute cells sequentially to:

    • Load and clean the data
    • Normalize patient timelines
    • Generate swimmer charts
    • Create stacked area plots
    • Export high-quality visualizations

Key Analysis Steps

  1. Data Loading: Import Excel and CSV files
  2. Data Cleaning: Standardize diagnosis labels and sample types
  3. Temporal Normalization: Calculate months from individual patient baselines
  4. Patient Ordering: Sort by diagnosis priority and follow-up duration
  5. Visualization Generation: Create swimmer charts and area plots
  6. Export: Save plots in PNG and PDF formats
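
Steps 2 to 4 could look roughly like the following sketch. Everything specific here is an assumption: the column names, the use of each patient's earliest sample as the baseline, the 30.44 days-per-month conversion, and sorting longer follow-up first within a diagnosis. The notebook is the reference for the actual choices.

```python
import pandas as pd

# Hypothetical raw rows; real column names in the Excel/CSV files may differ.
raw = pd.DataFrame({
    "patient_id": ["MM02-0001", "MM02-0001", "MM02-0002", "MM02-0002", "MM02-0003"],
    "diagnosis": [" mgus", "MGUS", "smm ", "SMM", "MGUS"],
    "collection_date": ["2021-01-15", "2021-07-15", "2020-03-01",
                        "2022-03-01", "2021-05-01"],
})

DIAGNOSIS_ORDER = ["MGUS", "SMM", "NDMM", "RRMM", "PCL", "LPL", "LPL (Smoldering)"]
DAYS_PER_MONTH = 30.44  # assumed average month length

# Step 2: map messy labels onto the canonical spellings (unknown labels -> NaN).
canonical = {label.lower(): label for label in DIAGNOSIS_ORDER}
df = raw.copy()
df["diagnosis"] = df["diagnosis"].str.strip().str.lower().map(canonical)
df["collection_date"] = pd.to_datetime(df["collection_date"])

# Step 3: months since each patient's own earliest sample (assumed baseline).
baseline = df.groupby("patient_id")["collection_date"].transform("min")
df["months"] = (df["collection_date"] - baseline).dt.days / DAYS_PER_MONTH
df["follow_up"] = df.groupby("patient_id")["months"].transform("max")

# Step 4: order patients by diagnosis priority, then by follow-up duration.
df["diagnosis"] = pd.Categorical(df["diagnosis"],
                                 categories=DIAGNOSIS_ORDER, ordered=True)
patient_order = (
    df.drop_duplicates("patient_id")
    .sort_values(["diagnosis", "follow_up"], ascending=[True, False])["patient_id"]
    .tolist()
)
print(patient_order)
```

An ordered categorical makes the sort follow the clinical sequence in `DIAGNOSIS_ORDER` rather than alphabetical order.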

Key Features

Data Processing

  • Automatic data cleaning and standardization
  • Individual patient baseline calculation
  • Diagnosis-based patient ordering
  • Sample type categorization

Visualizations

  • Swimmer Charts: Individual patient timelines with sample collection points
  • Stacked Area Plots: Population-level collection patterns over time
  • Color Coding: Diagnosis-specific visual encoding
  • Multiple Formats: High-resolution PNG and publication-ready PDF outputs

Quality Control

  • Data validation and conflict resolution
  • Missing data handling
  • Comprehensive summary statistics
  • Detailed breakdown tables
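
As an illustration of the kind of checks listed above, the sketch below flags patients recorded under more than one diagnosis and rows with missing values. The schema and the resolution rule (keep the most recent label) are assumptions; a changing label can also reflect real disease progression, so the rule used in the notebook may differ.

```python
import pandas as pd

# Hypothetical sample table with one conflict and one missing diagnosis.
samples = pd.DataFrame({
    "patient_id": ["MM02-0001", "MM02-0001", "MM02-0002", "MM02-0003"],
    "diagnosis": ["SMM", "NDMM", "MGUS", None],
    "collection_date": ["2021-01-15", "2021-09-01", "2020-03-01", "2021-05-01"],
})

# Patients recorded with more than one distinct diagnosis label.
n_labels = samples.groupby("patient_id")["diagnosis"].nunique()
conflicting = sorted(n_labels[n_labels > 1].index)

# Rows missing a value that downstream steps need.
missing = samples[samples[["diagnosis", "collection_date"]].isna().any(axis=1)]

# One possible resolution rule (an assumption): keep the latest label per patient.
resolved = (
    samples.dropna(subset=["diagnosis"])
    .assign(collection_date=lambda d: pd.to_datetime(d["collection_date"]))
    .sort_values("collection_date")
    .groupby("patient_id")["diagnosis"]
    .last()
)
print(conflicting, len(missing), resolved.to_dict())
```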

Output Files

All visualizations are saved in both PNG (300 DPI) and PDF formats:

  • plots/swimmer_chart_*_normalized_ordered.{png,pdf}
  • plots/stacked_area_tubes_by_diagnosis.{png,pdf}
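
Saving one figure in both formats can be wrapped in a small helper such as the one below. `save_figure` is a hypothetical name, not a function from the notebook; the demo writes to a temporary directory rather than `plots/`.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def save_figure(fig, stem, out_dir="plots", dpi=300):
    """Write fig as <stem>.png (at the given DPI) and <stem>.pdf in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for ext in ("png", "pdf"):
        path = out_dir / f"{stem}.{ext}"
        # dpi matters for the raster PNG; the PDF stays vector.
        fig.savefig(path, dpi=dpi, bbox_inches="tight")
        paths.append(path)
    return paths


fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
demo_dir = tempfile.mkdtemp()
saved = save_figure(fig, "example_plot", out_dir=demo_dir)
print([p.name for p in saved])
```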

Analysis Insights

Peak Collection Periods

Counts are per 2-month bin of time since patient baseline, matching the bins of the stacked area plot. The five busiest bins are:

  1. Months 26-28: 36 tubes (highest)
  2. Months 10-12: 30 tubes
  3. Months 22-24: 28 tubes
  4. Months 20-22: 25 tubes
  5. Months 28-30: 25 tubes

Sample Distribution

  • SMM: 135 tubes (32% of total)
  • NDMM: 104 tubes (25% of total)
  • RRMM: 86 tubes (20% of total)
  • MGUS: 67 tubes (16% of total)
  • LPL: 12 tubes (3% of total)
  • LPL (Smoldering): 11 tubes (3% of total)
  • PCL: 6 tubes (1% of total)
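
The percentages above follow directly from the tube counts, as this short check shows. The counts are taken from the list; only the variable names are introduced here.

```python
# Tube counts per diagnosis, copied from the list above.
tube_counts = {
    "SMM": 135, "NDMM": 104, "RRMM": 86, "MGUS": 67,
    "LPL": 12, "LPL (Smoldering)": 11, "PCL": 6,
}

total = sum(tube_counts.values())

# Share of all tubes, rounded to whole percent.
shares = {dx: round(100 * n / total) for dx, n in tube_counts.items()}

for dx, pct in shares.items():
    print(f"{dx}: {tube_counts[dx]} tubes ({pct}% of {total})")
```

The total of 421 tubes matches the combined sample count under Key Datasets.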

Contributing

This analysis pipeline is part of the HDSCA (High-Definition Single Cell Assay) project. For contributions or questions, follow the project's contribution guidelines.

Notes

  • All sample times are expressed in months from each patient's own baseline
  • Follow-up durations vary widely by diagnosis, from at most 15.9 months (PCL) to at most 67.5 months (NDMM)
  • Visualizations are optimized for both digital viewing and print publication
  • Data privacy: All patient identifiers are anonymized (MM02-XXXX format)

Last updated: September 2025
