Multiple Myeloma (MM) Sample Collection Analysis Pipeline

This repository contains data analysis and visualization tools for a Multiple Myeloma research study focusing on sample collection timelines and patient diagnosis patterns.

Project Overview

The project analyzes sample collection data from patients with Multiple Myeloma and related disorders, grouped into the following diagnosis categories:

MGUS (Monoclonal Gammopathy of Undetermined Significance)
SMM (Smoldering Multiple Myeloma)
NDMM (Newly Diagnosed Multiple Myeloma)
RRMM (Relapsed/Refractory Multiple Myeloma)
PCL (Plasma Cell Leukemia)
LPL (Lymphoplasmacytic Lymphoma)
LPL (Smoldering) (Smoldering Lymphoplasmacytic Lymphoma)

The analysis focuses on temporal patterns of sample collection from blood and bone marrow (BM) specimens over patient follow-up periods.

Repository Structure

MM/
├── data/
│   └── raw/
│       ├── paired_mm.xlsx          # Paired sample data
│       ├── tube_table.csv          # Tube collection metadata
│       ├── unpaired_mm.xlsx        # Unpaired sample data
│       └── unpaired_tube_table.csv # Unpaired tube metadata
├── notebooks/
│   ├── 1.0-rmn-understanding-slides.ipynb  # Main analysis notebook
│   └── plots/                      # Generated visualization outputs
│       ├── *.png                   # High-resolution plot images
│       └── *.pdf                   # Publication-ready PDFs
├── plots/                          # Additional plot outputs
├── requirements.txt                # Python dependencies
└── README.md                       # This file
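
As a quick check before opening the notebook, the layout above can be turned into a small path helper. This is only a sketch: the file names come from the tree above, but the helper names (`raw_data_paths`, `missing_inputs`) and the dictionary keys are hypothetical and do not appear in the notebook.

```python
from pathlib import Path

# File names are taken from the repository tree; the keys are illustrative.
RAW_FILES = {
    "paired_samples": "paired_mm.xlsx",
    "paired_tubes": "tube_table.csv",
    "unpaired_samples": "unpaired_mm.xlsx",
    "unpaired_tubes": "unpaired_tube_table.csv",
}


def raw_data_paths(repo_root="."):
    """Map each key to the expected location of its raw input file."""
    raw_dir = Path(repo_root) / "data" / "raw"
    return {key: raw_dir / name for key, name in RAW_FILES.items()}


def missing_inputs(repo_root="."):
    """Return the keys whose files are not present on disk."""
    paths = raw_data_paths(repo_root)
    return sorted(key for key, path in paths.items() if not path.is_file())
```

Calling `missing_inputs()` from the repository root should return an empty list when all four raw files are in place.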

Data Description

Sample Types

Blood samples: Represented by red triangles (▲) in visualizations
Bone Marrow (BM) samples: Represented by light blue circles (●) in visualizations

Key Datasets

Paired Samples: 95 patients, 250 total samples, max follow-up 52.6 months
Unpaired Samples: 80 patients, 171 total samples, max follow-up 57.5 months
Combined Dataset: 105 patients, 421 total samples, max follow-up 67.5 months
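
The figures above can be cross-checked with simple arithmetic: 250 + 171 = 421 samples, and if the combined dataset is the union of the two patient sets, 95 + 80 - 105 = 70 patients would appear in both. The sketch below shows one way such a summary might be computed. The record layout (patient ID, months from baseline) and the function names are assumptions, and the records are toy values, not study data. The combined maximum follow-up (67.5 months) exceeds both subsets, which may indicate that baselines are recomputed per patient on the merged data; that is a guess, not something this README states.

```python
def summarize(samples):
    """Summarize a list of (patient_id, months_from_baseline) records."""
    patients = {patient_id for patient_id, _ in samples}
    return {
        "patients": len(patients),
        "samples": len(samples),
        "max_follow_up": max(months for _, months in samples),
    }


def shared_patients(n_a, n_b, n_union):
    """Inclusion-exclusion: patients present in both datasets."""
    return n_a + n_b - n_union


# Toy records for illustration only.
paired = [("A", 0.0), ("A", 6.0), ("B", 0.0)]
unpaired = [("B", 3.0), ("C", 0.0)]
combined = summarize(paired + unpaired)
```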

Patient Distribution by Diagnosis

MGUS: 23 patients (follow-up: 0.0 - 49.1 months)
SMM: 29 patients (follow-up: 0.0 - 64.5 months)
NDMM: 21 patients (follow-up: 0.0 - 67.5 months)
RRMM: 22 patients (follow-up: 0.0 - 64.6 months)
PCL: 2 patients (follow-up: 0.0 - 15.9 months)
LPL: 4 patients (follow-up: 0.0 - 34.7 months)
LPL (Smoldering): 4 patients (follow-up: 0.0 - 18.1 months)

Generated Visualizations

The pipeline produces several key visualizations:

1. Swimmer Charts (Timeline Plots)

Paired samples: swimmer_chart_paired_normalized_ordered.png/pdf
Unpaired samples: swimmer_chart_unpaired_normalized_ordered.png/pdf
Combined data: swimmer_chart_combined_normalized_ordered.png/pdf

Features:

Patients ordered by diagnosis (MGUS → LPL Smoldering) then by follow-up duration
Individual patient baseline normalization
Color-coded background bars by diagnosis
Sample type markers (blood vs. bone marrow)
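
The ordering rule can be expressed as a two-part sort key. This is a minimal sketch, not the notebook's code: the intermediate diagnosis order is assumed to follow the list in the Project Overview, the record fields are hypothetical, and whether longer follow-up sorts first within a diagnosis is an assumption exposed as a parameter.

```python
# Assumed priority, following the order in which diagnoses are listed above.
DIAGNOSIS_ORDER = ["MGUS", "SMM", "NDMM", "RRMM", "PCL", "LPL", "LPL (Smoldering)"]


def order_patients(patients, longest_first=True):
    """Sort patient dicts by diagnosis priority, then by follow-up duration.

    Each dict is expected to have 'patient_id', 'diagnosis' and
    'follow_up_months' keys (hypothetical field names).
    """
    rank = {diagnosis: i for i, diagnosis in enumerate(DIAGNOSIS_ORDER)}
    sign = -1 if longest_first else 1
    return sorted(
        patients,
        key=lambda p: (rank[p["diagnosis"]], sign * p["follow_up_months"]),
    )
```

Each patient then takes one row of the swimmer chart in the returned order.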

2. Stacked Area Plot

File: stacked_area_tubes_by_diagnosis.png/pdf
Shows tube collection frequency over time by diagnosis
2-month time bins for temporal analysis
Diagnosis-specific color coding
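
The 2-month binning behind the stacked area plot might look like the sketch below. It assumes half-open bins ([0, 2), [2, 4), ...) and a simple (diagnosis, months from baseline) record per tube; both are illustrative choices rather than details confirmed by the notebook.

```python
from collections import Counter


def bin_start(months, width=2):
    """Lower edge of the bin containing `months`, for bins of `width` months."""
    return int(months // width) * width


def tubes_per_bin(tubes, width=2):
    """Count tubes per (bin start, diagnosis).

    `tubes` is an iterable of (diagnosis, months_from_baseline) pairs.
    """
    counts = Counter()
    for diagnosis, months in tubes:
        counts[(bin_start(months, width), diagnosis)] += 1
    return counts
```

Stacking the per-diagnosis counts for each bin start gives the heights of the area plot.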

Installation & Setup

Prerequisites

Python 3.7+
Jupyter Notebook/Lab

Dependencies

Install required packages:

pip install -r requirements.txt

Required packages:

pandas - Data manipulation and analysis
numpy - Numerical computing
matplotlib - Basic plotting
seaborn - Statistical visualization
plotly - Interactive plots
scikit-learn - Machine learning utilities
scipy - Scientific computing
openpyxl - Excel file handling
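
To confirm the environment before running the notebook, a short import check can help. The sketch below uses only the standard library; the helper name is hypothetical. Note that scikit-learn is imported as `sklearn`, while the other packages import under their pip names.

```python
from importlib.util import find_spec

# pip package name -> import name
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "plotly": "plotly",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "openpyxl": "openpyxl",
}


def missing_packages(required=REQUIRED):
    """Return pip names of packages that cannot be found in this environment."""
    return sorted(
        pip_name for pip_name, module in required.items() if find_spec(module) is None
    )
```

An empty list from `missing_packages()` means every listed dependency can be located.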

Usage

Running the Analysis

1. Start Jupyter:
```
jupyter notebook
```
2. Open the main notebook: navigate to notebooks/1.0-rmn-understanding-slides.ipynb
3. Execute the cells sequentially to:
- Load and clean the data
- Normalize patient timelines
- Generate swimmer charts
- Create stacked area plots
- Export high-quality visualizations

Key Analysis Steps

Data Loading: Import Excel and CSV files
Data Cleaning: Standardize diagnosis labels and sample types
Temporal Normalization: Calculate months from individual patient baselines
Patient Ordering: Sort by diagnosis priority and follow-up duration
Visualization Generation: Create swimmer charts and area plots
Export: Save plots in PNG and PDF formats
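
The temporal normalization step can be sketched as follows. It assumes that a patient's baseline is their earliest collection date and that days are converted to months using an average month length of 365.25 / 12 days; the notebook may use a different convention. The function name and record layout are hypothetical.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # assumed conversion factor


def months_from_baseline(collections):
    """Convert (patient_id, collection_date) pairs to months since baseline.

    The baseline is taken to be each patient's earliest collection date.
    """
    baseline = {}
    for patient_id, day in collections:
        baseline[patient_id] = min(day, baseline.get(patient_id, day))
    return [
        (patient_id, round((day - baseline[patient_id]).days / DAYS_PER_MONTH, 1))
        for patient_id, day in collections
    ]
```

With this convention every patient's first sample sits at month 0.0, which matches the 0.0 lower bound of the follow-up ranges listed earlier.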

Key Features

Data Processing

Automatic data cleaning and standardization
Individual patient baseline calculation
Diagnosis-based patient ordering
Sample type categorization
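
Label standardization could be handled with small lookup tables, as in the sketch below. The canonical labels come from this README; the raw spellings in the alias tables are hypothetical examples, since the actual variants in the source files are not documented here.

```python
CANONICAL_DIAGNOSES = ["MGUS", "SMM", "NDMM", "RRMM", "PCL", "LPL", "LPL (Smoldering)"]

# Hypothetical raw spellings mapped to canonical labels.
DIAGNOSIS_ALIASES = {
    "smoldering lpl": "LPL (Smoldering)",
    "lpl smoldering": "LPL (Smoldering)",
}
SAMPLE_TYPE_ALIASES = {
    "blood": "Blood",
    "pb": "Blood",
    "bm": "BM",
    "bone marrow": "BM",
}


def _normalize(raw):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(raw.strip().lower().split())


def standardize_diagnosis(raw):
    """Return the canonical diagnosis label, or None if unrecognized."""
    lookup = {label.lower(): label for label in CANONICAL_DIAGNOSES}
    lookup.update(DIAGNOSIS_ALIASES)
    return lookup.get(_normalize(raw))


def standardize_sample_type(raw):
    """Return 'Blood' or 'BM', or None if unrecognized."""
    return SAMPLE_TYPE_ALIASES.get(_normalize(raw))
```

Returning None for unknown values keeps unrecognized labels visible for review instead of silently assigning them to a category.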

Visualizations

Swimmer Charts: Individual patient timelines with sample collection points
Stacked Area Plots: Population-level collection patterns over time
Color Coding: Diagnosis-specific visual encoding
Multiple Formats: High-resolution PNG and publication-ready PDF outputs

Quality Control

Data validation and conflict resolution
Missing data handling
Comprehensive summary statistics
Detailed breakdown tables
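
This README does not describe how conflicts are detected or resolved. As one illustration only, the sketch below flags patients that carry more than one diagnosis label and counts records with a missing label. The record layout and function names are assumptions, and a patient with two labels may simply reflect disease progression rather than a data error.

```python
from collections import defaultdict


def find_conflicts(records):
    """Return {patient_id: sorted labels} for patients with more than one label.

    `records` is an iterable of (patient_id, diagnosis) pairs, where the
    diagnosis may be None when missing.
    """
    seen = defaultdict(set)
    for patient_id, diagnosis in records:
        if diagnosis is not None:
            seen[patient_id].add(diagnosis)
    return {pid: sorted(labels) for pid, labels in seen.items() if len(labels) > 1}


def count_missing(records):
    """Count records whose diagnosis is missing."""
    return sum(1 for _, diagnosis in records if diagnosis is None)
```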

Output Files

All visualizations are saved in both PNG (300 DPI) and PDF formats:

plots/swimmer_chart_*_normalized_ordered.{png,pdf}
plots/stacked_area_tubes_by_diagnosis.{png,pdf}
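
Saving each figure in both formats can be wrapped in a small helper. This is a sketch assuming `fig` is a matplotlib Figure (or any object with a compatible `savefig` method); the helper name and the `bbox_inches="tight"` setting are illustrative choices, not taken from the notebook.

```python
from pathlib import Path


def export_figure(fig, stem, out_dir="plots", dpi=300):
    """Save `fig` as <stem>.png and <stem>.pdf and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = [out / f"{stem}.{ext}" for ext in ("png", "pdf")]
    for path in paths:
        # dpi controls the PNG resolution; the PDF output is vector-based.
        fig.savefig(path, dpi=dpi, bbox_inches="tight")
    return paths
```

For example, `export_figure(fig, "stacked_area_tubes_by_diagnosis")` would write the two files listed above.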

Analysis Insights

Peak Collection Periods

Months 26-28: 36 tubes (highest)
Months 10-12: 30 tubes
Months 22-24: 28 tubes
Months 20-22: 25 tubes
Months 28-30: 25 tubes

Sample Distribution

SMM: 135 tubes (32% of total)
NDMM: 104 tubes (25% of total)
RRMM: 86 tubes (20% of total)
MGUS: 67 tubes (16% of total)
LPL: 12 tubes (3% of total)
LPL (Smoldering): 11 tubes (3% of total)
PCL: 6 tubes (1% of total)
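
The percentages above follow directly from the tube counts, which sum to 421 (the combined sample total). A short check, using only the numbers listed in this section:

```python
TUBES = {
    "SMM": 135,
    "NDMM": 104,
    "RRMM": 86,
    "MGUS": 67,
    "LPL": 12,
    "LPL (Smoldering)": 11,
    "PCL": 6,
}


def percentages(counts):
    """Share of the total for each key, rounded to the nearest whole percent."""
    total = sum(counts.values())
    return {key: round(100 * value / total) for key, value in counts.items()}
```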

Contributing

This analysis pipeline is part of the HDSCA (High-Dimensional Single Cell Analysis) project. For contributions or questions, please follow the project's contribution guidelines.

Notes

All timestamps are normalized to individual patient baselines
Follow-up durations vary significantly by diagnosis type
Visualizations are optimized for both digital viewing and print publication
Data privacy: All patient identifiers are anonymized (MM02-XXXX format)

Last updated: September 2025
