This repository contains data analysis and visualization tools for a Multiple Myeloma research study focusing on sample collection timelines and patient diagnosis patterns.
The project analyzes sample collection data from patients across the following diagnoses, which span precursor conditions, Multiple Myeloma stages, and related disorders:
- MGUS (Monoclonal Gammopathy of Undetermined Significance)
- SMM (Smoldering Multiple Myeloma)
- NDMM (Newly Diagnosed Multiple Myeloma)
- RRMM (Relapsed/Refractory Multiple Myeloma)
- PCL (Plasma Cell Leukemia)
- LPL (Lymphoplasmacytic Lymphoma)
- LPL (Smoldering) (Smoldering Lymphoplasmacytic Lymphoma)
The analysis focuses on temporal patterns of sample collection from blood and bone marrow (BM) specimens over patient follow-up periods.
MM/
├── data/
│   └── raw/
│       ├── paired_mm.xlsx                      # Paired sample data
│       ├── tube_table.csv                      # Tube collection metadata
│       ├── unpaired_mm.xlsx                    # Unpaired sample data
│       └── unpaired_tube_table.csv             # Unpaired tube metadata
├── notebooks/
│   ├── 1.0-rmn-understanding-slides.ipynb      # Main analysis notebook
│   └── plots/                                  # Generated visualization outputs
│       ├── *.png                               # High-resolution plot images
│       └── *.pdf                               # Publication-ready PDFs
├── plots/                                      # Additional plot outputs
├── requirements.txt                            # Python dependencies
└── README.md                                   # This file
- Blood samples: Represented by red triangles (▲) in visualizations
- Bone Marrow (BM) samples: Represented by light blue circles (●) in visualizations
- Paired Samples: 95 patients, 250 total samples, max follow-up 52.6 months
- Unpaired Samples: 80 patients, 171 total samples, max follow-up 57.5 months
- Combined Dataset: 105 patients, 421 total samples, max follow-up 67.5 months
- MGUS: 23 patients (follow-up: 0.0 - 49.1 months)
- SMM: 29 patients (follow-up: 0.0 - 64.5 months)
- NDMM: 21 patients (follow-up: 0.0 - 67.5 months)
- RRMM: 22 patients (follow-up: 0.0 - 64.6 months)
- PCL: 2 patients (follow-up: 0.0 - 15.9 months)
- LPL: 4 patients (follow-up: 0.0 - 34.7 months)
- LPL (Smoldering): 4 patients (follow-up: 0.0 - 18.1 months)
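Every follow-up range above starts at 0.0 months, which is what you would expect when each patient's timeline is measured from that patient's own baseline. The following is a minimal sketch of that normalization in plain Python, not the notebook's actual code. It assumes the baseline is the patient's earliest collection date and that days are converted to months with an average month length of 30.44 days; the function name, the constant, and the sample records are all hypothetical.

```python
from datetime import date

# Assumption: average month length used to convert days to months.
DAYS_PER_MONTH = 30.44


def months_from_baseline(samples):
    """Return (patient_id, months) pairs measured from each patient's baseline.

    `samples` is a list of (patient_id, collection_date) tuples. The baseline
    is assumed to be the earliest collection date recorded for the patient.
    """
    baselines = {}
    for patient_id, collected in samples:
        # Keep the earliest date seen so far for this patient.
        if patient_id not in baselines or collected < baselines[patient_id]:
            baselines[patient_id] = collected
    return [
        (patient_id, round((collected - baselines[patient_id]).days / DAYS_PER_MONTH, 1))
        for patient_id, collected in samples
    ]


# Hypothetical records; the IDs only imitate the anonymized MM02-XXXX format.
samples = [
    ("MM02-0001", date(2021, 3, 1)),
    ("MM02-0001", date(2021, 9, 1)),
    ("MM02-0002", date(2022, 1, 15)),
]
normalized = months_from_baseline(samples)
```

The same idea maps onto a pandas group-by on the patient ID: subtract each group's minimum date from its rows.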
The pipeline produces several key visualizations:
- Paired samples: swimmer_chart_paired_normalized_ordered.png/pdf
- Unpaired samples: swimmer_chart_unpaired_normalized_ordered.png/pdf
- Combined data: swimmer_chart_combined_normalized_ordered.png/pdf
Features:
- Patients ordered by diagnosis (MGUS → LPL Smoldering) then by follow-up duration
- Individual patient baseline normalization
- Color-coded background bars by diagnosis
- Sample type markers (blood vs. bone marrow)
- File: stacked_area_tubes_by_diagnosis.png/pdf
- Shows tube collection frequency over time by diagnosis
- 2-month time bins for temporal analysis
- Diagnosis-specific color coding
- Python 3.7+
- Jupyter Notebook/Lab
Install required packages:
pip install -r requirements.txt
Required packages:
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- matplotlib - Basic plotting
- seaborn - Statistical visualization
- plotly - Interactive plots
- scikit-learn - Machine learning utilities
- scipy - Scientific computing
- openpyxl - Excel file handling
- Start Jupyter: jupyter notebook
- Open the main notebook: navigate to notebooks/1.0-rmn-understanding-slides.ipynb
- Execute cells sequentially to:
  - Load and clean the data
  - Normalize patient timelines
  - Generate swimmer charts
  - Create stacked area plots
  - Export high-quality visualizations
- Data Loading: Import Excel and CSV files
- Data Cleaning: Standardize diagnosis labels and sample types
- Temporal Normalization: Calculate months from individual patient baselines
- Patient Ordering: Sort by diagnosis priority and follow-up duration
- Visualization Generation: Create swimmer charts and area plots
- Export: Save plots in PNG and PDF formats
- Automatic data cleaning and standardization
- Individual patient baseline calculation
- Diagnosis-based patient ordering
- Sample type categorization
- Swimmer Charts: Individual patient timelines with sample collection points
- Stacked Area Plots: Population-level collection patterns over time
- Color Coding: Diagnosis-specific visual encoding
- Multiple Formats: High-resolution PNG and publication-ready PDF outputs
- Data validation and conflict resolution
- Missing data handling
- Comprehensive summary statistics
- Detailed breakdown tables
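The diagnosis-based patient ordering can be pictured as a two-key sort: first by diagnosis priority, then by follow-up duration. The sketch below is illustrative only. The diagnosis order is taken from the order in which this README lists the diagnoses (MGUS first, LPL (Smoldering) last); sorting longer follow-up first within a diagnosis is an assumption, and the function name and patient records are hypothetical.

```python
# Diagnosis order as listed in this README (MGUS first, LPL (Smoldering) last).
DIAGNOSIS_ORDER = ["MGUS", "SMM", "NDMM", "RRMM", "PCL", "LPL", "LPL (Smoldering)"]
PRIORITY = {name: rank for rank, name in enumerate(DIAGNOSIS_ORDER)}


def order_patients(patients):
    """Sort (patient_id, diagnosis, follow_up_months) tuples.

    Primary key: diagnosis priority. Secondary key: follow-up duration,
    longest first (an assumption; the notebook may sort the other way).
    """
    return sorted(patients, key=lambda p: (PRIORITY[p[1]], -p[2]))


# Hypothetical patients for illustration.
patients = [
    ("MM02-0003", "SMM", 12.0),
    ("MM02-0001", "MGUS", 5.5),
    ("MM02-0004", "LPL", 30.1),
    ("MM02-0002", "MGUS", 40.2),
]
ordered = order_patients(patients)
```

An unknown diagnosis label raises a KeyError here, which makes unstandardized labels visible early instead of silently placing them at one end of the chart.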
All visualizations are saved in both PNG (300 DPI) and PDF formats:
plots/swimmer_chart_*_normalized_ordered.{png,pdf}
plots/stacked_area_tubes_by_diagnosis.{png,pdf}
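As a rough illustration of the dual-format export, the sketch below draws a single toy swimmer lane with matplotlib and saves it as a 300 DPI PNG and a PDF. It is not the notebook's plotting code: the helper name `save_png_and_pdf`, the output file stem, the gray lane color, and the data points are hypothetical. Only the marker conventions (red triangles for blood, light blue circles for bone marrow) and the 300 DPI setting come from this README.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def save_png_and_pdf(fig, out_dir, stem, dpi=300):
    """Save one figure as <stem>.png and <stem>.pdf and return both paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for ext in ("png", "pdf"):
        path = os.path.join(out_dir, f"{stem}.{ext}")
        fig.savefig(path, dpi=dpi, bbox_inches="tight")
        paths.append(path)
    return paths


# One hypothetical patient lane: a background bar plus sample markers.
fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(0, 12.0, color="lightgray")  # follow-up duration in months
ax.scatter([0.0, 6.0], [0, 0], marker="^", color="red", label="Blood", zorder=3)
ax.scatter([12.0], [0], marker="o", color="lightblue", label="Bone marrow", zorder=3)
ax.set_xlabel("Months from baseline")
ax.set_yticks([])
ax.legend()

out_dir = tempfile.mkdtemp()  # temporary folder so the real plots/ is untouched
saved = save_png_and_pdf(fig, out_dir, "swimmer_chart_example")
plt.close(fig)
```

The `dpi` argument mainly affects the raster PNG; the PDF stores the bars and markers as vector graphics.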
- Months 26-28: 36 tubes (highest)
- Months 10-12: 30 tubes
- Months 22-24: 28 tubes
- Months 20-22: 25 tubes
- Months 28-30: 25 tubes
- SMM: 135 tubes (32% of total)
- NDMM: 104 tubes (25% of total)
- RRMM: 86 tubes (20% of total)
- MGUS: 67 tubes (16% of total)
- LPL: 12 tubes (3% of total)
- LPL (Smoldering): 11 tubes (3% of total)
- PCL: 6 tubes (1% of total)
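The per-bin and per-diagnosis figures above can be reproduced with simple counting. The sketch below shows one way to assign normalized times to 2-month bins and to recompute the percentage shares from the tube counts listed above. It assumes bins are closed on the left (a sample at exactly month 2.0 falls in "Months 2-4"); the function names and the example month values are hypothetical, while the diagnosis counts are copied from this README.

```python
from collections import Counter

BIN_WIDTH_MONTHS = 2


def bin_label(months):
    """Map a normalized time in months to its 2-month bin label.

    Assumption: bins are closed on the left, so 2.0 belongs to 'Months 2-4'.
    """
    start = int(months // BIN_WIDTH_MONTHS) * BIN_WIDTH_MONTHS
    return f"Months {start}-{start + BIN_WIDTH_MONTHS}"


def tubes_per_bin(month_values):
    """Count tubes per 2-month bin."""
    return Counter(bin_label(m) for m in month_values)


def percent_shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Tube counts by diagnosis as reported in this README.
TUBES_BY_DIAGNOSIS = {
    "SMM": 135,
    "NDMM": 104,
    "RRMM": 86,
    "MGUS": 67,
    "LPL": 12,
    "LPL (Smoldering)": 11,
    "PCL": 6,
}
shares = percent_shares(TUBES_BY_DIAGNOSIS)

# Hypothetical normalized collection times, in months.
bins = tubes_per_bin([0.0, 1.9, 2.0, 26.4, 27.9])
```

The diagnosis counts sum to 421, matching the combined dataset total, and the recomputed shares agree with the percentages listed above.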
This analysis pipeline is part of the HDSCA (High-Dimensional Single Cell Analysis) project. For contributions or questions, please follow the project's contribution guidelines.
- All timestamps are normalized to individual patient baselines
- Follow-up durations vary significantly by diagnosis type
- Visualizations are optimized for both digital viewing and print publication
- Data privacy: All patient identifiers are anonymized (MM02-XXXX format)
Last updated: September 2025