This project implements a hybrid machine learning approach for classifying breast cancer from DNA sequences using bidirectional embeddings generated by DNABERT. The study processes over 46 million high-quality DNA sequences to distinguish between cancerous and non-cancerous genomic material.
- Bidirectional Analysis: Utilizes both forward and reverse DNA strand representations
- Hybrid Classification: Combines Random Forest (forward embeddings) and Deep Neural Networks (backward embeddings)
- High-Quality Data: Implements Q30 Phred score filtering for 99.9% base call accuracy
- Scalable Processing: Handles large genomic datasets using efficient batch processing
- Apply DNA sequence analysis to distinguish genetic material linked to breast cancer from non-cancerous genomic material
- Develop a bidirectional DNA sequence embedding approach for improved classification
- Demonstrate the potential of genomic information for early, non-invasive cancer diagnosis
Main Publication: Breast_Cancer_Classification_of_DNA_Sequences3.pdf
Authors:
- Aakash Walavalkar (Michigan Technological University, USA)
- Anushka Kumar (NMIMS, Mumbai)
- Laavanya Mishra (NMIMS, Mumbai)
Keywords: Phred Score, Base Pair, Sequence Embeddings, Breast Cancer, DNA Sequencing, Classification
├── Breast_Cancer_Classification_of_DNA_Sequences3.pdf    # Main research paper
├── Data Processing & Analysis
│   ├── 6. Generating Clean Readings for All Batches.pdf  # Quality control procedures
│   ├── cleaning_sequences.ipynb                          # Sequence cleaning implementation
│   ├── data_read_forward.ipynb                           # Forward strand processing
│   ├── data_read_backward.ipynb                          # Backward strand processing
│   └── rough.ipynb                                       # Utility functions
├── Embedding Generation
│   ├── embeddings.py                                     # DNABERT embedding generation
│   └── embeddings.ipynb                                  # Embedding analysis notebook
├── Model Training
│   ├── forw_train_df.ipynb                               # Forward embeddings training
│   └── backw_train_df.ipynb                              # Backward embeddings training
└── Results & Analysis
    ├── UMAP visualizations                               # Dimensionality reduction plots
    └── Classification reports                            # Model performance metrics
- Source: National Genomics Data Center (NGDC) and NCBI SRA
- Datasets:
  - Cancerous: Primary breast cancer sequences (SRR5177930)
  - Non-Cancerous: Normal breast tissue samples (SRR6269879)
- Format: FASTQ files converted to Parquet for efficient processing

import numpy as np

def is_quality_good(quality_scores):
    # Keep a read only if every base call has a Phred score of at least 30
    return np.min(quality_scores) >= 30

Filtering Criteria:
- Phred Score: ≥ Q30 (99.9% base call accuracy)
- Sequence Length: ≥ 100 base pairs
- Batch Size: 100,000 reads per batch for memory efficiency
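
The link between Q30 and 99.9% accuracy follows from the Phred definition, Q = -10 · log10(P), where P is the probability that a base call is wrong. A minimal sketch of that arithmetic, using only the standard library; the function names are illustrative and do not come from the project notebooks:

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Base call accuracy implied by a Phred score."""
    return 1.0 - phred_to_error_prob(q)


# Print the error rate and accuracy for a few common thresholds
for q in (10, 20, 30, 40):
    print(f"Q{q}: error = {phred_to_error_prob(q):.4%}, accuracy = {phred_to_accuracy(q):.2%}")
```

At Q30 the error probability is 1 in 1,000, which is why the minimum-score filter is a strict one: a single base below the threshold discards the whole read.
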
- Parse FASTQ files using Biopython's SeqIO
- Extract sequence ID, nucleotide sequence, and quality scores
- Batch processing to handle 60M+ sequences efficiently
- Apply Q30 filtering to ensure high-quality sequences
- Remove sequences with any base having Phred score < 30
- Filter sequences shorter than 100 base pairs
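
The loading and filtering steps above can be sketched as a small streaming pipeline. This is an illustration under stated assumptions, not the notebooks' actual code: the names `passes_filters`, `read_fastq`, and `iter_clean_batches` are hypothetical, and the sketch assumes batches are formed from raw reads before filtering. The Biopython calls (`SeqIO.parse` with the `"fastq"` format and `letter_annotations["phred_quality"]`) are the library's standard FASTQ interface.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple

MIN_PHRED = 30        # Q30 threshold from the filtering criteria
MIN_LENGTH = 100      # minimum read length in base pairs
BATCH_SIZE = 100_000  # reads per batch

Read = Tuple[str, str, Sequence[int]]  # (read ID, nucleotide sequence, Phred scores)


def passes_filters(seq: str, quals: Sequence[int]) -> bool:
    """True if the read is long enough and every base has Phred >= MIN_PHRED."""
    return len(seq) >= MIN_LENGTH and min(quals, default=0) >= MIN_PHRED


def read_fastq(path: str) -> Iterator[Read]:
    """Stream (id, sequence, qualities) tuples from a FASTQ file via Biopython."""
    from Bio import SeqIO  # imported here so the helpers below work without Biopython

    for record in SeqIO.parse(path, "fastq"):
        yield record.id, str(record.seq), record.letter_annotations["phred_quality"]


def iter_clean_batches(reads: Iterable[Read], batch_size: int = BATCH_SIZE) -> Iterator[List[Read]]:
    """Group raw reads into batches, keeping only the reads that pass both filters."""
    reads = iter(reads)
    while True:
        batch = list(islice(reads, batch_size))
        if not batch:
            return
        yield [r for r in batch if passes_filters(r[1], r[2])]
```

Because only one batch is held in memory at a time, the memory footprint stays bounded regardless of the FASTQ file size. Each yielded batch could then be written out, for example with pandas' `DataFrame.to_parquet`.
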

from transformers import AutoModel

# DNABERT-6 tokenization and embedding (simplified)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6")
# `encode` is shorthand for the tokenize + forward pass + pooling steps
embeddings = model.encode(sequences)  # 768-dimensional vectors

- Algorithm: RandomForestClassifier with Optuna hyperparameter optimization
- Best Parameters:
  - n_estimators: 161
  - max_depth: 21
  - max_features: sqrt
- Performance: AUC = 0.9753, Accuracy = 97.5%
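
A tuning setup of this kind might look like the sketch below. The search ranges, the 3-fold cross-validation, and the ROC AUC objective are assumptions chosen for illustration; the source reports only the best values. The helper names `suggest_params` and `tune_random_forest` are hypothetical.

```python
# Best values reported above
BEST_PARAMS = {"n_estimators": 161, "max_depth": 21, "max_features": "sqrt"}


def suggest_params(trial):
    """Sample one Random Forest configuration; the ranges are illustrative assumptions."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 5, 40),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
    }


def tune_random_forest(X, y, n_trials=30, seed=42):
    """Maximise cross-validated ROC AUC over the search space above."""
    # Imported lazily so suggest_params stays usable without these packages
    import optuna
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def objective(trial):
        clf = RandomForestClassifier(**suggest_params(trial), n_jobs=-1, random_state=seed)
        return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

    study = optuna.create_study(
        direction="maximize", sampler=optuna.samplers.TPESampler(seed=seed)
    )
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Here `X` would be the matrix of forward embeddings (one 768-dimensional row per read) and `y` the cancerous or non-cancerous labels.
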
- Architecture: Feedforward Neural Network
- Input: 768-dimensional DNABERT embeddings
- Layers: [4096, 2048, 1024, 512, 256, 128]
- Dropout: 0.3
- Optimizer: Adam
- Performance: AUC = 0.9493, Accuracy = 88.05%
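
The layer list above translates into a stack of linear layers. The sketch below is one possible PyTorch reading of it: the ReLU activations and the single-logit output layer are assumptions, since the source specifies only the hidden sizes, the dropout rate, and the optimizer. The names `layer_shapes` and `build_model` are hypothetical.

```python
HIDDEN_SIZES = [4096, 2048, 1024, 512, 256, 128]
INPUT_DIM = 768   # DNABERT embedding size
DROPOUT = 0.3


def layer_shapes(input_dim=INPUT_DIM, hidden=HIDDEN_SIZES, output_dim=1):
    """Return (in_features, out_features) for each linear layer, output layer included."""
    dims = [input_dim, *hidden, output_dim]
    return list(zip(dims[:-1], dims[1:]))


def build_model():
    """Assemble the network in PyTorch. ReLU and the single-logit output are assumptions."""
    import torch.nn as nn  # imported lazily so layer_shapes works without PyTorch

    shapes = layer_shapes()
    layers = []
    for in_f, out_f in shapes[:-1]:
        layers += [nn.Linear(in_f, out_f), nn.ReLU(), nn.Dropout(DROPOUT)]
    layers.append(nn.Linear(*shapes[-1]))  # raw logit; pair with nn.BCEWithLogitsLoss
    return nn.Sequential(*layers)
```

Under these assumptions the network has about 14.3 million trainable parameters, most of them in the first two layers. Training would pair it with `torch.optim.Adam(model.parameters())`; the learning rate is not stated in the source.
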
| Model Type | Embedding Direction | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|---|
| Random Forest (Tuned) | Forward | 97.50% | 0.97-0.98 | 0.97-0.98 | 0.97-0.98 | 0.9753 |
| Neural Network | Backward | 88.05% | 0.87-0.90 | 0.85-0.91 | 0.87-0.89 | 0.9493 |
- Python 3.9.21
- CUDA-compatible GPU (recommended for DNABERT processing)
- Azure VM or similar cloud compute environment
pip install torch transformers
pip install biopython pandas pyarrow numpy scikit-learn
pip install optuna umap-learn
pip install duckdb  # For efficient Parquet querying

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)

# Clean and filter sequences
jupyter notebook cleaning_sequences.ipynb
# Process forward and backward strands
jupyter notebook data_read_forward.ipynb
jupyter notebook data_read_backward.ipynb

# Run DNABERT embedding generation
python embeddings.py --input_dir ./clean_sequences --output_dir ./embeddings

# Train forward embeddings model
jupyter notebook forw_train_df.ipynb
# Train backward embeddings model
jupyter notebook backw_train_df.ipynb

clean_forward_reads/ # Forward cancerous sequences (604 batches, 28M sequences)
clean_backward_reads/ # Backward cancerous sequences (604 batches, 9M sequences)
clean_forward_noncan/ # Forward non-cancerous sequences (553 batches, 5.7M sequences)
clean_backward_noncan/ # Backward non-cancerous sequences (553 batches, 3.8M sequences)
embeddings/
├── forward_cancerous_embeddings.npy      # 768-dim vectors + metadata CSV
├── forward_noncancerous_embeddings.npy   # 768-dim vectors + metadata CSV
├── backward_cancerous_embeddings.npy     # 768-dim vectors + metadata CSV
└── backward_noncancerous_embeddings.npy  # 768-dim vectors + metadata CSV
- Platform: Azure Virtual Machine (cloud-hosted)
- Access: SSH with Visual Studio Code remote development
- Storage: Efficient Parquet format for large genomic datasets
- Monitoring: Weights & Biases for experiment tracking
- DNABERT: Pre-trained transformer for genomic sequences with 6-mer tokenization
- Optuna: Hyperparameter optimization framework
- UMAP: Dimensionality reduction for visualization
- DuckDB: Lightweight querying for large Parquet files
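
DNABERT's 6-mer tokenization means each read is first split into overlapping windows of six bases, joined by spaces, before it reaches the tokenizer. The sketch below shows that step plus one way to turn the model output into a single 768-dimensional vector per read. The mean pooling over non-padding tokens is an assumption (the source does not state how vectors are pooled), and `seq_to_kmers` and `embed` are hypothetical names, not functions from `embeddings.py`.

```python
def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers joined by spaces."""
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def embed(sequences, model_name="zhihan1996/DNA_bert_6", max_length=512):
    """Return one vector per sequence; mean pooling is an assumption."""
    # Imported lazily so seq_to_kmers works without torch or transformers
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval()
    inputs = tokenizer(
        [seq_to_kmers(s) for s in sequences],
        return_tensors="pt", padding=True, truncation=True, max_length=max_length,
    )
    with torch.no_grad():
        hidden = model(**inputs)[0]                        # (batch, tokens, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    return ((hidden * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
```

A read of length n yields n - 5 overlapping 6-mers, so reads at the 100 bp minimum produce 95 tokens before the special tokens are added.
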
- Total Sequences Processed: 46,968,954 high-quality sequences
- Quality Threshold: Phred Q30 (1 in 1,000 error rate)
- Batch Processing: 100,000 sequences per batch for memory efficiency
- Bidirectional Approach: Forward and backward embeddings show different separability patterns in UMAP visualization
- Model Selection: Random Forest optimal for forward embeddings; Neural Networks better for backward embeddings
- Quality Impact: Q30 filtering significantly reduces dataset size but improves classification accuracy
- Performance: Both models achieve high accuracy (>88%) demonstrating feasibility of genomic-based cancer classification
- Multi-class cancer type classification
- Integration with clinical data
- Real-time diagnostic applications
- Cross-population validation studies
This is an academic research project. For collaboration opportunities or questions:
- Aakash Walavalkar: aakash.muskurahat@gmail.com
- Anushka Kumar: anushka.ayyanar@gmail.com
- Laavanya Mishra: mishralaavanya@gmail.com
The complete reference list is available in the research paper. Key references include:
- DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language
- National Genomics Data Center (NGDC) datasets
- Apache Parquet for efficient genomic data storage
- Optuna for automated hyperparameter optimization
This project is for academic research purposes. Please cite the paper if you use this work in your research.
Last Updated: September 2025
Version: 1.0
Status: Research Complete