This project implements a hybrid machine learning approach for classifying breast cancer from DNA sequences using bidirectional embeddings generated by DNABERT. The study processes over 46 million high-quality DNA sequences to distinguish between cancerous and non-cancerous genomic material.

tech-aakash/Breast-cancer-classification-using-DNA-Sequencing


Breast Cancer Classification of DNA Sequences

📚 Overview

This project implements a hybrid machine learning approach for classifying breast cancer from DNA sequences using bidirectional embeddings generated by DNABERT. The study processes over 46 million high-quality DNA sequences to distinguish between cancerous and non-cancerous genomic material.

Key Features

  • Bidirectional Analysis: Utilizes both forward and reverse DNA strand representations
  • Hybrid Classification: Combines Random Forest (forward embeddings) and Deep Neural Networks (backward embeddings)
  • High-Quality Data: Implements Q30 Phred score filtering for 99.9% base call accuracy
  • Scalable Processing: Handles large genomic datasets using efficient batch processing

🎯 Research Objectives

  • Apply DNA sequence analysis to distinguish genetic material linked to breast cancer from non-cancerous genetic material
  • Develop a bidirectional DNA sequence embedding approach for improved classification
  • Demonstrate the potential of genomic information for early, non-invasive cancer diagnosis

📖 Research Paper

📄 Main Publication: Breast_Cancer_Classification_of_DNA_Sequences3.pdf

Authors:

  • Aakash Walavalkar (Michigan Technological University, USA)
  • Anushka Kumar (NMIMS, Mumbai)
  • Laavanya Mishra (NMIMS, Mumbai)

Keywords: Phred Score, Base Pair, Sequence Embeddings, Breast Cancer, DNA Sequencing, Classification

🗂️ Project Structure

├── 📄 Breast_Cancer_Classification_of_DNA_Sequences3.pdf   # Main research paper
├── 📊 Data Processing & Analysis
│   ├── 6. Generating Clean Readings for All Batches.pdf   # Quality control procedures
│   ├── cleaning_sequences.ipynb                           # Sequence cleaning implementation
│   ├── data_read_forward.ipynb                            # Forward strand processing
│   ├── data_read_backward.ipynb                           # Backward strand processing
│   └── rough.ipynb                                        # Utility functions
├── 🧬 Embedding Generation
│   ├── embeddings.py                                      # DNABERT embedding generation
│   └── embeddings.ipynb                                   # Embedding analysis notebook
├── 🤖 Model Training
│   ├── forw_train_df.ipynb                                # Forward embeddings training
│   └── backw_train_df.ipynb                               # Backward embeddings training
└── 📈 Results & Analysis
    ├── UMAP visualizations                                # Dimensionality reduction plots
    └── Classification reports                             # Model performance metrics

🔬 Methodology

1. Data Acquisition

  • Source: National Genomics Data Center (NGDC) and NCBI SRA
  • Datasets:
    • Cancerous: Primary breast cancer sequences (SRR5177930)
    • Non-Cancerous: Normal breast tissue samples (SRR6269879)
  • Format: FASTQ files converted to Parquet for efficient processing

2. Quality Control & Filtering

import numpy as np

def is_quality_good(quality_scores):
    return np.min(quality_scores) >= 30  # keep a read only if every base is Q30 or better

Filtering Criteria:

  • Phred Score: ≥ Q30 (99.9% base call accuracy)
  • Sequence Length: ≥ 100 base pairs
  • Batch Size: 100,000 reads per batch for memory efficiency
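
The two filtering criteria can be combined into a single predicate. The sketch below assumes the FASTQ quality string uses the Phred+33 ASCII encoding, which this README does not state. The names `decode_phred33`, `error_probability`, and `passes_filters` are illustrative and are not taken from the repository's notebooks.

```python
MIN_PHRED = 30    # Q30 threshold from the filtering criteria
MIN_LENGTH = 100  # minimum read length in base pairs


def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


def passes_filters(sequence, quality_string):
    """Keep a read only if it is long enough and every base is Q30 or better."""
    if len(sequence) < MIN_LENGTH:
        return False
    return min(decode_phred33(quality_string)) >= MIN_PHRED


print(error_probability(30))                 # error probability at Q30
print(passes_filters("A" * 100, "I" * 100))  # 'I' encodes Q40 under Phred+33
```

Requiring the minimum score to reach Q30 is stricter than requiring the mean to reach Q30: a single low-quality base is enough to reject the read.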

3. Data Processing Pipeline

Phase 1: FASTQ to Parquet Conversion

  • Parse FASTQ files using Biopython's SeqIO
  • Extract sequence ID, nucleotide sequence, and quality scores
  • Batch processing to handle 60M+ sequences efficiently
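
The batching idea in Phase 1 can be sketched without any third-party dependency. The parser below is a simplified stand-in for Biopython's `SeqIO`, not the repository's code: it assumes exactly four lines per FASTQ record, and `read_fastq` and `batched` are hypothetical helper names. Writing each batch to Parquet is left out.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines between records
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        read_id = header.strip()[1:].split()[0]  # drop '@' and any description
        yield read_id, sequence, quality


def batched(records, batch_size=100_000):
    """Group an iterator of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


example = io.StringIO("@r1 desc\nACGT\n+\nIIII\n@r2\nGGCC\n+\n????\n@r3\nTTAA\n+\nIIII\n")
for number, batch in enumerate(batched(read_fastq(example), batch_size=2)):
    print(number, [read_id for read_id, _, _ in batch])
```

Because both functions are generators, only one batch is held in memory at a time, which is the point of processing 100,000 reads per batch.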

Phase 2: Quality Filtering

  • Apply Q30 filtering to ensure high-quality sequences
  • Remove sequences with any base having Phred score < 30
  • Filter sequences shorter than 100 base pairs

Phase 3: Embedding Generation

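# DNABERT-6 tokenization and embedding; kmer_sequences holds space-separated 6-mers
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
inputs = tokenizer(kmer_sequences, return_tensors="pt", padding=True, truncation=True)
embeddings = model(**inputs)[0][:, 0]  # 768-dim [CLS] vectors; this pooling choice is an assumption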
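
DNABERT's 6-mer model expects each sequence as overlapping 6-mers separated by spaces, so reads need a small preprocessing step before tokenization. The following is a minimal sketch of that step; `to_kmers` is an illustrative name, and whether `embeddings.py` performs it in exactly this way is an assumption.

```python
def to_kmers(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers joined by single spaces."""
    sequence = sequence.upper()
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return " ".join(kmers)


print(to_kmers("ACGTACGTAC"))
```

A read of length n produces n - 5 tokens, so a 100-base read yields 95 6-mers.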

4. Model Architecture

Forward Embeddings: Random Forest Classifier

  • Algorithm: RandomForestClassifier with Optuna hyperparameter optimization
  • Best Parameters:
    • n_estimators: 161
    • max_depth: 21
    • max_features: sqrt
  • Performance: AUC = 0.9753, Accuracy = 97.5%
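
As a rough illustration of how the reported best parameters plug into scikit-learn, the sketch below fits a `RandomForestClassifier` on synthetic data. The data is a made-up stand-in for the 768-dimensional embeddings, the `random_state` values are added for reproducibility, and the Optuna search itself is not reproduced here, so the printed AUC says nothing about the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for 768-dimensional embeddings: two shifted Gaussian clouds.
X = np.vstack([rng.normal(0.0, 1.0, (200, 768)), rng.normal(0.3, 1.0, (200, 768))])
y = np.array([0] * 200 + [1] * 200)  # 0 = non-cancerous, 1 = cancerous

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Best parameters reported above; random_state is an added assumption.
forest = RandomForestClassifier(
    n_estimators=161, max_depth=21, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
print(f"AUC on synthetic data: {auc:.3f}")
```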

Backward Embeddings: Deep Neural Network

  • Architecture: Feedforward Neural Network
    • Input: 768-dimensional DNABERT embeddings
    • Layers: [4096, 2048, 1024, 512, 256, 128]
    • Dropout: 0.3
    • Optimizer: Adam
  • Performance: AUC = 0.9493, Accuracy = 88.05%
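
To give a sense of the network's size, the sketch below counts the weights and biases of fully connected layers with the widths listed above. The single-unit output layer is an assumption, since only the hidden layers are listed here, and `dense_parameter_count` is an illustrative helper. Dropout adds no trainable parameters.

```python
INPUT_DIM = 768                                    # DNABERT embedding size
HIDDEN_LAYERS = [4096, 2048, 1024, 512, 256, 128]  # widths listed above
OUTPUT_DIM = 1                                     # assumed: one logit for binary classification


def dense_parameter_count(layer_sizes):
    """Sum weights (fan_in * fan_out) and biases (fan_out) over consecutive layers."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


sizes = [INPUT_DIM, *HIDDEN_LAYERS, OUTPUT_DIM]
print(f"{dense_parameter_count(sizes):,} trainable parameters")
```

Under these assumptions the network has about 14.3 million parameters, most of them in the 4096-to-2048 layer.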

📊 Results Summary

| Model Type | Embedding Direction | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|---|
| Random Forest (Tuned) | Forward | 97.50% | 0.97-0.98 | 0.97-0.98 | 0.97-0.98 | 0.9753 |
| Neural Network | Backward | 88.05% | 0.87-0.90 | 0.85-0.91 | 0.87-0.89 | 0.9493 |
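
For readers less familiar with the reported metrics, the sketch below shows how each one is defined. The counts and scores are invented for illustration and are not the study's data. The precision, recall, and F1 ranges reported here presumably span the two classes, whereas this sketch computes them for a single positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up counts and scores, for illustration only.
print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

The pairwise AUC is quadratic in the number of samples, so it suits small checks only; library implementations use a sort-based computation instead.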

🛠️ Installation & Setup

Prerequisites

  • Python 3.9.21
  • CUDA-compatible GPU (recommended for DNABERT processing)
  • Azure VM or similar cloud compute environment

Required Libraries

pip install torch transformers
pip install biopython pandas pyarrow numpy scikit-learn
pip install optuna umap-learn
pip install duckdb  # For efficient Parquet querying

DNABERT Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNA_bert_6", trust_remote_code=True)

🔧 Usage

1. Data Preprocessing

# Clean and filter sequences
jupyter notebook cleaning_sequences.ipynb

# Process forward and backward strands
jupyter notebook data_read_forward.ipynb
jupyter notebook data_read_backward.ipynb

2. Generate Embeddings

# Run DNABERT embedding generation
python embeddings.py --input_dir ./clean_sequences --output_dir ./embeddings

3. Train Models

# Train forward embeddings model
jupyter notebook forw_train_df.ipynb

# Train backward embeddings model  
jupyter notebook backw_train_df.ipynb

📁 Data Organization

Cleaned Datasets Structure

clean_forward_reads/     # Forward cancerous sequences (604 batches, 28M sequences)
clean_backward_reads/    # Backward cancerous sequences (604 batches, 9M sequences)  
clean_forward_noncan/    # Forward non-cancerous sequences (553 batches, 5.7M sequences)
clean_backward_noncan/   # Backward non-cancerous sequences (553 batches, 3.8M sequences)

Embedding Files Structure

embeddings/
├── forward_cancerous_embeddings.npy     # 768-dim vectors + metadata CSV
├── forward_noncancerous_embeddings.npy  # 768-dim vectors + metadata CSV
├── backward_cancerous_embeddings.npy    # 768-dim vectors + metadata CSV
└── backward_noncancerous_embeddings.npy # 768-dim vectors + metadata CSV

🔬 Technical Implementation Details

Infrastructure

  • Platform: Azure Virtual Machine (cloud-hosted)
  • Access: SSH with Visual Studio Code remote development
  • Storage: Efficient Parquet format for large genomic datasets
  • Monitoring: Weights & Biases for experiment tracking

Key Technologies

  • DNABERT: Pre-trained transformer for genomic sequences with 6-mer tokenization
  • Optuna: Hyperparameter optimization framework
  • UMAP: Dimensionality reduction for visualization
  • DuckDB: Lightweight querying for large Parquet files

Quality Metrics

  • Total Sequences Processed: 46,968,954 high-quality sequences
  • Quality Threshold: Phred Q30 (1 in 1,000 error rate)
  • Batch Processing: 100,000 sequences per batch for memory efficiency

📈 Key Findings

  1. Bidirectional Approach: Forward and backward embeddings show different separability patterns in UMAP visualization
  2. Model Selection: Random Forest optimal for forward embeddings; Neural Networks better for backward embeddings
  3. Quality Impact: Q30 filtering significantly reduces dataset size but improves classification accuracy
  4. Performance: Both models achieve high accuracy (97.50% and 88.05%), demonstrating the feasibility of genomic-based cancer classification

Future Work

  • Multi-class cancer type classification
  • Integration with clinical data
  • Real-time diagnostic applications
  • Cross-population validation studies

🀝 Contributing

This is an academic research project. For collaboration opportunities or questions, contact the authors listed above.

📚 References

The complete reference list is available in the research paper. Key references include:

  1. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
  2. National Genomics Data Center (NGDC) datasets
  3. Apache Parquet for efficient genomic data storage
  4. Optuna for automated hyperparameter optimization

📄 License

This project is for academic research purposes. Please cite the paper if you use this work in your research.


Last Updated: September 2025
Version: 1.0
Status: Research Complete
