Skip to content

yusufmo1/smiles2spec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SMILES2SPEC: AI-Powered Mass Spectrum Prediction System

MSc AI in Biosciences Dissertation Project
Queen Mary University of London
Author: Yusuf Mohammed
Supervisor: Mohammed Elbadawi

MSc Dissertation Academic Year

Core Technologies

Python Flask SvelteKit TypeScript RDKit Plotly Gunicorn Docker

Machine Learning & Performance

Machine Learning PyTorch scikit-learn Performance Samples Features API Endpoints Accuracy


Quick Navigation

Executive Summary • System Features • Architecture • Installation • API Documentation • Configuration • Technical Specifications • Citation


Executive Summary

SMILES2SPEC is a production-ready full-stack application developed as part of an MSc dissertation exploring the integration of artificial intelligence into electronic laboratory notebooks. The system predicts high-resolution electron ionization (EI) mass spectra directly from SMILES molecular notation, combining advanced machine learning models with a robust web-based architecture to provide real-time spectral prediction capabilities for biomedical research applications.

System Features

Backend

  • Flask 2.x REST API with modular, service-oriented architecture
  • Domain-driven design with separated concerns:
    • API Layer: Routes, middleware, validation schemas
    • Service Layer: Business logic and orchestration
    • Core Layer: Domain models, ML processors, feature extraction
    • Integration Layer: External service abstractions (LLM, databases)
    • Utils Layer: Shared utilities and error handling
  • Advanced Molecular Featurization:
    • 200+ RDKit molecular descriptors
    • 8 different fingerprint types (Morgan, MACCS, RDKit, Avalon, etc.)
    • Electronic properties and structural analysis
    • Parallelised feature extraction for performance
  • Pre-trained Random Forest Model with 2048+ features
  • AI-powered chat assistant with spectrum context and chemical knowledge (google/gemma-3-27b-it:free)
  • SMILES generation from natural language descriptions via OpenRouter LLM integration
  • Comprehensive Export Capabilities:
    • JSON-formatted spectra with peak lists
    • MSP (Mass Spectral Library) format export
    • Molecular structure images (PNG/SVG)
    • Bulk processing for multiple compounds
  • Production-Ready Features:
    • Production WSGI Server: Gunicorn with optimised worker configuration
    • Pydantic schema validation for all inputs/outputs
    • Global error handling with structured responses
    • Health monitoring and service readiness checks
    • CORS configuration for frontend integration
    • Environment-based configuration management
    • Docker containerization with health checks

Pre-processing Pipeline

  • Parallelised feature extraction (joblib)
  • Automatic feature schema generation
  • Variance and NaN filtering
  • Log-scaling and standardisation
  • Support for 2048+ molecular descriptors and fingerprints

Frontend

  • Svelte/SvelteKit with TypeScript and modular architecture
  • Performance-Optimised State Management:
    • Modular store architecture with lazy loading
    • Plot management code loaded on-demand
    • Tree-shaking optimisation for smaller bundles
    • Separate chunks for heavy functionality
  • User-Centred Route Structure:
    • Landing Page (/): Welcome interface with navigation to primary tools
    • Spectral Simulation (/spectral-simulation): Full prediction interface with interactive visualisation
    • How It Works (/how-it-works): Educational content about mass spectrum prediction
    • About (/about): Developer information and project details
    • Chat with Spectrum (/chat-with-spectrum): AI-powered spectral analysis assistance
  • Real-time spectrum visualisation with Plotly.js
  • Interactive molecular structure display
  • Bulk SMILES processing with progress tracking
  • Export capabilities (JSON, MSP, CSV) with format validation
  • Responsive design with modern UI components and accessibility features

System Architecture

Backend Architecture

backend/
├── main.py                    # Application entry point
├── config/                    # Centralized configuration
│   └── settings.py           # Environment-based config
├── api/                      # API layer
│   ├── routes/              # Domain-specific route modules
│   │   ├── prediction.py    # Spectrum prediction endpoints
│   │   ├── export.py        # File export endpoints
│   │   ├── chat.py          # AI chat endpoints
│   │   ├── health.py        # Health check endpoints
│   │   └── upload.py        # File upload endpoints
│   ├── schemas/             # Pydantic validation schemas
│   │   ├── prediction.py    # Prediction request/response schemas
│   │   └── chat.py          # Chat message schemas
│   └── middleware/          # Request/response middleware
│       ├── validation.py    # Input validation decorators
│       └── error_handler.py # Global error handling
├── services/                # Business logic layer
│   └── prediction_service.py # Spectrum prediction orchestration
├── core/                    # Core domain layer
│   ├── models/             # Domain models
│   │   ├── molecule.py     # Molecular data structures
│   │   └── spectrum.py     # Spectrum data structures
│   ├── processors/         # Core processing logic
│   │   ├── feature_processor.py    # Molecular feature extraction
│   │   ├── spectrum_processor.py   # Spectrum processing
│   │   └── feature_preprocessor.py # Feature preprocessing
│   └── ml/                 # Machine learning components
│       └── model_handler.py # ML model operations
├── integrations/           # External service integrations
│   └── llm/               # LLM integration
│       ├── client.py      # LLM API client
│       └── services/      # LLM service implementations
│           ├── chat_service.py    # Chat with spectrum context
│           └── smiles_service.py  # SMILES generation
├── utils/                 # Shared utilities
│   ├── errors.py         # Exception hierarchy
│   ├── logging.py        # Centralized logging
│   ├── chemistry.py      # Chemical utilities
│   ├── data.py           # Data conversion utilities
│   └── formats.py        # File format utilities
└── models/                # ML model files (not in git)
    ├── spectrum_predictor.pkl     # Trained Random Forest model
    ├── feature_preprocessor.pkl   # Feature scaling pipeline
    └── feature_mapping.json       # Feature index mapping

Frontend Architecture

frontend/
├── src/
│   ├── routes/                    # SvelteKit pages
│   │   ├── +page.svelte          # Landing page with project overview
│   │   ├── +layout.svelte        # Root layout
│   │   ├── spectral-simulation/  # Main prediction interface
│   │   ├── about/                # About page
│   │   ├── chat-with-spectrum/   # AI chat interface
│   │   └── how-it-works/         # Documentation
│   └── lib/                      # Core library
│       ├── components/           # Reusable components
│       │   ├── landing/          # Landing page components
│       │   ├── smiles-input/     # SMILES input system
│       │   ├── panels/           # UI panel system
│       │   │   ├── simulation/   # Spectrum visualisation
│       │   │   ├── info/         # Information panels
│       │   │   └── about/        # About panels
│       │   ├── chat/             # Chat components
│       │   └── icons/            # SVG icon library
│       ├── services/             # API and business logic
│       │   ├── api.ts           # Backend API client
│       │   ├── plotlyService.ts # Plotting utilities
│       │   └── chatService.ts   # Chat functionality
│       ├── stores/              # Modular state management
│       │   ├── index.ts         # Main store exports with lazy loading
│       │   ├── appState.ts      # Global application state
│       │   ├── panelStore.ts    # Panel definitions and management
│       │   ├── carouselStore.ts # Panel navigation and carousel
│       │   ├── pageStore.ts     # Page routing and navigation
│       │   ├── plotEffects.ts   # Lazy-loaded plot management
│       │   └── types.ts         # TypeScript type definitions
│       └── styles/              # Global styles
│           ├── theme.css        # CSS variables
│           └── tokens.css       # Design tokens
├── static/                      # Static assets
├── package.json                 # Dependencies
└── vite.config.js              # Build configuration

Key Architectural Principles

  1. Separation of Concerns: Clear boundaries between API, business logic, and data layers
  2. Dependency Injection: Services are injected rather than tightly coupled
  3. Error Handling: Structured exception hierarchy with global error handlers
  4. Validation: Input validation at API boundaries using Pydantic schemas
  5. Testability: Modular design enables comprehensive unit and integration testing
  6. Maintainability: Domain-driven organisation makes code easy to understand and modify
  7. Performance: Lazy loading and code splitting optimise bundle size and initial load times

Backend Components

Core Services

  • PredictionService: Orchestrates spectrum prediction from SMILES strings
  • ChatService: AI-powered chat functionality with spectrum context integration
  • SMILESService: Natural language to SMILES generation using LLMs

Machine Learning Pipeline

  • ModelHandler: Manages trained model loading and inference operations
  • FeatureProcessor: Extracts 2048+ molecular features using RDKit
  • FeaturePreprocessor: Applies scaling, filtering, and standardisation
  • SpectrumProcessor: Converts model outputs to formatted spectra and peak lists

Data Processing & Validation

  • Pydantic Schemas: Comprehensive request/response validation
  • Error Handling: Custom exception hierarchy with structured error responses
  • File Processing: Support for CSV/TXT bulk uploads and MSP exports
  • Image Generation: Molecular structure visualisation (PNG/SVG)

API Architecture

  • Route Blueprints: Domain-separated endpoint organisation
  • Middleware: Global validation, error handling, and CORS configuration
  • Health Monitoring: Service readiness and model status checking

Frontend Performance Architecture

  • Modular Store System: Separated stores for different concerns (app state, panels, navigation)
  • Lazy Loading: Plot management functionality loaded on-demand
  • Code Splitting: Heavy functionality in separate chunks for optimal bundle size
  • Tree Shaking: Unused store logic excluded from production bundles
  • Memory Management: Automatic cleanup of plot resources and event listeners

Installation and Setup

Prerequisites

  • Python 3.10-3.11 (Python 3.12+ has compatibility issues with RDKit)
  • Node.js 18+ with npm
  • RDKit molecular toolkit (installation via conda-forge required)
  • Docker and Docker Compose (optional for containerised deployment)

Backend Setup

Development Mode

cd backend
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Optional: Set up environment variables
echo "OPENROUTER_API_KEY=your_api_key_here" > .env

# Start development server
python main.py

Production Mode (WSGI Server)

cd backend
# Install dependencies including Gunicorn
pip install -r requirements.txt

# Run with Gunicorn (production WSGI server)
gunicorn --config gunicorn.conf.py backend.wsgi:app

# Alternative: Run from parent directory
cd ..
gunicorn --config backend/gunicorn.conf.py backend.wsgi:app

Docker Deployment

# Full stack deployment with production WSGI server
docker-compose up --build
# Backend: http://localhost:5050 (Production WSGI with Gunicorn)
# Frontend: http://localhost:3001

The API will be available at http://localhost:5050

Frontend Setup

cd frontend
npm install
npm run dev

The frontend will be available at http://localhost:5173

API Documentation

Spectrum Prediction

  • POST /predict - Predict mass spectrum from SMILES string
  • GET /structure - Get molecular structure as base64 PNG
  • GET /structure/png - Alternative endpoint for molecular structure

File Operations & Export

  • POST /smiles_bulk - Upload CSV/TXT files with bulk SMILES data
  • POST /export_msp - Export single spectrum in MSP format
  • POST /export_msp_batch - Export multiple spectra as combined MSP file

AI-Powered Features

  • POST /chat - Interactive chat with AI assistant (includes spectrum context)
  • POST /generate_smiles - Generate SMILES from natural language descriptions

System Health & Monitoring

  • GET /health - Comprehensive service health check with model status

Example Usage

# Predict spectrum with detailed response
curl -X POST http://localhost:5050/predict \
  -H "Content-Type: application/json" \
  -d '{"smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"}'

# Response includes:
# - Chemical name and molecular properties
# - High-resolution spectrum data (x/y arrays)
# - Peak list with m/z and intensity values
# - Base64-encoded molecular structure PNG
# - Metadata and processing information

# Get molecular structure image
curl "http://localhost:5050/structure?smiles=CCO"

# Upload bulk SMILES file (CSV or TXT)
curl -X POST http://localhost:5050/smiles_bulk \
  -F "file=@molecules.csv"

# Export spectrum in MSP format (mass spectral library format)
curl -X POST http://localhost:5050/export_msp \
  -H "Content-Type: application/json" \
  -d '{"smiles": "CCO"}' \
  --output ethanol_spectrum.msp

# Chat with AI about spectrum interpretation
curl -X POST http://localhost:5050/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain the fragmentation pattern"}],
    "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "stream": false
  }'

# Generate SMILES from description
curl -X POST http://localhost:5050/generate_smiles \
  -H "Content-Type: application/json" \
  -d '{"description": "anti-inflammatory drug with benzene ring"}'

Configuration

Environment Variables

  • API_HOST - Server host (default: 0.0.0.0)
  • API_PORT - Server port (default: 5050)
  • API_DEBUG - Debug mode (default: true)
  • OPENROUTER_API_KEY - OpenRouter API key for LLM features (optional)
  • LOG_LEVEL - Logging level (default: INFO)

Backend Configuration

API Settings (config/settings.py)

  • host/port: Server binding configuration
  • debug: Development mode settings
  • model_path: Path to trained spectrum predictor model
  • preprocessor_path: Feature preprocessing pipeline location
  • feature_mapping_path: Molecular feature mapping definitions

Production WSGI Configuration (gunicorn.conf.py)

  • Workers: Optimised for ML workload (2 workers default)
  • Worker Class: Sync workers for RDKit/ML compatibility
  • Timeouts: Extended timeout (300s) for ML inference operations
  • Memory Management: Preload app for efficient model loading
  • Logging: Structured production logging with access logs
  • Health Checks: Configurable health monitoring endpoints

Feature Extraction Configuration

  • Molecular Descriptors: 200+ RDKit descriptors (MW, LogP, TPSA, etc.)
  • Fingerprints: Multiple fingerprint types with configurable parameters:
    • Morgan fingerprints (radii 1-3, size 1024)
    • Morgan feature fingerprints (radius 2, size 1024)
    • MACCS keys (166 structural keys)
    • Topological fingerprints (size 1024)
    • RDKit fingerprints (size 2048)
    • Avalon fingerprints (size 1024)
    • Pattern fingerprints (size 1024)
    • Layered fingerprints (size 2048)
  • Count Features: Bond counts, atom counts, ring analysis
  • Electronic Properties: HOMO/LUMO estimation, dipole moments

Spectral Processing

  • m/z Range: Configurable mass-to-charge ratio ranges
  • Resolution: Spectral resolution and binning parameters
  • Peak Detection: Intensity thresholds and peak finding algorithms

LLM Integration

  • Provider: OpenRouter gateway to multiple LLM providers
  • Model: google/gemma-3-27b-it:free (primary) for spectrum interpretation and SMILES generation
  • Temperature: Response creativity (0.7 default)
  • Max Tokens: Response length limits (1000 default)

Technical Specifications

Machine Learning Implementation

The system employs a sophisticated machine learning pipeline developed through extensive experimentation documented in the SMILES2SPEC Foundry research pipeline:

  • Algorithm: Random Forest Regression optimised for spectral prediction
  • Features: 2048+ molecular descriptors and fingerprints including:
    • RDKit molecular descriptors (molecular weight, LogP, TPSA, etc.)
    • Morgan fingerprints (multiple radii for different structural patterns)
    • MACCS keys (166 structural keys for pharmacophore analysis)
    • Topological, Avalon, Pattern, and Layered fingerprints
    • Bond/atom counts and electronic properties
  • Training Data: 2,720 curated electron ionisation mass spectrometry samples
  • Performance: Cosine similarity of 0.8138 achieved through Bayesian optimisation
  • Preprocessing Pipeline:
    • Variance and NaN filtering for feature selection
    • Log-scaling and standardisation for optimal model performance
    • Automatic feature schema generation and validation
    • Parallelised processing for high-throughput applications

Technical Capabilities

  • Request Validation: Comprehensive Pydantic schemas for all endpoints
  • Error Handling: Structured exception hierarchy with meaningful error messages
  • File Processing: Support for CSV/TXT bulk uploads with intelligent parsing
  • Export Formats: MSP (mass spectral library), JSON, and image formats
  • AI Integration: OpenRouter/google/gemma-3-27b-it:free for spectrum interpretation and SMILES generation
  • Performance: Optimised for both single predictions and bulk processing
  • Monitoring: Health checks with model status and service readiness indicators
  • Security: Input sanitisation, file upload validation, and environment-based secrets

Academic Context

This system was developed as one of two complementary implementations demonstrating AI integration into electronic laboratory notebooks. Together with the GUARDIAN pharmaceutical compliance system, it forms the technical foundation of the dissertation "Integrating AI into Electronic Lab Notebooks" submitted for the MSc AI in Biosciences programme at Queen Mary University of London.

Citation

For academic use of this work, please cite:

@mastersthesis{mohammed2025smiles2spec,
  title = {SMILES2SPEC: AI-Powered Mass Spectrum Prediction System for Electronic Laboratory Notebooks},
  author = {Mohammed, Yusuf},
  year = {2025},
  school = {Queen Mary University of London},
  department = {MSc AI in Biosciences},
  supervisor = {Elbadawi, Mohammed},
  note = {MSc Dissertation Project: Integrating AI into Electronic Lab Notebooks}
}

Author Supervisor Institution Programme

Developed as part of MSc AI in Biosciences dissertation at Queen Mary University of London (2025)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published