MSc AI in Biosciences Dissertation Project
Queen Mary University of London
Author: Yusuf Mohammed
Supervisor: Mohammed Elbadawi
SMILES2SPEC is a production-ready full-stack application developed as part of an MSc dissertation on integrating artificial intelligence into electronic laboratory notebooks. It predicts high-resolution electron ionisation (EI) mass spectra directly from SMILES strings, pairing a trained machine learning model with a web-based architecture to provide real-time spectral prediction for biomedical research.
- Flask 2.x REST API with modular, service-oriented architecture
- Domain-driven design with separated concerns:
  - API Layer: Routes, middleware, validation schemas
  - Service Layer: Business logic and orchestration
  - Core Layer: Domain models, ML processors, feature extraction
  - Integration Layer: External service abstractions (LLM, databases)
  - Utils Layer: Shared utilities and error handling
- Advanced Molecular Featurization:
  - 200+ RDKit molecular descriptors
  - 8 different fingerprint types (Morgan, MACCS, RDKit, Avalon, etc.)
  - Electronic properties and structural analysis
  - Parallelised feature extraction for performance
- Pre-trained Random Forest Model with 2048+ features
- AI-powered chat assistant with spectrum context and chemical knowledge (google/gemma-3-27b-it:free)
- SMILES generation from natural language descriptions via OpenRouter LLM integration
- Comprehensive Export Capabilities:
  - JSON-formatted spectra with peak lists
  - MSP (Mass Spectral Library) format export
  - Molecular structure images (PNG/SVG)
  - Bulk processing for multiple compounds
- Production-Ready Features:
  - Production WSGI Server: Gunicorn with optimised worker configuration
  - Pydantic schema validation for all inputs/outputs
  - Global error handling with structured responses
  - Health monitoring and service readiness checks
  - CORS configuration for frontend integration
  - Environment-based configuration management
  - Docker containerization with health checks
- Parallelised feature extraction (joblib)
- Automatic feature schema generation
- Variance and NaN filtering
- Log-scaling and standardisation
- Support for 2048+ molecular descriptors and fingerprints
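The filtering and scaling steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's FeaturePreprocessor: the function names, the variance threshold, and the choice of `log1p` as the log transform are all hypothetical, and the real pipeline presumably operates on arrays rather than nested lists.

```python
import math
from statistics import mean, pstdev, pvariance


def select_columns(rows, min_variance=1e-8):
    """Return indices of columns that contain no NaN and are not (near-)constant."""
    keep = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if any(math.isnan(v) for v in column):
            continue  # NaN filtering
        if pvariance(column) > min_variance:
            keep.append(j)  # variance filtering
    return keep


def log_scale_and_standardise(rows, keep):
    """Apply log1p to the kept columns, then centre each to mean 0 and unit std."""
    logged = [[math.log1p(row[j]) for j in keep] for row in rows]
    columns = list(zip(*logged))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against zero std
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in logged]


# Toy feature matrix (made-up values): column 0 is constant, column 2 has a NaN.
features = [
    [1.0, 0.0, float("nan"), 3.0],
    [1.0, 2.0, 5.0, 9.0],
    [1.0, 4.0, 6.0, 27.0],
]
kept = select_columns(features)
scaled = log_scale_and_standardise(features, kept)
```

The column indices in `kept` play the role of a feature schema: storing them alongside the fitted means and standard deviations lets the same selection and scaling be replayed at prediction time.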
- Svelte/SvelteKit with TypeScript and modular architecture
- Performance-Optimised State Management:
  - Modular store architecture with lazy loading
  - Plot management code loaded on-demand
  - Tree-shaking optimisation for smaller bundles
  - Separate chunks for heavy functionality
- User-Centred Route Structure:
  - Landing Page (/): Welcome interface with navigation to primary tools
  - Spectral Simulation (/spectral-simulation): Full prediction interface with interactive visualisation
  - How It Works (/how-it-works): Educational content about mass spectrum prediction
  - About (/about): Developer information and project details
  - Chat with Spectrum (/chat-with-spectrum): AI-powered spectral analysis assistance
- Real-time spectrum visualisation with Plotly.js
- Interactive molecular structure display
- Bulk SMILES processing with progress tracking
- Export capabilities (JSON, MSP, CSV) with format validation
- Responsive design with modern UI components and accessibility features
backend/
├── main.py  # Application entry point
├── config/  # Centralized configuration
│   └── settings.py  # Environment-based config
├── api/  # API layer
│   ├── routes/  # Domain-specific route modules
│   │   ├── prediction.py  # Spectrum prediction endpoints
│   │   ├── export.py  # File export endpoints
│   │   ├── chat.py  # AI chat endpoints
│   │   ├── health.py  # Health check endpoints
│   │   └── upload.py  # File upload endpoints
│   ├── schemas/  # Pydantic validation schemas
│   │   ├── prediction.py  # Prediction request/response schemas
│   │   └── chat.py  # Chat message schemas
│   └── middleware/  # Request/response middleware
│       ├── validation.py  # Input validation decorators
│       └── error_handler.py  # Global error handling
├── services/  # Business logic layer
│   └── prediction_service.py  # Spectrum prediction orchestration
├── core/  # Core domain layer
│   ├── models/  # Domain models
│   │   ├── molecule.py  # Molecular data structures
│   │   └── spectrum.py  # Spectrum data structures
│   ├── processors/  # Core processing logic
│   │   ├── feature_processor.py  # Molecular feature extraction
│   │   ├── spectrum_processor.py  # Spectrum processing
│   │   └── feature_preprocessor.py  # Feature preprocessing
│   └── ml/  # Machine learning components
│       └── model_handler.py  # ML model operations
├── integrations/  # External service integrations
│   └── llm/  # LLM integration
│       ├── client.py  # LLM API client
│       └── services/  # LLM service implementations
│           ├── chat_service.py  # Chat with spectrum context
│           └── smiles_service.py  # SMILES generation
├── utils/  # Shared utilities
│   ├── errors.py  # Exception hierarchy
│   ├── logging.py  # Centralized logging
│   ├── chemistry.py  # Chemical utilities
│   ├── data.py  # Data conversion utilities
│   └── formats.py  # File format utilities
└── models/  # ML model files (not in git)
    ├── spectrum_predictor.pkl  # Trained Random Forest model
    ├── feature_preprocessor.pkl  # Feature scaling pipeline
    └── feature_mapping.json  # Feature index mapping
frontend/
├── src/
│   ├── routes/  # SvelteKit pages
│   │   ├── +page.svelte  # Landing page with project overview
│   │   ├── +layout.svelte  # Root layout
│   │   ├── spectral-simulation/  # Main prediction interface
│   │   ├── about/  # About page
│   │   ├── chat-with-spectrum/  # AI chat interface
│   │   └── how-it-works/  # Documentation
│   └── lib/  # Core library
│       ├── components/  # Reusable components
│       │   ├── landing/  # Landing page components
│       │   ├── smiles-input/  # SMILES input system
│       │   ├── panels/  # UI panel system
│       │   │   ├── simulation/  # Spectrum visualisation
│       │   │   ├── info/  # Information panels
│       │   │   └── about/  # About panels
│       │   ├── chat/  # Chat components
│       │   └── icons/  # SVG icon library
│       ├── services/  # API and business logic
│       │   ├── api.ts  # Backend API client
│       │   ├── plotlyService.ts  # Plotting utilities
│       │   └── chatService.ts  # Chat functionality
│       ├── stores/  # Modular state management
│       │   ├── index.ts  # Main store exports with lazy loading
│       │   ├── appState.ts  # Global application state
│       │   ├── panelStore.ts  # Panel definitions and management
│       │   ├── carouselStore.ts  # Panel navigation and carousel
│       │   ├── pageStore.ts  # Page routing and navigation
│       │   ├── plotEffects.ts  # Lazy-loaded plot management
│       │   └── types.ts  # TypeScript type definitions
│       └── styles/  # Global styles
│           ├── theme.css  # CSS variables
│           └── tokens.css  # Design tokens
├── static/  # Static assets
├── package.json  # Dependencies
└── vite.config.js  # Build configuration
- Separation of Concerns: Clear boundaries between API, business logic, and data layers
- Dependency Injection: Services are injected rather than tightly coupled
- Error Handling: Structured exception hierarchy with global error handlers
- Validation: Input validation at API boundaries using Pydantic schemas
- Testability: Modular design enables comprehensive unit and integration testing
- Maintainability: Domain-driven organisation makes code easy to understand and modify
- Performance: Lazy loading and code splitting optimise bundle size and initial load times
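The "structured exception hierarchy with global error handlers" principle above can be illustrated with a short sketch. All class names, error codes, and the response shape here are hypothetical; the actual contents of utils/errors.py and the middleware are not shown in this README.

```python
class AppError(Exception):
    """Base class for application errors (hypothetical name)."""

    status_code = 500
    error_code = "internal_error"

    def __init__(self, message, details=None):
        super().__init__(message)
        self.message = message
        self.details = details or {}


class ValidationError(AppError):
    status_code = 400
    error_code = "validation_error"


class InvalidSmilesError(ValidationError):
    error_code = "invalid_smiles"


class ModelNotReadyError(AppError):
    status_code = 503
    error_code = "model_not_ready"


def to_response(exc):
    """Map any exception to a (body, status) pair, as a global handler might."""
    if isinstance(exc, AppError):
        body = {
            "error": {
                "code": exc.error_code,
                "message": exc.message,
                "details": exc.details,
            }
        }
        return body, exc.status_code
    # Unknown exceptions are reported generically so internals are not leaked.
    generic = {"code": "internal_error", "message": "Unexpected server error", "details": {}}
    return {"error": generic}, 500


body, status = to_response(InvalidSmilesError("Could not parse SMILES", {"smiles": "C1CC"}))
```

In a Flask application, a function like `to_response` would typically be registered once through `app.errorhandler`, so that route code only raises exceptions and never builds error payloads itself.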
- PredictionService: Orchestrates spectrum prediction from SMILES strings
- ChatService: AI-powered chat functionality with spectrum context integration
- SMILESService: Natural language to SMILES generation using LLMs
- ModelHandler: Manages trained model loading and inference operations
- FeatureProcessor: Extracts 2048+ molecular features using RDKit
- FeaturePreprocessor: Applies scaling, filtering, and standardisation
- SpectrumProcessor: Converts model outputs to formatted spectra and peak lists
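As a rough illustration of the last step, turning a raw model output into a peak list, the sketch below assumes the model emits one intensity per bin on a uniform m/z grid. The function name, the 1% relative threshold, and the normalisation to a base peak of 100 are assumptions for the example, not details taken from SpectrumProcessor.

```python
def extract_peaks(intensities, mz_start=0.0, bin_width=1.0, min_relative=0.01):
    """Normalise to a base peak of 100 and keep bins above a relative threshold."""
    base = max(intensities, default=0.0)
    if base <= 0:
        return []  # empty or all-zero prediction: no peaks to report
    peaks = []
    for index, raw in enumerate(intensities):
        relative = max(raw, 0.0) / base  # clip negative regression outputs to zero
        if relative >= min_relative:
            peaks.append(
                {
                    "mz": round(mz_start + index * bin_width, 4),
                    "intensity": round(100.0 * relative, 2),
                }
            )
    return peaks


# Made-up prediction vector covering m/z 0 to 5 in unit bins.
predicted = [0.0, 0.002, 0.5, -0.01, 0.25, 1.0]
peaks = extract_peaks(predicted)
```

Clipping negatives matters for regression models such as Random Forests only when targets were transformed before training, but it is a cheap guard either way.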
- Pydantic Schemas: Comprehensive request/response validation
- Error Handling: Custom exception hierarchy with structured error responses
- File Processing: Support for CSV/TXT bulk uploads and MSP exports
- Image Generation: Molecular structure visualisation (PNG/SVG)
- Route Blueprints: Domain-separated endpoint organisation
- Middleware: Global validation, error handling, and CORS configuration
- Health Monitoring: Service readiness and model status checking
- Modular Store System: Separated stores for different concerns (app state, panels, navigation)
- Lazy Loading: Plot management functionality loaded on-demand
- Code Splitting: Heavy functionality in separate chunks for optimal bundle size
- Tree Shaking: Unused store logic excluded from production bundles
- Memory Management: Automatic cleanup of plot resources and event listeners
- Python 3.10-3.11 (Python 3.12+ has compatibility issues with RDKit)
- Node.js 18+ with npm
- RDKit molecular toolkit (installation via conda-forge required)
- Docker and Docker Compose (optional for containerised deployment)
cd backend
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Optional: Set up environment variables
echo "OPENROUTER_API_KEY=your_api_key_here" > .env
# Start development server
python main.py
cd backend
# Install dependencies including Gunicorn
pip install -r requirements.txt
# Run with Gunicorn (production WSGI server)
gunicorn --config gunicorn.conf.py backend.wsgi:app
# Alternative: Run from parent directory
cd ..
gunicorn --config backend/gunicorn.conf.py backend.wsgi:app
# Full stack deployment with production WSGI server
docker-compose up --build
# Backend: http://localhost:5050 (Production WSGI with Gunicorn)
# Frontend: http://localhost:3001
The API will be available at http://localhost:5050
cd frontend
npm install
npm run dev
The frontend will be available at http://localhost:5173
- POST /predict - Predict mass spectrum from SMILES string
- GET /structure - Get molecular structure as base64 PNG
- GET /structure/png - Alternative endpoint for molecular structure
- POST /smiles_bulk - Upload CSV/TXT files with bulk SMILES data
- POST /export_msp - Export single spectrum in MSP format
- POST /export_msp_batch - Export multiple spectra as combined MSP file
- POST /chat - Interactive chat with AI assistant (includes spectrum context)
- POST /generate_smiles - Generate SMILES from natural language descriptions
- GET /health - Comprehensive service health check with model status
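For readers unfamiliar with the MSP format served by the export endpoints above, the following sketch writes a minimal record. It uses the common NIST-style `Name`, `Comments`, and `Num Peaks` fields; which fields the project actually emits is not documented here, and the function names and peak intensities are illustrative placeholders rather than measured or predicted data.

```python
def to_msp(name, peaks, smiles=None):
    """Serialise one spectrum as a minimal MSP record.

    peaks is an iterable of (mz, intensity) pairs.
    """
    peaks = list(peaks)
    lines = [f"Name: {name}"]
    if smiles:
        lines.append(f"Comments: SMILES={smiles}")
    lines.append(f"Num Peaks: {len(peaks)}")
    lines.extend(f"{mz:.4f} {intensity:.2f}" for mz, intensity in peaks)
    return "\n".join(lines) + "\n"


def to_msp_batch(records):
    """Join several records with a blank line between them, as in a combined library file."""
    return "\n".join(to_msp(**record) for record in records)


record = to_msp("Ethanol", [(31.0, 100.0), (45.0, 51.5), (46.0, 21.6)], smiles="CCO")
library = to_msp_batch(
    [
        {"name": "Ethanol", "peaks": [(31.0, 100.0)], "smiles": "CCO"},
        {"name": "Methanol", "peaks": [(31.0, 100.0)], "smiles": "CO"},
    ]
)
```

The blank line between records is what lets library tools split a combined file back into individual spectra.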
# Predict spectrum with detailed response
curl -X POST http://localhost:5050/predict \
-H "Content-Type: application/json" \
-d '{"smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"}'
# Response includes:
# - Chemical name and molecular properties
# - High-resolution spectrum data (x/y arrays)
# - Peak list with m/z and intensity values
# - Base64-encoded molecular structure PNG
# - Metadata and processing information
# Get molecular structure image
curl "http://localhost:5050/structure?smiles=CCO"
# Upload bulk SMILES file (CSV or TXT)
curl -X POST http://localhost:5050/smiles_bulk \
-F "file=@molecules.csv"
# Export spectrum in MSP format (mass spectral library format)
curl -X POST http://localhost:5050/export_msp \
-H "Content-Type: application/json" \
-d '{"smiles": "CCO"}' \
--output ethanol_spectrum.msp
# Chat with AI about spectrum interpretation
curl -X POST http://localhost:5050/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Explain the fragmentation pattern"}],
"smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
"stream": false
}'
# Generate SMILES from description
curl -X POST http://localhost:5050/generate_smiles \
-H "Content-Type: application/json" \
-d '{"description": "anti-inflammatory drug with benzene ring"}'
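The same /predict call can be made from Python with only the standard library. This is a sketch that mirrors the curl example above; the helper names are hypothetical, and the response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request


def build_predict_request(smiles, base_url="http://localhost:5050"):
    """Build (but do not send) a POST /predict request with a JSON body."""
    payload = json.dumps({"smiles": smiles}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(smiles, base_url="http://localhost:5050", timeout=300):
    """Send the request and decode the JSON response (needs a running backend)."""
    request = build_predict_request(smiles, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network, so it is safe to inspect offline.
request = build_predict_request("CC(=O)OC1=CC=CC=C1C(=O)O")
```

With the backend running locally, `predict("CCO")` would return the decoded response body as a dictionary.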
- API_HOST - Server host (default: 0.0.0.0)
- API_PORT - Server port (default: 5050)
- API_DEBUG - Debug mode (default: true)
- OPENROUTER_API_KEY - OpenRouter API key for LLM features (optional)
- LOG_LEVEL - Logging level (default: INFO)
- host/port: Server binding configuration
- debug: Development mode settings
- model_path: Path to trained spectrum predictor model
- preprocessor_path: Feature preprocessing pipeline location
- feature_mapping_path: Molecular feature mapping definitions
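One way to read environment-based configuration like this into a typed object is sketched below. The variable names and defaults are the ones documented above; the `Settings` class, `load_settings` function, and the set of strings treated as true are assumptions and may differ from config/settings.py.

```python
import os
from dataclasses import dataclass
from typing import Optional


def _as_bool(value):
    """Interpret common truthy strings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    debug: bool
    openrouter_api_key: Optional[str]
    log_level: str


def load_settings(env=None):
    """Build Settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return Settings(
        host=env.get("API_HOST", "0.0.0.0"),
        port=int(env.get("API_PORT", "5050")),
        debug=_as_bool(env.get("API_DEBUG", "true")),
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"API_PORT": "8080", "API_DEBUG": "false"})
```

Because the API key is optional, code that depends on it can check `settings.openrouter_api_key is None` and disable the LLM features instead of failing at startup.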
- Workers: Optimised for ML workload (2 workers default)
- Worker Class: Sync workers for RDKit/ML compatibility
- Timeouts: Extended timeout (300s) for ML inference operations
- Memory Management: Preload app for efficient model loading
- Logging: Structured production logging with access logs
- Health Checks: Configurable health monitoring endpoints
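The Gunicorn settings described above map onto a gunicorn.conf.py along these lines. The setting names are Gunicorn's own and the values mirror the list; building the bind address from API_HOST and API_PORT and logging to stdout/stderr are assumptions, not a copy of the project's file.

```python
# gunicorn.conf.py (sketch)
import os

# Address to listen on; assumed to follow the API_HOST / API_PORT variables.
bind = f"{os.environ.get('API_HOST', '0.0.0.0')}:{os.environ.get('API_PORT', '5050')}"

# Two synchronous workers: each holds its own copy of RDKit and the model.
workers = 2
worker_class = "sync"

# Allow long-running inference requests before a worker is restarted.
timeout = 300

# Import the application before forking workers, so the model is loaded
# once in the master process rather than separately in each worker.
preload_app = True

# "-" sends access and error logs to stdout/stderr, which suits containers.
accesslog = "-"
errorlog = "-"
loglevel = os.environ.get("LOG_LEVEL", "INFO").lower()
```

Synchronous workers handle one request at a time, so throughput scales with the worker count; the trade-off is memory, since every additional worker needs room for the model.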
- Molecular Descriptors: 200+ RDKit descriptors (MW, LogP, TPSA, etc.)
- Fingerprints: Multiple fingerprint types with configurable parameters:
  - Morgan fingerprints (radii 1-3, size 1024)
  - Morgan feature fingerprints (radius 2, size 1024)
  - MACCS keys (166 structural keys)
  - Topological fingerprints (size 1024)
  - RDKit fingerprints (size 2048)
  - Avalon fingerprints (size 1024)
  - Pattern fingerprints (size 1024)
  - Layered fingerprints (size 2048)
- Count Features: Bond counts, atom counts, ring analysis
- Electronic Properties: HOMO/LUMO estimation, dipole moments
- m/z Range: Configurable mass-to-charge ratio ranges
- Resolution: Spectral resolution and binning parameters
- Peak Detection: Intensity thresholds and peak finding algorithms
- Provider: OpenRouter gateway to multiple LLM providers
- Model: google/gemma-3-27b-it:free (primary) for spectrum interpretation and SMILES generation
- Temperature: Response creativity (0.7 default)
- Max Tokens: Response length limits (1000 default)
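To show how these LLM settings fit together, here is a sketch of a request body for OpenRouter's OpenAI-compatible chat completions endpoint. The model, temperature, and token limit are the defaults listed above; the helper names, the system prompt wording, and the way spectrum context is injected are assumptions, and the peak values are placeholders.

```python
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_chat_payload(messages, smiles=None, peaks=None,
                       model="google/gemma-3-27b-it:free",
                       temperature=0.7, max_tokens=1000):
    """Assemble a chat completion body with optional spectrum context."""
    context = []
    if smiles:
        context.append(f"SMILES: {smiles}")
    if peaks:
        # Keep only the ten most intense peaks to limit prompt length.
        top = sorted(peaks, key=lambda p: p[1], reverse=True)[:10]
        summary = ", ".join(f"{mz:g} ({inten:g})" for mz, inten in top)
        context.append(f"Top peaks (m/z, intensity): {summary}")
    system = "You are a mass spectrometry assistant."
    if context:
        system += "\n" + "\n".join(context)
    # Assumption: the provider accepts a system role. If it does not, the same
    # context could be prepended to the first user message instead.
    return {
        "model": model,
        "messages": [{"role": "system", "content": system}, *messages],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def build_headers(api_key):
    """HTTP headers for an authenticated OpenRouter request."""
    return {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}


payload = build_chat_payload(
    [{"role": "user", "content": "Explain the fragmentation pattern"}],
    smiles="CCO",
    peaks=[(31, 100.0), (45, 50.0)],
)
```

Sending the context as structured text rather than the full spectrum keeps requests well inside the 1000-token response budget's accompanying prompt limits.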
The machine learning pipeline was developed through the experimentation documented in the SMILES2SPEC Foundry research pipeline:
- Algorithm: Random Forest Regression optimised for spectral prediction
- Features: 2048+ molecular descriptors and fingerprints including:
  - RDKit molecular descriptors (molecular weight, LogP, TPSA, etc.)
  - Morgan fingerprints (multiple radii for different structural patterns)
  - MACCS keys (166 structural keys for pharmacophore analysis)
  - Topological, Avalon, Pattern, and Layered fingerprints
  - Bond/atom counts and electronic properties
- Training Data: 2,720 curated electron ionisation mass spectrometry samples
- Performance: Cosine similarity of 0.8138 achieved through Bayesian optimisation
- Preprocessing Pipeline:
  - Variance and NaN filtering for feature selection
  - Log-scaling and standardisation for optimal model performance
  - Automatic feature schema generation and validation
  - Parallelised processing for high-throughput applications
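The performance figure quoted above is a cosine similarity. A minimal sketch of that metric follows, assuming it is computed between predicted and reference intensity vectors that share the same m/z binning; how the 0.8138 value was aggregated across spectra is not stated in this README, and the function name is hypothetical.

```python
import math


def cosine_similarity(predicted, reference):
    """Cosine of the angle between two intensity vectors on the same m/z grid."""
    if len(predicted) != len(reference):
        raise ValueError("spectra must share the same m/z binning")
    dot = sum(p * r for p, r in zip(predicted, reference))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(sum(r * r for r in reference))
    return dot / norm if norm else 0.0


# Proportional spectra score 1 (up to rounding); spectra with no shared peaks score 0.
identical = cosine_similarity([0, 50, 100], [0, 25, 50])
disjoint = cosine_similarity([100, 0, 0], [0, 0, 100])
```

Because the metric ignores overall scale, it rewards getting relative peak heights right, which is what matters when spectra are normalised to their base peak.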
- Request Validation: Comprehensive Pydantic schemas for all endpoints
- Error Handling: Structured exception hierarchy with meaningful error messages
- File Processing: Support for CSV/TXT bulk uploads with intelligent parsing
- Export Formats: MSP (mass spectral library), JSON, and image formats
- AI Integration: google/gemma-3-27b-it:free, accessed via OpenRouter, for spectrum interpretation and SMILES generation
- Performance: Optimised for both single predictions and bulk processing
- Monitoring: Health checks with model status and service readiness indicators
- Security: Input sanitisation, file upload validation, and environment-based secrets
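The bulk upload parsing mentioned above could look roughly like this. It is a sketch under assumptions: the function name, the header detection rule (a column literally named "smiles"), and the treatment of `#` lines as comments are all hypothetical, and no chemical validation is done here; a real service would presumably check each string with RDKit.

```python
import csv
import io


def parse_smiles_upload(text, filename):
    """Return a de-duplicated, order-preserving list of SMILES from CSV or TXT content."""
    if filename.lower().endswith(".csv"):
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            return []
        header = [cell.strip().lower() for cell in rows[0]]
        if "smiles" in header:
            # Use the named column and skip the header row.
            column, rows = header.index("smiles"), rows[1:]
        else:
            # No recognisable header: assume SMILES are in the first column.
            column = 0
        candidates = [row[column] for row in rows if len(row) > column]
    else:
        # Plain text: one SMILES per line.
        candidates = text.splitlines()

    seen, result = set(), []
    for raw in candidates:
        smiles = raw.strip()
        # A valid SMILES never starts with '#', so such lines are treated as comments.
        if smiles and not smiles.startswith("#") and smiles not in seen:
            seen.add(smiles)
            result.append(smiles)
    return result


from_csv = parse_smiles_upload("name,smiles\nethanol,CCO\nbenzene,c1ccccc1\ndup,CCO\n", "molecules.csv")
from_txt = parse_smiles_upload("CCO\n\n# a comment\nCC#N\n", "molecules.txt")
```

Preserving input order while removing duplicates keeps the results aligned with the user's file, which makes progress reporting and error messages easier to follow.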
This system was developed as one of two complementary implementations demonstrating AI integration into electronic laboratory notebooks. Together with the GUARDIAN pharmaceutical compliance system, it forms the technical foundation of the dissertation "Integrating AI into Electronic Lab Notebooks" submitted for the MSc AI in Biosciences programme at Queen Mary University of London.
For academic use of this work, please cite:
@mastersthesis{mohammed2025smiles2spec,
title = {SMILES2SPEC: AI-Powered Mass Spectrum Prediction System for Electronic Laboratory Notebooks},
author = {Mohammed, Yusuf},
year = {2025},
school = {Queen Mary University of London},
department = {MSc AI in Biosciences},
supervisor = {Elbadawi, Mohammed},
note = {MSc Dissertation Project: Integrating AI into Electronic Lab Notebooks}
}
Developed as part of MSc AI in Biosciences dissertation at Queen Mary University of London (2025)