Demiurge is a modular and fully automated Python-based platform designed to generate machine learning input data from both simulated spectral and structural molecular representations. Specifically developed to support QSPR studies, it processes chemical structures provided as SMILES strings alongside target properties (e.g., CHI logD), and produces ready-to-use feature matrices for classical ML models and deep neural networks.
The tool supports four representation modes: predicted ¹H NMR spectra, ¹³C NMR spectra, concatenated ¹H | ¹³C spectral vectors, and ECFP4 molecular fingerprints. Input files in .csv format are validated for SMILES integrity and formatting. Molecular structures are reconstructed using RDKit or Open Babel (obabel), optimized in 3D, then flattened to 2D to comply with the NMR prediction tools.
NMR spectral predictions are performed locally using a standalone Java-based engine built on the NMRshiftDB2 database, utilizing HOSE-code pattern matching. The resulting chemical shift lists are then transformed into fixed-length spectral vectors through a custom bucketing strategy (200 bins per nucleus by default; the value can be changed in the bucketing module script), enabling compatibility with ML pipelines. In the fused representation, the ¹H and ¹³C vectors are concatenated head to tail.
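
The bucketing idea can be illustrated with a minimal sketch. This is not the code from Demiurge's bucketing module: the function name `bucket_shifts`, the ppm windows, and the choice to count peaks per bin are assumptions made for illustration. Only the default of 200 bins per nucleus and the head-to-tail concatenation come from the description above.

```python
def bucket_shifts(shifts, ppm_min, ppm_max, n_bins=200):
    """Turn a list of chemical shifts (ppm) into a fixed-length count vector.

    Assumption: each bin stores the number of predicted peaks that fall
    into it; shifts outside [ppm_min, ppm_max] are ignored.
    """
    vector = [0] * n_bins
    width = (ppm_max - ppm_min) / n_bins
    for shift in shifts:
        if not ppm_min <= shift <= ppm_max:
            continue
        # Clamp so that a shift exactly at ppm_max lands in the last bin.
        index = min(int((shift - ppm_min) / width), n_bins - 1)
        vector[index] += 1
    return vector


# Hypothetical ppm windows, chosen only for this example.
h_vector = bucket_shifts([0.9, 1.5, 3.6, 3.6], ppm_min=0.0, ppm_max=12.0)
c_vector = bucket_shifts([10.2, 25.8, 64.3], ppm_min=0.0, ppm_max=240.0)

# Fused representation: the 1H vector followed by the 13C vector.
fused = h_vector + c_vector
```

Because every molecule maps to a vector of the same length regardless of how many peaks were predicted, the rows can be stacked directly into a feature matrix.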
For ECFP4 generation, RDKit's Morgan fingerprinting (radius = 2) is used to construct 2048-bit binary descriptors. All generated feature matrices are merged with property labels (e.g., CHI logD), headers are added, and the final datasets are saved as .csv files.
Demiurge is optimized for parallel execution on 8-core CPUs, achieving processing times of ~6 minutes for ¹H NMR spectra, ~15 minutes for ¹³C NMR, and under 2 minutes for ECFP4 on datasets of ~1000 molecules.
The architecture is fully extensible and easily adaptable to other endpoints such as logP, TPSA, or logS, and to other types of spectral or molecular representations.
The tool uses the NMRshiftDB2 predictor, which can be accessed here.
As an example, the script was run to predict 13C NMR spectra from an input file in which one of the rows was malformed. In addition, the file used a comma as the decimal separator and a semicolon as the column separator.
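
How such an input could be normalized is sketched below using only the Python standard library. This is an illustration, not the logic of `csv_checker.py`: the function name `normalize_csv_text`, the delimiter heuristic, and the decision to report malformed rows rather than repair them are all assumptions.

```python
import csv
import io


def normalize_csv_text(text, label_column):
    """Parse CSV text, accepting ';' or ',' as delimiter and ',' as decimal mark.

    label_column is 1-based, as in the --label_column argument.
    Returns (header, valid_rows, rejected), where rejected holds
    (line_number, row) pairs for rows that could not be used.
    """
    # Assumption: the first line is a header and the delimiter is "," or ";".
    header_line = text.splitlines()[0]
    delimiter = ";" if header_line.count(";") > header_line.count(",") else ","

    rows, rejected = [], []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            rejected.append((line_number, row))
            continue
        # Convert a decimal comma to a decimal point in the label column.
        value = row[label_column - 1].strip().replace(",", ".")
        try:
            float(value)
        except ValueError:
            rejected.append((line_number, row))
            continue
        row[label_column - 1] = value
        rows.append(row)
    return header, rows, rejected


sample = "MOLECULE_NAME;SMILES;Label\nCompound1;CCCO;5,3\nCompound2;CCC(=O)O\n"
header, rows, rejected = normalize_csv_text(sample, label_column=3)
```

In this sample the second data row lacks a label value, so it ends up in `rejected` while the first row is kept with its label rewritten as `5.3`.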
- DEMIURGE
- NMR or ECFP4-based Machine Learning Input Generator
- 🖥 Examples of Working Program
- 💡 Key Features
- ✅ Requirements
- ⚙️ Installation
- 🗂 Directory Structure
- 🚀 Usage
- 📄 Command Line Arguments
- 📄 Example Usage
- 📄 Input CSV Format
- ⚙️ Script Workflow for NMR-based Output Data (1H / 13C)
- ⚙️ Script Workflow for ECFP4-based Output Data (FP)
- 🛠 Troubleshooting
- 📜 License
- Molecule Generation: Converts SMILES codes into 3D molecular structures and saves them as flattened 2D `.mol` files using RDKit.
- NMR Spectrum Prediction: Predicts NMR spectra for each molecule using a custom Java-based NMRshiftDB2 predictor.
- ECFP4 Fingerprints Generation: Generates the feature space using ECFP4 fingerprints with radius 2 (activated via `--predictor FP`).
- Bucketization: Converts predicted NMR spectra into a uniform matrix using a bucketing technique.
- Data Merging: Merges the bucketized spectra/fingerprints with property labels to form a consolidated dataset.
- Label Insertion: Adds a target property column to the merged dataset based on a specified label column.
- Custom Headers: Adds headers to the final dataset for easy identification and readability.
- Data Concatenation: Fuses the 1H and 13C representations into a new hybrid representation.
- Optional Cleanup: Deletes all intermediate files and folders to save space and reduce clutter.
Important: The script was tested under Windows 10 using PowerShell and works reliably in this environment on Python 3.11.4. It has not been tested on Linux or other operating systems.
Ensure the following software and libraries are installed:
- Python Libraries:
  - `rdkit`
  - `pandas`
  - `numpy`
  - `art`
  - `tqdm`
Install the required Python packages using:
pip install rdkit pandas numpy tqdm art
Alternatively, run the bundled Python script, which checks whether the required libraries are installed and installs any that are missing:
python install_modules.py
- Open Babel: Open Babel is needed to process "exotic" structures that the standard RDKit library cannot process. Make sure the `obabel` command is available in your system's PATH.
- Java SDK: The Java Development Kit (JDK) is required to compile and run the Java batch processor for NMR spectrum prediction. Make sure the `javac` and `java` commands are available in your system's PATH.
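
A quick way to confirm that the external tools are reachable is to look them up on PATH before starting a run. The helper below is a small sketch that is not part of the repository; the function name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(tools=("obabel", "javac", "java")):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools are available.")
```

`shutil.which` follows the same lookup rules as the shell, so a tool reported here as missing will also fail when the pipeline tries to call it.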
If you prefer Conda, use the provided environment file to create the environment:
conda env create -f conda_environment.yml
conda activate predictor_logD
Then download and install the Java SDK (tested with version 23) and ensure that `java` and `javac` are accessible in your PATH.
Clone the repository from GitHub and navigate to the project directory:
git clone https://github.com/Prospero1988/Demiurge.git
cd Demiurge
The project is organized into the following directories and files:
demiurge/
│
├── demiurge.py # Main script for executing the pipeline
├── input_example.csv # Example of the input file
├── install_modules.py # Installs required Python packages
├── predictor/
│ ├── predictorh.jar # Java-based predictor for 1H spectra [NMRshiftDB2]
│ ├── predictor13C.jar # Java-based predictor for 13C spectra [NMRshiftDB2]
│ ├── cdk-2.9.jar # CDK library required for spectrum prediction.
│ ├── BatchProcessor1H.java # Java batch processor for 1H spectra [NMRshiftDB2]
│ └── BatchProcessor13C.java # Java batch processor for 13C spectra [NMRshiftDB2]
├── logD_predictor_bin/ # Directory containing helper modules
│ ├── csv_checker.py # Verifies and preprocesses CSV files
│ ├── concatenator.py # Concatenate 1H and 13C matrices into new fused matrix
│ ├── gen_mols.py # Generates .mol files from SMILES strings
│ ├── bucket.py # Buckets NMR spectra
│ ├── merger.py # Merges bucketed spectra CSVs
│ ├── labeler.py # Adds labels to the merged spectra file
│ ├── custom_header.py # Adds custom headers to the final dataset
│ ├── fp_generator.py # ECFP4 Fingerprints generator
│ └── model_query.py # Queries machine learning models
└── README.md # Project documentation (this file)
To run the script, use the following command:
python demiurge.py --csv_path <input_csv_file> --predictor <NMR_type> --label_column <column_number> [--clean]
If you are using Conda, remember to activate the environment first:
conda activate predictor_logD
Argument | Required | Accepted Values | Description |
---|---|---|---|
`--csv_path` | ✅ Yes | (any valid CSV path) | Path to the input CSV file containing compound names and SMILES codes. |
`--predictor` | ✅ Yes | `1H`, `13C`, `hybrid`, `FP` | Type of predictor: `1H` or `13C` for NMR, `hybrid` for fused 1H/13C, or `FP` for ECFP4 fingerprints. |
`--label_column` | ✅ Yes | (integer ≥ 1) | Column index (1-based) in the input CSV file that contains the target property values. |
`--clean` | ❌ No | (flag, no value) | If set, the script will delete all intermediate temporary files and folders after execution. |
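
For readers who want to script around this interface, the arguments in the table can be mirrored with `argparse`. The sketch below reproduces only what the table states; it is not taken from `demiurge.py`, and the helper names `positive_int` and `build_parser` are hypothetical.

```python
import argparse


def positive_int(text):
    """Parse a 1-based column index and reject values below 1."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("column index must be >= 1")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI")
    parser.add_argument("--csv_path", required=True,
                        help="Path to the input CSV file")
    parser.add_argument("--predictor", required=True,
                        choices=["1H", "13C", "hybrid", "FP"],
                        help="Representation to generate")
    parser.add_argument("--label_column", required=True, type=positive_int,
                        help="1-based index of the target property column")
    parser.add_argument("--clean", action="store_true",
                        help="Delete intermediate files after execution")
    return parser


args = build_parser().parse_args(
    ["--csv_path", "test.csv", "--predictor", "hybrid", "--label_column", "3", "--clean"]
)
```

Restricting `--predictor` through `choices` means that an unsupported value is rejected before any files are touched.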
python demiurge.py --csv_path 'test.csv' --predictor '1H' --label_column 3 --clean
In this example:
- The script will read the input CSV file `test.csv`.
- It will generate `.mol` files for each molecule based on its SMILES code.
- It will predict the 1H NMR spectra for each molecule.
- The spectra will be bucketized and merged into a single file.
- The target property values from column 3 in `test.csv` will be added as labels.
- All intermediate files and directories will be deleted after execution due to the `--clean` option.
python demiurge.py --csv_path 'test.csv' --predictor 'hybrid' --label_column 3 --clean
In this example:
- The script will read the input CSV file `test.csv`.
- It will generate `.mol` files for each molecule based on its SMILES code.
- It will predict the 1H NMR spectra for each molecule.
- The spectra will be bucketized and merged into a single file.
- The target property values from column 3 in `test.csv` will be added as labels.
- The script will then predict the 13C spectra, bucketize and merge them, and add the property column in the same way as for the 1H data.
- The 1H and 13C matrices will be fused into the concatenated 1H|13C representation.
- All intermediate files and directories will be deleted after execution due to the `--clean` option.
python demiurge.py --csv_path 'test.csv' --predictor 'FP' --label_column 3 --clean
In this example:
- The script will read the input CSV file `test.csv`.
- It will generate `.mol` files for each molecule based on its SMILES code.
- It will calculate ECFP4 fingerprints for each molecule using RDKit.
- The fingerprint matrix will be merged with the target property values from column 3 in `test.csv`.
- All intermediate files and directories will be deleted after execution due to the `--clean` option.
The input CSV file should have at least the following columns:
- `MOLECULE_NAME`: The name or identifier of the molecule.
- `SMILES`: The SMILES code of the molecule.
- `<Label>`: The property values to be modeled (the column must be specified with the `--label_column` parameter).
Example `test.csv`:
MOLECULE_NAME | SMILES | Label |
---|---|---|
Compound1 | CCCO | 5.3 |
Compound2 | CCC(=O)O | 2.1 |
Compound3 | CCN(CC)CC | 7.8 |
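
To make the 1-based meaning of `--label_column` concrete, the sketch below reads the example data and extracts the label for each molecule. It assumes a comma-separated file with the header shown above; the function name `read_labels` is hypothetical and is not part of the Demiurge modules.

```python
import csv
import io

# The example table above, written as comma-separated text.
CSV_TEXT = """MOLECULE_NAME,SMILES,Label
Compound1,CCCO,5.3
Compound2,CCC(=O)O,2.1
Compound3,CCN(CC)CC,7.8
"""


def read_labels(csv_text, label_column):
    """Map each molecule name to its label; label_column is 1-based."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if not 1 <= label_column <= len(header):
        raise ValueError(f"label_column must be between 1 and {len(header)}")
    name_index = header.index("MOLECULE_NAME")
    # Subtract 1 to convert the 1-based column number to a list index.
    return {row[name_index]: float(row[label_column - 1]) for row in reader if row}


labels = read_labels(CSV_TEXT, label_column=3)
```

With `label_column=3` the third column (`Label`) is used, which matches the `--label_column 3` value in the usage examples.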
- Step 1: Generate `.mol` Files:
  - Reads SMILES codes from the input CSV file and generates corresponding `.mol` files using RDKit.
- Step 2: Predict NMR Spectra:
  - Uses the Java-based `BatchProcessor1H` or `BatchProcessor13C` to predict NMR spectra for each molecule. The predictor is part of the NMRshiftDB2 database.
- Step 3: Bucketize Spectra:
  - Converts the predicted spectra into a uniform bucketized matrix for easy analysis and machine learning input generation.
- Step 4: Merge Spectra and Labels:
  - Merges the bucketized spectra with the specified label column from the input CSV file.
- Step 5: Add Custom Headers:
  - Adds descriptive headers to the final merged CSV file, making it easier to interpret and use for machine learning tasks.
- Step 6: Cleanup (Optional):
  - Deletes all intermediate files and directories if the `--clean` option is specified.
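
The merge and header steps of the NMR-based workflow above can be pictured with the following sketch. It is not the code of `merger.py`, `labeler.py`, or `custom_header.py`: the function name `build_dataset`, the column names `LABEL` and `bin_1 ... bin_N`, and the rule of skipping molecules without a label are assumptions for illustration.

```python
def build_dataset(bucketed, labels, n_bins):
    """Combine bucketed vectors and labels into rows under a single header.

    bucketed: dict mapping molecule name -> list of n_bins values
    labels:   dict mapping molecule name -> target property value
    """
    # Hypothetical header layout: identifier, label, then one column per bin.
    header = ["MOLECULE_NAME", "LABEL"] + [f"bin_{i}" for i in range(1, n_bins + 1)]
    rows = []
    for name, vector in bucketed.items():
        if name not in labels:
            continue  # assumption: molecules without a label are left out
        if len(vector) != n_bins:
            raise ValueError(f"{name}: expected {n_bins} bins, got {len(vector)}")
        rows.append([name, labels[name]] + list(vector))
    return [header] + rows


table = build_dataset(
    bucketed={"Compound1": [0, 2, 1, 0], "Compound2": [1, 0, 0, 3]},
    labels={"Compound1": 5.3, "Compound2": 2.1},
    n_bins=4,
)
```

The returned list of rows can be passed to `csv.writer(...).writerows(table)` to produce the final file.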
- Step 1: Generate `.mol` Files:
  - Reads SMILES codes from the input CSV file and generates corresponding `.mol` files using RDKit.
- Step 2: Predict NMR Spectra:
  - Uses the Java-based `BatchProcessor1H` or `BatchProcessor13C` to predict NMR spectra for each molecule. The predictor is part of the NMRshiftDB2 database.
- Step 3: Bucketize Spectra:
  - Converts the predicted spectra into a uniform bucketized matrix for easy analysis and machine learning input generation.
- Step 4: Merge Spectra and Labels:
  - Merges the bucketized spectra with the specified label column from the input CSV file.
- Step 5: Add Custom Headers:
  - Adds descriptive headers to the final merged CSV file, making it easier to interpret and use for machine learning tasks.
- Step 6: Concatenation of 1H and 13C:
  - The script joins the 1H and 13C vectors head to tail. It combines the representations that share the same MOLECULE_NAME and the same label value.
- Step 7: Cleanup (Optional):
  - Deletes all intermediate files and directories if the `--clean` option is specified.
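
The concatenation step of the hybrid workflow above amounts to a join on the molecule name and label followed by appending one vector to the other. The sketch below shows that idea on plain tuples; it is not the code of `concatenator.py`, and the function name `fuse` and the row layout `(name, label, vector)` are assumptions.

```python
def fuse(rows_1h, rows_13c):
    """Join 1H and 13C rows on (name, label) and concatenate their vectors.

    Each row is a tuple (molecule_name, label, vector).
    Rows without a partner in the other matrix are dropped.
    """
    index_13c = {(name, label): vector for name, label, vector in rows_13c}
    fused = []
    for name, label, vector_1h in rows_1h:
        key = (name, label)
        if key in index_13c:
            # Head to tail: 1H values first, then 13C values.
            fused.append((name, label, list(vector_1h) + list(index_13c[key])))
    return fused


rows_1h = [("Compound1", 5.3, [0, 2, 1]), ("Compound2", 2.1, [1, 0, 0])]
rows_13c = [("Compound1", 5.3, [4, 0, 0]), ("Compound3", 7.8, [0, 1, 1])]
hybrid = fuse(rows_1h, rows_13c)
```

Only `Compound1` appears in both inputs with the same label, so it is the only row in `hybrid`, and its vector has the combined length of the two inputs.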
- Step 1: Generate `.mol` Files
  - Reads SMILES strings from the input CSV file and converts them into 2D `.mol` files using RDKit.
- Step 2: Calculate ECFP4 Fingerprints
  - For each molecule, extended-connectivity fingerprints (ECFP4) are generated from the `.mol` structures using the RDKit implementation.
- Step 3: Merge Fingerprints and Labels
  - Merges the computed ECFP4 vectors with the label column (e.g., logD values) from the input CSV file into a unified DataFrame.
- Step 4: Add Custom Headers
  - Assigns descriptive headers to the final CSV file, improving interpretability and downstream machine learning usability.
- Step 5: Cleanup (Optional)
  - Deletes all intermediate files and directories if the `--clean` flag is used.
- Java Compilation Issues:
  - Ensure that the `javac` and `java` commands are available and the Java SDK is installed.
  - If `javac` is not recognized, check the system's `PATH` variable and make sure it includes the path to the JDK `bin` directory.
- Missing Dependencies:
  - Ensure that all required Python libraries (`rdkit`, `pandas`, and `numpy`) are installed.
- File Not Found Errors:
  - Verify the paths to input files and directories. Ensure that the input CSV file and other necessary files are correctly specified.
- Memory or Performance Issues:
  - If handling a large dataset, consider increasing the memory allocation for the Java runtime by adjusting the `-Xmx` parameter in the script.
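
For orientation, the sketch below shows where a `-Xmx` value sits in a Java command line assembled from Python. How Demiurge actually launches the batch processor is not documented here, so the function name `java_command`, the classpath entries, and the default heap size are assumptions. Program arguments are left to the caller because the source does not describe them.

```python
import os


def java_command(main_class, classpath_entries, max_heap="2g", program_args=()):
    """Build an argument list for running a Java class with a heap limit.

    max_heap is passed to the JVM as -Xmx<value>, for example -Xmx4g.
    """
    # os.pathsep is ';' on Windows and ':' elsewhere, as Java expects.
    classpath = os.pathsep.join(classpath_entries)
    return ["java", f"-Xmx{max_heap}", "-cp", classpath, main_class, *program_args]


# Hypothetical classpath based on the files listed in the predictor/ directory.
command = java_command(
    "BatchProcessor13C",
    ["predictor", "predictor/predictor13C.jar", "predictor/cdk-2.9.jar"],
    max_heap="4g",
)
print(" ".join(command))
```

A list built this way can be handed to `subprocess.run(command, check=True)`; raising `max_heap` is the change the last troubleshooting item refers to.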
This project is licensed under the MIT License - see the LICENSE file for details.
📖 For detailed background and scientific context, see the accompanying publication:
(placeholder for DOI and link after publication)