# Synthetic Data Augmentation for the MEDIC Dataset

This repository demonstrates how to augment a multi-task image classification dataset (MEDIC) with synthetic images created by generative diffusion models, with the ultimate goal of improving CNN performance under class imbalance in crisis informatics. We also demonstrate the impressive zero-shot disaster image classification capabilities of large multimodal models, such as OpenAI's GPT-4o.
Due to ethical constraints, we do not include the original disaster images or the generated synthetic images in this repository. Please see the note below on how to obtain them.
- [Overview](#overview)
- [Repository Structure](#repository-structure)
- [Running the Experiments](#running-the-experiments)
- [Datasets and Ethical Constraints](#datasets-and-ethical-constraints)
- [Dependencies](#dependencies)
- [License](#license)
- [Contact](#contact)
## Overview

This codebase explores the use of Stable Diffusion (and related diffusion models) to generate synthetic images for the MEDIC dataset (Alam et al., 2023). The primary aim is to address class imbalance in disaster-related imagery by supplementing underrepresented categories with newly generated images. We also demonstrate how off-the-shelf large language models with vision capabilities (e.g., GPT-4o) can achieve excellent zero-shot classification performance on the same tasks.
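To make the zero-shot setup concrete, here is a minimal sketch of classifying a single image with the `openai` client. The label set and prompt wording below are illustrative assumptions, not the exact prompts used in `01_zero_shot_classification.ipynb`:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative label set; the actual MEDIC tasks and classes are
# defined in the notebooks.
LABELS = [
    "affected_injured_or_dead_people",
    "infrastructure_and_utility_damage",
    "not_humanitarian",
    "rescue_volunteering_or_donation_effort",
]

def classify_image(path: str) -> str:
    """Ask GPT-4o to pick exactly one label for an image (zero-shot)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this disaster image. Answer with exactly "
                         "one of: " + ", ".join(LABELS)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```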
We follow a systematic pipeline:
- Experiment 1:
  - Zero-Shot Classification
- Experiment 2:
  - Relabelling
  - Synthetic Image Generation (using LLMs to generate captions, then passing them to Stable Diffusion; see the sketch below)
  - Augmented Training with a mix of real and synthetic images
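The caption-to-image step can be sketched with the `diffusers` library. This is a minimal illustration assuming a standard Stable Diffusion checkpoint; the model, scheduler, and generation settings actually used in `04_synthetic_image_generation.ipynb` may differ:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a standard Stable Diffusion checkpoint (an assumption; the
# notebooks may use a different model or pipeline configuration).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# A caption of the kind an LLM might produce for an underrepresented class.
caption = "aerial photograph of flooded residential streets after a hurricane"

image = pipe(caption, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("synthetic_flood_example.png")
```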
Notes:

- In the project report, the synthetic data augmentation (Experiment 2 here) is presented first, followed by the zero-shot classification results (Experiment 1 here).
- Experiment 1 was conducted on the relabelled dataset.
## Repository Structure

Below is a high-level view of the repository. The key code for running experiments resides in the `experiments/` directory, which contains two main subfolders:

```
experiments/
├─ experiment1/
│  ├─ 01_zero_shot_classification.ipynb
│  └─ utils.py
└─ experiment2/
   ├─ 01_original_training.ipynb
   ├─ 02_relabelling_pipeline.ipynb
   ├─ 03_train_relabelled.ipynb
   ├─ 04_synthetic_image_generation.ipynb
   └─ 05_train_augmented.ipynb
home/
models/
results/
tensorboard/
```
- `experiment1/`
  - `01_zero_shot_classification.ipynb`: Demonstrates zero-shot classification performance.
  - `utils.py`: Shared utility code for setting up notebooks (logging, seeds, GPU detection, etc.).
- `experiment2/`
  - `01_original_training.ipynb`: Trains a baseline model on the original dataset.
  - `02_relabelling_pipeline.ipynb`: Applies the relabelling step to the dataset.
  - `03_train_relabelled.ipynb`: Retrains the model on the newly relabelled data.
  - `04_synthetic_image_generation.ipynb`: Uses large language models (LLMs) to generate captions and then Stable Diffusion or other models to create synthetic images.
  - `05_train_augmented.ipynb`: Combines real and synthetic data for training, examining the impact on class-imbalanced tasks.
- Other folders
  - `home/`, `models/`, `results/`, `tensorboard/`: Organisational folders for logs, trained model checkpoints, intermediate results, and TensorBoard logs.
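Conceptually, the augmented training step boils down to pooling real and synthetic images into a single training set. The sketch below uses plain `torchvision` datasets with hypothetical directory names for illustration; the notebooks themselves may load data differently (e.g., via DALI pipelines):

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms

# Basic preprocessing; the real pipeline may use different sizes/augmentations.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical directory layout: real MEDIC images and generated images,
# each organised into one subfolder per class. Identical class subfolder
# names are assumed so that label indices line up across both datasets.
real_ds = datasets.ImageFolder("data/medic/train", transform=transform)
synthetic_ds = datasets.ImageFolder("data/synthetic/train", transform=transform)

# Pool both sources into a single training set.
combined = ConcatDataset([real_ds, synthetic_ds])
loader = DataLoader(combined, batch_size=64, shuffle=True, num_workers=4)

for images, labels in loader:
    ...  # standard CNN training step
```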
## Running the Experiments

- Clone the repository (or download the source code).
- Install dependencies (see the [Dependencies](#dependencies) section).
- Set up configuration (e.g., API keys, dataset paths) by editing `config.py` (referenced in the code); an illustrative sketch follows this list.
  - Be careful not to accidentally expose your API keys. For example, do not upload the modified `config.py` to a public repository.
- Open the desired Jupyter notebooks in the `experiments/` subfolders.
- Run each notebook cell by cell, following the instructions at the top of each notebook.
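For orientation, a `config.py` along the following lines is the kind of file the notebooks expect to import. The variable names and values below are illustrative placeholders, not the repository's actual configuration:

```python
# config.py -- illustrative placeholders only; the real variable names are
# defined by the repository's own config.py.
OPENAI_API_KEY = "sk-..."          # keep this file out of version control
ANTHROPIC_API_KEY = "sk-ant-..."
MEDIC_DATA_DIR = "/path/to/MEDIC"
SYNTHETIC_DATA_DIR = "/path/to/synthetic_images"
```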
Note: The recommended order for the main pipeline is:

1. `experiment2/01_original_training.ipynb`
2. `experiment2/02_relabelling_pipeline.ipynb`
3. `experiment2/03_train_relabelled.ipynb`
4. `experiment2/04_synthetic_image_generation.ipynb`
5. `experiment2/05_train_augmented.ipynb`

You can also explore `experiment1/01_zero_shot_classification.ipynb` independently to see alternative baselines.
## Datasets and Ethical Constraints

- Original MEDIC dataset: We do not include the original disaster images here. They are freely available from the MEDIC listing on Papers with Code.
- Synthetic dataset: Our synthetic images are also not included in this repository, due to ethical considerations and restrictions on sharing disaster imagery. If you wish to obtain them, please contact the author and provide:
  - A clear statement of your ethical use case.
  - Written confirmation that the images will not be further redistributed.

Because of the sensitive nature of disaster imagery, we strongly encourage responsible use of these materials and compliance with local IRB or ethics-board guidelines.
## Dependencies

Below is a concise list of Python libraries required to run the experiments end-to-end. Please ensure your environment is properly set up with the following:

- Python Standard Library: `os`, `sys`, `pathlib`, `tempfile`, `random`, `math`, `logging`, `warnings`, `inspect`, `base64`, `collections`, `typing`, `csv`, `re`, `time`, `datetime`, `threading`, `concurrent.futures`, `mimetypes`, `json`, `multiprocessing`
- Third-Party Libraries:
  - `numpy`
  - `torch` (PyTorch)
  - `nvidia-dali`
  - `pandas`
  - `matplotlib`
  - `seaborn`
  - `scikit-learn`
  - `scipy`
  - `IPython`
  - `tueplots`
  - `umap-learn` (optional, for UMAP-based embedding visualisation)
  - `requests`
  - `tqdm`
- LLM and Diffusion Model Integration:
  - `openai` (for GPT-based calls)
  - `anthropic` (for Claude-based calls)
  - (Optionally) `transformers`, `diffusers`, `accelerate` (if running local Stable Diffusion or alternative pipelines)
A typical installation might look like:

```bash
pip install numpy torch nvidia-dali pandas matplotlib seaborn scikit-learn scipy ipython tueplots umap-learn requests tqdm openai anthropic
```
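As a hedged aside: DALI is typically distributed via NVIDIA's own package index rather than PyPI, so if the plain command above fails on `nvidia-dali`, you may need a CUDA-specific wheel along these lines (the exact package name depends on your CUDA version; see NVIDIA's DALI installation documentation):

```bash
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda120
```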
## License

Code in this project is released under the terms of the MIT License.
## Contact

If you have any questions or wish to obtain the synthetic images, please reach out via email at evammun 📧 gmail.com. In the latter case, make sure to provide details about your intended use, and be prepared to sign an agreement not to redistribute the images.

Thank you for your interest in Synthetic Data Augmentation for the MEDIC dataset. We hope you find these experiments interesting and useful for exploring class-imbalance solutions in disaster-related contexts.