Official implementation of DepthClassNet: A Multitask Framework for Monocular Depth Estimation and Texture Classification in Endoscopic Imaging (Abdallah & Raza, MIUA 2025).
Springer • DOI: 10.1007/978-3-031-98691-8_17
DepthClassNet is a novel multitask framework for monocular depth estimation and texture classification in endoscopic (colonoscopy) imaging. It predicts per-pixel depth from a single RGB frame while classifying tissue texture, improving spatial understanding and scene interpretation for downstream clinical research.

DepthClassNet predictions on the UCL dataset

Keywords: monocular depth estimation, depth prediction, endoscopy, colonoscopy, medical imaging, PyTorch, multitask learning, texture classification, Swin Transformer, CLIP
- Python 3.11.10 (recommended)
- PyTorch ≥ 2.2
- See requirements.txt for full dependencies
python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
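After installing, it can help to confirm that the active environment matches the requirements listed above. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the only facts taken from this README are the recommended Python 3.11.10 and the PyTorch ≥ 2.2 requirement. It reads the installed package metadata, so it does not import PyTorch itself.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '2.2.1+cu121' into a tuple like (2, 2, 1).

    Parsing stops at the first non-numeric component (e.g. '0a0').
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(min_torch="2.2"):
    """Report the Python version and whether PyTorch satisfies min_torch."""
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        torch_version = None  # PyTorch is not installed in this environment
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "torch": torch_version,
        "torch_ok": torch_version is not None
        and meets_minimum(torch_version, min_torch),
    }


if __name__ == "__main__":
    report = check_environment()
    print("Python:", report["python"], "(3.11.10 is recommended)")
    print("PyTorch:", report["torch"], "| meets >= 2.2:", report["torch_ok"])
```

Run it inside the activated virtual environment. If `torch_ok` is `False`, re-running the `pip install` step is the first thing to try.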
Once the C3VD and UCL datasets are downloaded, organise them exactly as shown below; the dataloader relies on this layout.
Data/
├── c3vd/
│ ├── cecum_t1_a/
│ │ ├── 0000_color.png
│ │ ├── 0000_depth.tiff
│ │ ├── …
│ │ ├── 0275_color.png
│ │ └── 0275_depth.tiff
│ ├── cecum_t1_b/
│ │ ├── 0000_color.png
│ │ ├── 0000_depth.tiff
│ │ └── …
│ ├── cecum_t2_b/
│ └── trans_t4_a/
├── ucl/
│ ├── C_T3_L2_3_resized_FrameBuffer_0315.png
│ ├── C_T3_L2_3_resized_Depth_0315.png
│ ├── …
│ ├── C_T3_L2_3_resized_FrameBuffer_4515.png
│ └── C_T3_L2_3_resized_Depth_4515.png
└── splits/
  ├── ucl_train.txt
  ├── ucl_val.txt
  └── ucl_test.txt
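Because the dataloader depends on this layout, a quick check that every colour frame has a matching depth map can catch mistakes before training. The sketch below is an illustration only and is not the repository's dataloader. The function name `find_pairs` is hypothetical, and the pairing rules are inferred from the file names in the tree above: `NNNN_color.png` with `NNNN_depth.tiff` for C3VD, and `*_FrameBuffer_NNNN.png` with `*_Depth_NNNN.png` for UCL. It does not read the files in `splits/`, since their format is not described here.

```python
from pathlib import Path


def find_pairs(root):
    """Pair each colour frame under `root` with its expected depth map.

    Returns (pairs, missing): `pairs` is a list of (color_path, depth_path)
    tuples whose depth file exists, and `missing` lists colour frames whose
    expected depth file was not found.
    """
    root = Path(root)
    candidates = []

    # C3VD: one folder per sequence, e.g. c3vd/cecum_t1_a/0000_color.png
    for color in sorted((root / "c3vd").glob("*/*_color.png")):
        depth_name = color.name.replace("_color.png", "_depth.tiff")
        candidates.append((color, color.with_name(depth_name)))

    # UCL: flat folder, e.g. ucl/C_T3_L2_3_resized_FrameBuffer_0315.png
    for color in sorted((root / "ucl").glob("*_FrameBuffer_*.png")):
        depth_name = color.name.replace("_FrameBuffer_", "_Depth_")
        candidates.append((color, color.with_name(depth_name)))

    pairs, missing = [], []
    for color, depth in candidates:
        if depth.is_file():
            pairs.append((color, depth))
        else:
            missing.append(color)
    return pairs, missing


if __name__ == "__main__":
    pairs, missing = find_pairs("Data")
    print(f"{len(pairs)} colour/depth pairs found, {len(missing)} frames without depth")
    for color in missing:
        print("missing depth for:", color)
```

Run it from the directory that contains `Data/`. A non-empty `missing` list usually points to a sequence that was only partly extracted or to files that were renamed.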
Official pretrained weights can be downloaded here: DepthClassNet Checkpoints (OneDrive)
This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
✔ Free for research and educational use.
❌ Commercial use is not permitted.
If you use this code for your research, please cite our paper:
@InProceedings{10.1007/978-3-031-98691-8_17,
author="Abdallah, Bashayer
and Raza, Shan E. Ahmed",
editor="Ali, Sharib
and Hogg, David C.
and Peckham, Michelle",
title="DepthClassNet: A Multitask Framework for Monocular Depth Estimation and Texture Classification in Endoscopic Imaging",
booktitle="Medical Image Understanding and Analysis",
year="2026",
publisher="Springer Nature Switzerland",
address="Cham",
pages="230--246",
abstract="Monocular depth estimation can play a critical role in medical imaging, providing spatial information that enhances diagnostic accuracy and supports precise surgical interventions. The texture classification in the endoscopic images significantly contributes to the differentiation of tissue types and the identification of pathological changes. Building on this knowledge, we introduce DepthClassNet, an innovative multitask framework designed to simultaneously perform monocular depth estimation and texture classification in endoscopic imaging. Our approach employs a tri-encoder model to integrate RGB images, edge maps, and textual descriptions. The architecture comprises a SWIN transformer as an image encoder, a convolutional neural network (CNN) as an edge encoder, and a modified CLIP text encoder for embedding class textual descriptions. Features from the image and edge encoders are effectively combined via a Feature Fusion Module (FFM), and high-resolution depth outputs are reconstructed through a decoder and depth projection block. We introduce an image embedding block that converts visual data from the SWIN encoder into embeddings that align with CLIP text embeddings. The classification head then computes similarity scores, scales them by a learnable temperature $t$, and converts them into probabilities. By designing a loss function that combines depth, edge and classification losses with specific weights, our multitask architecture achieves state-of-the-art results on the Colonoscopy Depth - UCL dataset for depth estimation and texture classification.",
isbn="978-3-031-98691-8"
}