Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr Muhammad Atif Tahir, Professor and Program Coordinator, Graduate & Postgraduate Programs (CS)

Keywords

Visual Question Answering, Gastrointestinal Imaging, Multimodal Explanations, Vision-Language Models, Grad-CAM, Multi-Task Learning

Abstract

Visual Question Answering (VQA) systems hold significant potential for gastrointestinal (GI) endoscopy diagnostics, yet their clinical adoption is hindered by their “black box” nature: they return answers without verifiable justification, undermining clinician trust. This project develops a compact VQA system for GI imaging that produces accurate answers and coherent multimodal explanations. Building on Florence-2, a small vision-language model, we first show that naive VQA fine-tuning yields two failure modes: a lack of visual grounding evidence (answers that are correct for the wrong visual reasons) and degraded free-form captioning. We address these through a progression of solutions. Solution 1 introduces a multi-task learning regime that jointly performs VQA, region-of-interest (ROI) mask prediction, and explanation generation, using pseudo-masks from SegCLIP and explanation targets distilled from a large language model. This improves grounding but introduces noisy pseudo-masks and explanation-induced hallucination. Solution 2 replaces text-prompted segmentation with Grad-CAM maps from a CNN classifier initialised from a foundation model pre-trained on the GastroNet-5M corpus, and decouples answering from explanation by captioning the ROI rather than explaining the answer directly. All training and data generation use the Kvasir-VQA family of datasets. On held-out GI VQA, our multi-task variant raises polyp segmentation IoU from 0.40 to 0.71 while preserving caption quality, and produces visibly sharper text-to-visual grounding than a simple-VQA baseline. We further outline a DINOv2-based extension for multi-focal lesions and a decision-support prototype.
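For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of intersection-over-union (IoU) between a predicted and a reference binary mask. The function name `mask_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the project's actual evaluation code is not shown in this record.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape.

    Hypothetical helper for illustration; not taken from the project.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)


# Toy example: the prediction covers two pixels, the reference covers one of them.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = mask_iou(predicted, reference)
```

In this toy case the masks share one pixel out of a union of two, so the score is 0.5.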

Tools and Technologies Used

Python, PyTorch, Hugging Face Transformers, PEFT (LoRA), Microsoft Florence-2, Gemma 3 27B, Google Gemini API, Hugging Face Hub, Hugging Face Datasets, OpenCV (cv2), NumPy, Pandas, Pillow (PIL), Matplotlib, NLTK, BLEU, ROUGE, METEOR (evaluate library), WandB, KaggleHub, Jupyter Notebook, GastroNet-5M (DINOv1/ResNet-50), Kvasir-VQA-x1 Dataset, CUDA, dotenv

Methodology

The methodology starts from Florence-2 as the base vision-language model. It was first fine-tuned on the Kvasir-VQA-x1 dataset with standard VQA training, which exposed failures in visual grounding and in captioning quality. To address these, a multi-task learning approach was introduced that jointly trains the model on VQA, segmentation mask prediction using SegCLIP-generated pseudo-masks, and rich explanation generation using Gemma 3 API responses as targets. This in turn introduced noisy masks and hallucinated explanations. A second solution therefore replaced the pseudo-masks with Grad-CAM activation maps from a dedicated CNN classifier initialised from a foundation model pre-trained on the GastroNet-5M corpus of approximately five million GI images, and decoupled explanation from answering by captioning the identified region of interest separately rather than generating explanations end-to-end. Finally, because only a single lesion instance was being grounded in multi-focal cases, DINOv2 patch-level similarity matrices are proposed to surface additional visually similar regions across the image.
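The two grounding steps described above can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the project's implementation: `grad_cam` applies the standard Grad-CAM weighting to activations and gradients that are assumed to have already been extracted from the classifier's last convolutional layer, and `similar_patches` builds a cosine-similarity matrix over patch embeddings (such as those a DINOv2 backbone would produce) and returns the patches that resemble a chosen seed patch. Both function names, the seed-patch idea, and the 0.8 threshold are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heat map from (K, H, W) activations and class-score gradients."""
    acts = np.asarray(activations, dtype=np.float64)
    grads = np.asarray(gradients, dtype=np.float64)
    if acts.ndim != 3 or acts.shape != grads.shape:
        raise ValueError("expected two arrays of identical shape (K, H, W)")

    # Channel weights: gradients averaged over the spatial dimensions.
    weights = grads.mean(axis=(1, 2))            # shape (K,)
    # Weighted sum of activation maps, then ReLU to keep positive evidence.
    cam = np.tensordot(weights, acts, axes=1)    # shape (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    # Scale to [0, 1] unless the map is entirely zero.
    return cam / peak if peak > 0 else cam


def similar_patches(patch_embeddings, seed_index, threshold=0.8):
    """Indices of patches whose cosine similarity to the seed patch >= threshold.

    The threshold value is an assumption chosen for illustration.
    """
    emb = np.asarray(patch_embeddings, dtype=np.float64)   # shape (N, D)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T                              # shape (N, N)
    return np.flatnonzero(similarity[seed_index] >= threshold)


# Toy inputs: two 2x2 activation maps with opposite-signed gradients.
acts = np.zeros((2, 2, 2))
acts[0, 0, 0] = 2.0
acts[1, 1, 1] = 4.0
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
heat_map = grad_cam(acts, grads)

# Toy embeddings: patches 0 and 1 point in nearly the same direction.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
matches = similar_patches(patches, seed_index=0)
```

In a setting like the one described, the seed patch could be taken at the peak of the Grad-CAM map, so that the similarity row points to other regions resembling the grounded lesion; how the project selects seeds and thresholds is not stated in this record.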

Document Type

Restricted Access

Submission Type

BSCS Final Year Project

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
