Degree
Bachelor of Science (Computer Science)
Department
Department of Computer Science
School
School of Mathematics and Computer Science (SMCS)
Advisor
Dr Muhammad Atif Tahir, Professor and Program Coordinator, Graduate & Postgraduate Programs (CS)
Keywords
Visual Question Answering, Gastrointestinal Imaging, Multimodal Explanations, Vision-Language Models, Grad-CAM, Multi-Task Learning
Abstract
Visual Question Answering (VQA) systems hold significant potential for gastrointestinal (GI) endoscopy diagnostics, yet their clinical adoption is hindered by their “black box” nature: they return answers without verifiable justification, undermining clinician trust. This project develops a compact VQA system for GI imaging that produces accurate answers and coherent multimodal explanations. Building on Florence-2, a small vision-language model, we first show that naive VQA fine-tuning yields two failure modes: weak visual grounding (answers that are correct for the wrong visual reasons) and degraded free-form captioning. We address these through a progression of solutions. Solution 1 introduces a multi-task learning regime that jointly performs VQA, region-of-interest (ROI) mask prediction, and explanation generation, using pseudo-masks from SegCLIP and explanation targets distilled from a large language model. This improves grounding but introduces noisy pseudo-masks and explanation-induced hallucination. Solution 2 replaces text-prompted segmentation with Grad-CAM maps from a CNN classifier initialised from a foundation model pre-trained on the GastroNet-5M corpus, and decouples answering from explanation by captioning the ROI rather than explaining the answer directly. All training and data generation use the Kvasir-VQA family of datasets. On held-out GI VQA data, our multi-task variant raises polyp segmentation IoU from 0.40 to 0.71 while preserving caption quality, and produces visibly sharper text-to-visual grounding than a simple-VQA baseline. We further outline a DINOv2-based extension for multi-focal lesions and a decision-support prototype.
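For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of intersection over union (IoU) for two binary masks. It is illustrative only and is not taken from the project code: the function name mask_iou, the nested-list mask representation, the toy masks, and the convention that two empty masks score 1.0 are all assumptions.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumption made for this
    sketch; real pipelines would typically use arrays or tensors).
    """
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t   # pixel is set in both masks
            union += p or t           # pixel is set in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy 3x3 masks (hypothetical data, not from the Kvasir-VQA datasets).
predicted_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]
reference_mask = [
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
]
iou = mask_iou(predicted_mask, reference_mask)  # 3 shared pixels / 5 in the union
```

An IoU of 1.0 means the predicted and reference regions coincide exactly, while 0.0 means they do not overlap at all.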
Tools and Technologies Used
Python, PyTorch, Hugging Face Transformers, PEFT (LoRA), Microsoft Florence-2, Gemma 3 27B, Google Gemini API, Hugging Face Hub, Hugging Face Datasets, OpenCV (cv2), NumPy, Pandas, Pillow (PIL), Matplotlib, NLTK, BLEU, ROUGE, METEOR (evaluate library), WandB, KaggleHub, Jupyter Notebook, GastroNet-5M (DINOv1/ResNet-50), Kvasir-VQA-x1 Dataset, CUDA, dotenv
Methodology
The methodology begins with Florence-2 as the base vision-language model, initially fine-tuned on the Kvasir-VQA-x1 dataset using standard VQA training, which revealed failures in visual grounding and captioning quality. To address this, a multi-task learning approach was introduced, jointly training the model on VQA, segmentation mask prediction using SegCLIP-generated pseudo-masks, and explanation generation using Gemma 3 responses as targets. However, this introduced noisy masks and hallucinated explanations. A second solution therefore replaced the pseudo-masks with Grad-CAM activation maps from a dedicated CNN classifier, initialised from a foundation model pre-trained on approximately five million GI images (GastroNet-5M), and decoupled explanation from answering by separately captioning the identified region of interest rather than generating explanations end-to-end. Finally, because only a single lesion instance was being grounded in multi-focal cases, the use of DINOv2 patch-level similarity matrices is proposed to surface additional visually similar regions across the image.
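The patch-level similarity idea proposed for multi-focal lesions can be sketched as follows. This is a hedged illustration under stated assumptions, not the project's implementation: in the proposed extension the patch embeddings would come from DINOv2, whereas here they are small hand-written vectors, and the function names, the seed-patch selection, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(patch_features):
    """Pairwise cosine similarity between all patch embeddings."""
    return [[cosine_similarity(a, b) for b in patch_features]
            for a in patch_features]


def similar_patches(patch_features, seed_index, threshold=0.8):
    """Indices of patches that resemble the seed patch.

    The seed would be a patch inside an already grounded region (for
    example, the peak of an activation map); the threshold is an
    illustrative assumption.
    """
    seed_row = similarity_matrix(patch_features)[seed_index]
    return [i for i, score in enumerate(seed_row)
            if i != seed_index and score >= threshold]


# Toy 2-D "embeddings" for four patches (hypothetical values).
toy_features = [
    [1.0, 0.0],  # patch 0: seed patch inside the grounded lesion
    [0.9, 0.1],  # patch 1: visually similar to the seed
    [0.0, 1.0],  # patch 2: dissimilar background
    [1.0, 0.2],  # patch 3: visually similar to the seed
]
candidates = similar_patches(toy_features, seed_index=0)
```

In this toy case patches 1 and 3 exceed the threshold and would be surfaced as additional candidate regions, while patch 2 is ignored.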
Document Type
Restricted Access
Submission Type
BSCS Final Year Project
Recommended Citation
Safwan, I., Khan, R., & Ahmedani, A. (2026). Visual Question Answering (with multimodal explanations) for Gastrointestinal Imaging. Retrieved from https://ir.iba.edu.pk/fyp-bscs/31
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.