Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2024

Supervisor

Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi

Keywords

Urdu Retrieval-Augmented Generation (RAG), Diabetes Chatbot, Language Model Evaluation

Abstract

This project develops an Urdu Retrieval-Augmented Generation (RAG) chatbot that provides accurate and accessible diabetes-related information for patients and healthcare providers at Indus Hospital. A knowledge base was built from data provided by the hospital and stored in the ChromaDB vector database, with embeddings generated by the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model. Retrieval used both BM25 and Chroma queries.
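As an illustration of the BM25 side of retrieval, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the project's implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, whitespace tokenisation, and the three sample passages are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical passages, tokenised on whitespace for simplicity.
docs = [
    "insulin lowers blood glucose".split(),
    "regular exercise helps control blood pressure".split(),
    "metformin is a common diabetes medication".split(),
]
query = "blood glucose insulin".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here `best` is the index of the highest-scoring passage. In a hybrid setup, a ranking like this would typically be considered alongside the nearest neighbours returned by the vector store; how the two result sets were combined in this project is not stated in the abstract.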

Three language models—Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and BioMistral-7B—were evaluated for response generation using a dataset of 27 diabetes-focused question-answer pairs curated by medical professionals. The models were compared based on accuracy, relevance, and computational efficiency, with Llama-3.2-3B-Instruct selected as the optimal model due to its superior performance and reasonable response time.

To cater to Urdu-speaking users, the pipeline was extended with a translation layer using Meta's NLLB model, enabling bidirectional translation between English and Urdu. While this approach successfully adapted the chatbot for a low-resource language, challenges were encountered in assessing the accuracy of Urdu responses due to the lack of comprehensive evaluation frameworks for translations.
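The translation layer described above can be pictured as a thin wrapper around the English pipeline: translate the question in, answer in English, translate the answer out. The sketch below shows only that control flow. The function `answer_urdu_question` and the dictionary-backed stand-ins are hypothetical; in the real system the two translators would call the NLLB model and the generator would call the RAG pipeline, neither of which is reproduced here.

```python
from typing import Callable

# A translator or generator is modelled as a plain string-to-string function.
TextFn = Callable[[str], str]


def answer_urdu_question(
    question_ur: str,
    ur_to_en: TextFn,
    en_to_ur: TextFn,
    generate_en: TextFn,
) -> str:
    """Translate the Urdu question to English, answer it, translate back."""
    question_en = ur_to_en(question_ur)
    answer_en = generate_en(question_en)
    return en_to_ur(answer_en)


# Stand-in components (assumptions) so the sketch runs without any models.
QUESTION_UR = "ذیابیطس کیا ہے؟"
ANSWER_EN = "Diabetes is a chronic condition."
ANSWER_UR = "ذیابیطس ایک دائمی بیماری ہے۔"

UR_TO_EN = {QUESTION_UR: "What is diabetes?"}
ANSWERS = {"What is diabetes?": ANSWER_EN}
EN_TO_UR = {ANSWER_EN: ANSWER_UR}

reply = answer_urdu_question(
    QUESTION_UR,
    UR_TO_EN.__getitem__,
    EN_TO_UR.__getitem__,
    ANSWERS.__getitem__,
)
```

Passing the components in as functions keeps the English pipeline unchanged and lets each translation direction be swapped or evaluated on its own, which matters given the difficulty of assessing Urdu output noted above.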

The findings of this project emphasize the potential of mid-sized language models for task-specific applications and highlight the need for continued research into low-resource language support and optimization for real-time healthcare scenarios. This project not only provides a practical solution for delivering diabetes-related information but also serves as a foundation for developing conversational AI systems tailored to resource-constrained settings.

Keywords: Urdu Retrieval-Augmented Generation (RAG), Diabetes Chatbot, Multilingual NLP, Low-Resource Languages, Meta NLLB, Llama Models, Healthcare AI, Language Model Evaluation

Document Type

Restricted Access

Submission Type

Research Project

Available for download on Tuesday, July 07, 2026

The full text of this document is only accessible to authorized users.
