Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/School
School of Mathematics and Computer Science (SMCS)
Date of Submission
Fall 2024
Supervisor
Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi
Keywords
Urdu Retrieval-Augmented Generation (RAG), Diabetes Chatbot, Language Model Evaluation
Abstract
This project develops an Urdu Retrieval-Augmented Generation (RAG) chatbot that provides accurate and accessible diabetes-related information for patients and healthcare providers at Indus Hospital. Using data provided by the hospital, a knowledge base was built in the ChromaDB vector database from embeddings generated by the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model. Retrieval used both BM25 and Chroma queries.
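As a rough illustration of the BM25 side of the retrieval step, the following is a minimal sketch written from scratch in standard-library Python. It is not the project's code: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the function names, and the three sample passages are all assumptions chosen for illustration, and the passages are placeholders rather than hospital data.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization penalizes longer-than-average documents.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Placeholder passages, invented for this example only.
docs = [
    "insulin lowers blood glucose levels",
    "regular exercise helps manage type 2 diabetes",
    "the hospital cafeteria opens at nine",
]
scores = bm25_scores("blood glucose insulin", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

In a setup like the one described, keyword scores of this kind would sit alongside the vector-similarity results returned by the Chroma queries. How the two result sets were merged is not stated in the abstract.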
Three language models—Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and BioMistral-7B—were evaluated for response generation using a dataset of 27 diabetes-focused question-answer pairs curated by medical professionals. The models were compared based on accuracy, relevance, and computational efficiency, with Llama-3.2-3B-Instruct selected as the optimal model due to its superior performance and reasonable response time.
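A comparison of this kind can be sketched as a small evaluation loop. The abstract does not say how accuracy or relevance were scored, so the token-overlap F1 metric below is an assumption standing in for whatever measure was actually used. `evaluate`, `token_f1`, and `echo_model` are hypothetical names, and the single question-answer pair is an invented placeholder, not one of the 27 curated pairs.

```python
import time
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(generate, qa_pairs):
    """Score one model (any callable: question -> answer) on QA pairs."""
    scores = []
    start = time.perf_counter()
    for question, reference in qa_pairs:
        scores.append(token_f1(generate(question), reference))
    elapsed = time.perf_counter() - start
    return {
        "mean_f1": sum(scores) / len(scores),
        "seconds_per_question": elapsed / len(qa_pairs),
    }


# Placeholder data and a stand-in "model" for demonstration only.
qa_pairs = [("What does insulin do?", "insulin lowers blood glucose")]


def echo_model(question):
    return "insulin lowers blood glucose"


result = evaluate(echo_model, qa_pairs)
print(result)
```

Running the same loop once per candidate model would give a comparable quality score and an average response time for each, which mirrors the quality-versus-latency trade-off described above.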
To serve Urdu-speaking users, the pipeline was extended with a translation layer built on Meta's NLLB model, which translates in both directions between English and Urdu. This approach successfully adapted the chatbot to a low-resource language, but the accuracy of the Urdu responses proved difficult to assess because comprehensive evaluation frameworks for translations are lacking.
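The translation layer described above can be pictured as a thin wrapper around an English-only RAG pipeline. The sketch below shows only that control flow. `answer_in_urdu` and the three stub functions are hypothetical, the `[ur]` tag is a placeholder for real Urdu text, and in an actual system the two translation callables would wrap an NLLB model and the answering callable would wrap retrieval plus generation.

```python
from typing import Callable


def answer_in_urdu(
    question_ur: str,
    to_english: Callable[[str], str],
    to_urdu: Callable[[str], str],
    answer_english: Callable[[str], str],
) -> str:
    """Translate Urdu -> English, answer in English, translate back to Urdu."""
    question_en = to_english(question_ur)
    answer_en = answer_english(question_en)
    return to_urdu(answer_en)


# Stubs for demonstration only; they do no real translation or retrieval.
def stub_to_english(text: str) -> str:
    return text.replace("[ur] ", "", 1)


def stub_to_urdu(text: str) -> str:
    return "[ur] " + text


def stub_rag(question: str) -> str:
    return "Answer to: " + question


reply = answer_in_urdu(
    "[ur] What is diabetes?", stub_to_english, stub_to_urdu, stub_rag
)
print(reply)
```

Keeping the translators as injected callables means the English pipeline stays unchanged, and each translation direction can be inspected separately, which matters given the evaluation difficulty noted above.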
The findings of this project emphasize the potential of mid-sized language models for task-specific applications and highlight the need for continued research into low-resource language support and optimization for real-time healthcare scenarios. This project not only provides a practical solution for delivering diabetes-related information but also serves as a foundation for developing conversational AI systems tailored to resource-constrained settings.
Keywords: Urdu Retrieval-Augmented Generation (RAG), Diabetes Chatbot, Multilingual NLP, Low-Resource Languages, Meta NLLB, Llama Models, Healthcare AI, Language Model Evaluation
Document Type
Restricted Access
Submission Type
Research Project
Recommended Citation
Chughtai, A. (2024). Development of an Urdu-Language Retrieval-Augmented Generation (RAG) chatbot for Diabetes FAQs (Unpublished graduate research project). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/research-projects-msds/45
demo video.mp4 (36687 kB)
Code Files.zip (64 kB)