On Leveraging Quantized LLMs for Sentence Boundary Prediction in Transcribed Urdu Text
Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/School
School of Mathematics and Computer Science (SMCS)
Date of Submission
Fall 2023
Supervisor
Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi
Keywords
LLMs, transcriptions, Urdu language, punctuations, sentence boundaries, one-shot learning, fine-tuning
Abstract
In speech-to-text applications, the generated transcriptions often contain textual defects such as misplaced punctuation and missing sentence boundaries. The problem is more pronounced for low-resource languages like Urdu. Researchers have tried rule-based and machine learning methods with limited success. However, no work has been reported that employs Large Language Models (LLMs).
This project, therefore, aims to address the sentence boundary problem in Urdu transcriptions using LLMs. Because of resource limitations, only smaller LLMs (with 13 billion or fewer parameters) are compared for this task.
Different techniques have been implemented to improve the quality of sentence boundary detection in Urdu text. Specifically, one-shot learning and fine-tuning have been used to obtain correctly punctuated Urdu sentences. LLMs such as Llama2, Vicuna and Falcon were used in the one-shot learning experiments. The results suggested that the Vicuna 13b model performed best among the tested models.
On the other hand, fine-tuning the Llama2-7b model with the QLoRA method did not produce encouraging results. The punctuated Urdu sentences generated by the fine-tuned model did not always match the correct sentences and instead contained words that do not belong to Urdu.
Document Type
Restricted Access
Submission Type
Research Project
Recommended Citation
Rizvi, S. (2023). On Leveraging Quantized LLMs for Sentence Boundary Prediction in Transcribed Urdu Text (Unpublished graduate research project). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/research-projects-msds/19
The full text of this document is only accessible to authorized users.