On Leveraging Quantized LLMs for Sentence Boundary Prediction in Transcribed Urdu Text

Student Name

Syed Asad Rizvi

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty / School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2023

Supervisor

Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi

Abstract

In speech-to-text applications, the generated transcriptions often contain textual issues such as improper placement of punctuation marks and the absence of sentence boundaries. The issue is more pronounced for low-resource languages such as Urdu. Researchers have tried rule-based and machine learning methods with limited success, and no work has been reported that employs Large Language Models (LLMs).

This project, therefore, aims to address the Urdu transcription problem using LLMs. Due to resource limitations, only smaller LLMs (with 13 billion or fewer parameters) are compared for this task.
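As a minimal sketch of how a model of this size can be loaded within such resource limits, the snippet below loads a 13B checkpoint in 4-bit precision with Hugging Face transformers and bitsandbytes; the checkpoint name and quantization settings are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch: loading a ~13B-parameter model in 4-bit precision with
# Hugging Face transformers + bitsandbytes. The checkpoint and quantization
# settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "lmsys/vicuna-13b-v1.5"  # assumed checkpoint; any <=13B model fits this setup

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit a single consumer GPU
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```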

Different techniques have been implemented to enhance sentence boundary detection quality in Urdu text. In particular, one-shot learning and fine-tuning have been used to obtain correctly punctuated Urdu sentences. LLMs such as Llama2, Vicuna, and Falcon were used in the one-shot learning experiments, and the results suggest that the Vicuna 13B model performed best among all the tested models.
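The sketch below illustrates the general shape of a one-shot prompt for this task, reusing the quantized model and tokenizer loaded above; the instruction wording, the placeholder example pair, and the generation settings are assumptions for illustration, not the exact prompt used in the experiments.

```python
# One-shot prompt sketch: a single solved example followed by the unpunctuated
# input. The instruction wording and placeholders are illustrative only.
def build_one_shot_prompt(example_raw, example_punct, target_raw):
    return (
        "Insert sentence boundaries and punctuation into the Urdu text.\n\n"
        f"Input: {example_raw}\n"
        f"Output: {example_punct}\n\n"
        f"Input: {target_raw}\n"
        "Output:"
    )

prompt = build_one_shot_prompt(
    example_raw="<unpunctuated Urdu example>",
    example_punct="<correctly punctuated Urdu example>",
    target_raw="<transcribed Urdu text to punctuate>",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```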

On the other hand, fine-tuning the Llama2-7B model with the QLoRA method did not produce encouraging results. The punctuated Urdu sentences generated by the fine-tuned model did not always align with the reference sentences and sometimes contained words that do not belong to Urdu.
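For reference, a minimal QLoRA setup of this kind (a 4-bit base model plus LoRA adapters via the peft library) might look as follows; the Llama-2 checkpoint name, adapter rank, and target modules are assumed values, not the project's reported configuration.

```python
# QLoRA setup sketch: 4-bit quantized base model + LoRA adapters via peft.
# Checkpoint name, LoRA rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)  # prepare quantized weights for training

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections typically adapted
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Training would then proceed with a standard causal-LM trainer on
# (unpunctuated input, punctuated target) Urdu sentence pairs.
```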

Document Type

Restricted Access

Submission Type

Research Project

The full text of this document is only accessible to authorized users.
