On Leveraging Quantized LLMs for Sentence Boundary Prediction in Transcribed Urdu Text

Student Name

Syed Asad Rizvi

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty / School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2023

Supervisor

Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi

Abstract

In speech-to-text applications, the generated transcriptions often contain textual issues such as improper placement of punctuation marks and the absence of sentence boundaries. The issue is more pronounced for low-resource languages such as Urdu. Researchers have tried rule-based and machine learning methods with limited success, and no work has been reported that employs Large Language Models (LLMs).

This project, therefore, aims to address the Urdu transcription problem using LLMs. Due to resource limitations, only smaller LLMs (with 13 billion or fewer parameters) are compared for this task.
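As a minimal sketch of how a model of this size can be loaded within such resource limits, the snippet below loads a 13B checkpoint in 4-bit precision with Hugging Face transformers and bitsandbytes; the checkpoint name and quantization settings are illustrative assumptions, not the project's exact configuration.

```python
# Minimal sketch: loading a ~13B-parameter model in 4-bit precision with
# Hugging Face transformers + bitsandbytes. The checkpoint and quantization
# settings below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "lmsys/vicuna-13b-v1.5"  # assumed checkpoint; any <=13B model fits this setup

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit a single consumer GPU
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```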

Different techniques have been implemented to enhance sentence boundary detection quality in Urdu text. In particular, one-shot learning and fine-tuning have been used to obtain correctly punctuated Urdu sentences. LLMs such as Llama2, Vicuna, and Falcon were used in the one-shot learning experiments, and the results suggest that the Vicuna 13B model performed best among all the tested models.
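The sketch below illustrates the general shape of a one-shot prompt for this task, reusing the quantized model and tokenizer loaded above; the instruction wording, the placeholder example pair, and the generation settings are assumptions for illustration, not the exact prompt used in the experiments.

```python
# One-shot prompt sketch: a single solved example followed by the unpunctuated
# input. The instruction wording and placeholders are illustrative only.
def build_one_shot_prompt(example_raw, example_punct, target_raw):
    return (
        "Insert sentence boundaries and punctuation into the Urdu text.\n\n"
        f"Input: {example_raw}\n"
        f"Output: {example_punct}\n\n"
        f"Input: {target_raw}\n"
        "Output:"
    )

prompt = build_one_shot_prompt(
    example_raw="<unpunctuated Urdu example>",
    example_punct="<correctly punctuated Urdu example>",
    target_raw="<transcribed Urdu text to punctuate>",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```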

On the other hand, fine-tuning the Llama2-7B model with the QLoRA method did not produce encouraging results. The punctuated Urdu sentences generated by the fine-tuned model did not always align with the reference sentences and sometimes contained words that do not belong to Urdu.
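For reference, a minimal QLoRA setup of this kind (a 4-bit base model plus LoRA adapters via the peft library) might look as follows; the Llama-2 checkpoint name, adapter rank, and target modules are assumed values, not the project's reported configuration.

```python
# QLoRA setup sketch: 4-bit quantized base model + LoRA adapters via peft.
# Checkpoint name, LoRA rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)  # prepare quantized weights for training

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections typically adapted
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Training would then proceed with a standard causal-LM trainer on
# (unpunctuated input, punctuated target) Urdu sentence pairs.
```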

Document Type

Restricted Access

Submission Type

Research Project

The full text of this document is only accessible to authorized users.
