Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Sajjad Haider, Professor, Department of Computer Science

Keywords

Urdu Automatic Speech Recognition, Whisper v3, LoRA Fine-Tuning, Parameter-Efficient Fine-Tuning, Small Language Model Post-Processing, Multi-Speaker Evaluation, Low-Resource ASR, Word Error Rate

Abstract

This project focuses on improving Automatic Speech Recognition (ASR) for Urdu, a low-resource language where existing models still struggle with accents, background noise, code-switching, and inconsistent Urdu spellings. We fine-tuned OpenAI Whisper Large-v3 and Whisper Large-v3-Turbo using LoRA on Urdu datasets including Common Voice v23, FLEURS, and CSaLT. The system was evaluated not only on benchmark datasets but also on a real-world YouTube evaluation set containing two-speaker Urdu news discussions with noise, interruptions, and natural speech patterns. We also tested decoder-level optimization and Small Language Model (SLM) based post-processing to reduce transcription errors and improve spelling consistency. The project contributes updated Urdu ASR benchmarks, a realistic multi-speaker evaluation setup, and an end-to-end transcription pipeline aimed at making Urdu audio easier to transcribe, search, and analyze, with a reported 11% relative improvement in accuracy.

Tools and Technologies Used

Python, PyTorch, Hugging Face Transformers, OpenAI Whisper Large-v3, Whisper Large-v3-Turbo, LoRA, PEFT, Unsloth, jiwer, FFmpeg, Common Voice v23, FLEURS, CSaLT, YouTube audio dataset, Qwen3-14B, Tiny Aya Fire, Qalb-1.0-8B-Instruct, Gemma-2-9B, FastAPI, HTML, CSS, JavaScript, GitHub, Ubuntu VM, NVIDIA A40 GPU.

Methodology

We first reviewed and replicated prior Urdu ASR benchmarking work to understand existing model performance and limitations. The experiments were conducted on an Ubuntu 22.04.5 LTS environment with 32 vCPUs, 62 GB RAM, and one NVIDIA A40 GPU (48 GB). We prepared Urdu speech datasets from Common Voice v23, FLEURS, and CSaLT by converting audio files to mono, 16 kHz, 16-bit PCM format, normalizing the Urdu text, and creating consistent train, validation, and test splits. Whisper Large-v3 and Whisper Large-v3-Turbo were evaluated in zero-shot form and then fine-tuned using LoRA with mixed-precision FP16 training to adapt them to Urdu speech. For LoRA fine-tuning, Turbo used rank 32, alpha 64, and dropout 0.05, while Large-v3 used rank 16 and alpha 32. After fine-tuning, we performed decoder-level optimization by testing different inference settings such as beam width and length penalty. To evaluate real-world usefulness, we created a YouTube-based evaluation set containing 30 Urdu news clips with two speakers, background noise, and natural conversation flow. Finally, we applied Urdu-capable Small Language Models as a post-processing step to correct minor spelling and recognition errors without changing the meaning or structure of the transcript. Evaluation was performed using the jiwer library with consistent text normalization and metric calculation across all experiments. Performance was measured mainly using Word Error Rate (WER), while Character Error Rate (CER), Match Error Rate (MER), and Word Information Lost (WIL) were also used to evaluate SLM-based post-processing.
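
As a rough illustration of the main metric, the sketch below computes WER and CER from a word-level and character-level edit distance, plus a relative error reduction between two systems. It is a dependency-free approximation and not the project's evaluation code, which used jiwer. The function names, the choice to count spaces as characters in CER, and the example strings and error rates are all assumptions made for illustration and are not results from the project.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate over raw characters (spaces included, by assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(baseline_error, improved_error):
    """Fraction by which the error rate dropped relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - improved_error) / baseline_error


if __name__ == "__main__":
    # Toy strings: one substitution (b -> x) and one deletion (d) over 4 words.
    print(wer("a b c d", "a x c"))
    # Hypothetical error rates, chosen only to show the arithmetic.
    print(relative_reduction(0.40, 0.30))
```

Both transcripts would need the same text normalization before scoring, as the methodology notes, because differences in spelling conventions or punctuation otherwise count as errors.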

Document Type

Restricted Access

Submission Type

BSCS Final Year Project
