Degree
Bachelor of Science (Computer Science)
Department
Department of Computer Science
School
School of Mathematics and Computer Science (SMCS)
Advisor
Dr. Sajjad Haider, Professor, Department of Computer Science
Keywords
Urdu Automatic Speech Recognition, Whisper v3, LoRA Fine-Tuning, Parameter-Efficient Fine-Tuning, Small Language Model Post-Processing, Multi-Speaker Evaluation, Low-Resource ASR, Word Error Rate
Abstract
This project focuses on improving Automatic Speech Recognition (ASR) for Urdu, a low-resource language where existing models still struggle with accents, background noise, code-switching, and inconsistent Urdu spellings. We fine-tuned OpenAI Whisper Large-v3 and Whisper Large-v3-Turbo using LoRA on Urdu datasets including Common Voice v23, FLEURS, and CSaLT. The system was evaluated not only on benchmark datasets but also on a real-world YouTube evaluation set containing two-speaker Urdu news discussions with noise, interruptions, and natural speech patterns. We also tested decoder-level optimization and Small Language Model (SLM) based post-processing to reduce transcription errors and improve spelling consistency. The project contributes updated Urdu ASR benchmarks, a realistic multi-speaker evaluation setup, and an end-to-end transcription pipeline that makes Urdu audio easier to transcribe, search, and analyze, with an 11% relative improvement in accuracy.
Tools and Technologies Used
Python, PyTorch, Hugging Face Transformers, OpenAI Whisper Large-v3, Whisper Large-v3-Turbo, LoRA, PEFT, Unsloth, jiwer, FFmpeg, Common Voice v23, FLEURS, CSaLT, YouTube audio dataset, Qwen3-14B, Tiny Aya Fire, Qalb-1.0-8B-Instruct, Gemma-2-9B, FastAPI, HTML, CSS, JavaScript, GitHub, Ubuntu VM, NVIDIA A40 GPU.
Methodology
We first reviewed and replicated prior Urdu ASR benchmarking work to understand existing model performance and limitations. The experiments were conducted on an Ubuntu 22.04.5 LTS GPU environment with 32 vCPUs, 62 GB RAM, and one NVIDIA A40 48 GB GPU. We prepared Urdu speech datasets from Common Voice v23, FLEURS, and CSaLT by converting audio files to mono, 16 kHz, 16-bit PCM format, normalizing the Urdu text, and creating consistent train, validation, and test splits. Whisper Large-v3 and Whisper Large-v3-Turbo were first evaluated zero-shot and then fine-tuned using LoRA with mixed-precision FP16 training to adapt them to Urdu speech. For LoRA fine-tuning, Turbo used rank 32, alpha 64, and dropout 0.05, while Large-v3 used rank 16 and alpha 32. After fine-tuning, we performed decoder-level optimization by testing different inference settings such as beam width and length penalty. To evaluate real-world usefulness, we created a YouTube-based evaluation set containing 30 Urdu news clips with two speakers, background noise, and natural conversation flow. Finally, we applied Urdu-capable Small Language Models as a post-processing step to correct minor spelling and recognition errors without changing the meaning or structure of the transcript. Evaluation was performed using the jiwer library with consistent text normalization and metric calculation across all experiments. Performance was measured mainly using Word Error Rate (WER), while Character Error Rate (CER), Match Error Rate (MER), and Word Information Lost (WIL) were also used to evaluate SLM-based post-processing.
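The audio preparation and WER scoring steps described above can be illustrated with a minimal sketch. It is not the project's code: the function names `ffmpeg_convert_args` and `word_error_rate` and the file names are hypothetical, and the WER function is a plain-Python stand-in for what the jiwer library computes, written out only to show the underlying calculation (word-level edit distance divided by the number of reference words). Text normalization is assumed to have been applied to both strings beforehand.

```python
from typing import List


def ffmpeg_convert_args(src: str, dst: str) -> List[str]:
    """Build an FFmpeg command that writes mono, 16 kHz, 16-bit PCM WAV.

    The command is only constructed here, not executed; it could be passed
    to subprocess.run(...) on a machine where FFmpeg is installed.
    """
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,             # input audio file
        "-ac", "1",            # one audio channel (mono)
        "-ar", "16000",        # 16 kHz sampling rate
        "-c:a", "pcm_s16le",   # signed 16-bit little-endian PCM
        dst,
    ]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(ffmpeg_convert_args("clip.mp3", "clip_16k.wav")))
    # Illustrative romanized strings, not taken from the evaluation data.
    print(word_error_rate("awaaz se alfaaz tak", "awaaz se alfaz"))
```

In the second example the hypothesis has one substituted word and one missing word against a four-word reference, so the score is 2/4. Because WER divides by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.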
Document Type
Restricted Access
Submission Type
BSCS Final Year Project
Recommended Citation
Ahmed, Z., Inayat, F., & Aqib, Z. (2026). Awaaz se Alfaaz: Enhancing Urdu ASR through Whisper Fine-Tuning and SLM-Based Post-Processing. Retrieved from https://ir.iba.edu.pk/fyp-bscs/50
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.