Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/School
School of Mathematics and Computer Science (SMCS)
Date of Submission
Fall 2024
Supervisor
Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi
Keywords
Sentence Boundary Detection (SBD), Natural Language Processing (NLP), Urdu Language, Low-Resource Language, Transcribed Text, Punctuation Absence.
Abstract
Sentence boundary detection (SBD) is a critical task in natural language processing (NLP), enabling accurate segmentation of text for downstream applications such as machine translation, summarization, and question answering. This project addresses SBD for Urdu, a low-resource language with distinctive grammatical structures and a complex script. The work targets transcribed Urdu text, which typically lacks punctuation and therefore makes sentence boundaries hard to identify. This setting reflects real-world scenarios such as speech-to-text output and informal digital communication, where sentence segmentation must rely on linguistic features rather than punctuation cues. Several approaches were explored. A rule-based method leveraging linguistic patterns and part-of-speech (POS) tagging was used first, to understand the data and the nature of the issues related to SBD. Statistical models, including Decision Trees, Random Forest, Logistic Regression, and XGBoost, were then trained on features such as XPOS and UPOS tags, among others. Deep learning models, including feedforward CNNs and LSTMs, were also applied. Feature selection techniques were employed to optimize performance, and all models were evaluated on an imbalanced dataset with two target classes. The best model, XGBoost, achieved an F-measure of 0.73. The findings show that the absence of punctuation poses significant challenges, yet meaningful improvements can be achieved through careful feature engineering and model selection. The statistical models also proved interpretable and efficient. This research contributes to the growing body of work on low-resource languages and establishes a foundation for practical applications in Urdu NLP.
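As an illustration only, the sketch below shows one way the setup described in the abstract could be framed: each token becomes a binary example (boundary or not), described by the UPOS tags in a small window around it, and predictions are scored with the F-measure on the boundary class. The tag sequence, the window size, and the helper names (`window_features`, `toy_rule_predict`, `f_measure`) are assumptions made for this example; they are not taken from the project, and the toy rule is not the author's rule-based method.

```python
from typing import Dict, List, Tuple


def window_features(upos: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """UPOS tags in a +/- `size` window around token i (hypothetical feature set)."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        # Positions outside the text are padded so every example has the same keys.
        feats[f"upos[{off:+d}]"] = upos[j] if 0 <= j < len(upos) else "PAD"
    return feats


def toy_rule_predict(upos: List[str]) -> List[int]:
    """Toy rule (assumption): mark a boundary after the last token of a verb group.

    Urdu is generally verb-final, so a VERB/AUX token not followed by another
    VERB/AUX is treated here as a likely sentence end.
    """
    verbal = {"VERB", "AUX"}
    preds = []
    for i in range(len(upos)):
        f = window_features(upos, i, size=1)
        is_end = f["upos[+0]"] in verbal and f["upos[+1]"] not in verbal
        preds.append(1 if is_end else 0)
    return preds


def f_measure(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (boundary) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Invented tag sequence for two unpunctuated sentences; not project data.
    upos = ["PRON", "NOUN", "VERB", "AUX", "PRON", "ADJ", "AUX"]
    gold = [0, 0, 0, 1, 0, 0, 1]  # 1 = sentence ends after this token

    print("rule-based:", f_measure(gold, toy_rule_predict(upos)))
    # Always predicting "no boundary" is right on most tokens but finds no
    # boundaries at all, so its boundary-class F1 is 0.0.
    print("all-zero baseline:", f_measure(gold, [0] * len(gold)))
```

Because boundary tokens are a small minority, the all-zero baseline is correct on most tokens yet scores an F1 of zero on the boundary class, which is one reason an F-measure is a more informative metric than accuracy on an imbalanced two-class dataset. In a fuller pipeline, the feature dictionaries would be encoded and passed to a classifier in place of the toy rule.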
Document Type
Restricted Access
Submission Type
Research Project
Recommended Citation
Nasim, K. (2024). Urdu Sentence Boundary Detection with Statistical Learning Approaches (Unpublished graduate research project). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/research-projects-msds/52
The full text of this document is only accessible to authorized users.
