Student Name

Kanza Nasim

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2024

Supervisor

Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi

Keywords

Sentence Boundary Detection (SBD), Natural Language Processing (NLP), Urdu Language, Low-Resource Language, Transcribed Text, Punctuation Absence.

Abstract

Sentence boundary detection (SBD) is a critical task in natural language processing (NLP), enabling accurate segmentation of text for downstream applications such as machine translation, summarization, and question answering. This project addresses SBD for Urdu, a low-resource language with unique grammatical structures and a complex script. It focuses on transcribed Urdu text, which typically lacks punctuation and therefore makes sentence boundaries hard to identify. This reflects real-world scenarios such as speech-to-text output and informal digital communication, where sentence segmentation must rely on linguistic features rather than punctuation cues. The project explored multiple approaches. A rule-based method leveraging linguistic patterns and part-of-speech (POS) tagging was used first to understand the data and the nature of the issues related to SBD. Statistical models, including Decision Trees, Random Forest, Logistic Regression, and XGBoost, used features such as XPOS and UPOS tags, among others. Deep learning models, including feedforward CNNs and LSTMs, were also applied. Feature selection techniques were employed to optimize performance, and experiments were conducted to evaluate the models on an imbalanced dataset with two target classes. The best model, XGBoost, achieved an F-measure of 0.73 (73%). The findings show that the absence of punctuation poses significant challenges, yet meaningful improvements can be achieved through careful feature engineering and model selection. Statistical models demonstrated interpretability and efficiency. This research contributes to the growing body of work on low-resource languages and establishes a foundation for practical applications in Urdu NLP.

Document Type

Restricted Access

Submission Type

Research Project

The full text of this document is only accessible to authorized users.
