Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2024

Supervisor

Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi

Keywords

Sentence Boundary Detection (SBD), Natural Language Processing (NLP), Urdu Language, Low-Resource Language, Transcribed Text, Punctuation Absence.

Abstract

Sentence boundary detection (SBD) is a critical task in natural language processing (NLP), enabling accurate segmentation of text for downstream applications such as machine translation, summarization, and question answering. This project addresses SBD for Urdu, a low-resource language with distinctive grammatical structures and a complex script. Specifically, it targets transcribed Urdu text, which typically lacks punctuation and therefore makes sentence boundaries hard to identify. This setting reflects real-world scenarios such as speech-to-text output and informal digital communication, where sentence segmentation must rely on linguistic features rather than punctuation cues. The project explored multiple approaches. A rule-based method leveraging linguistic patterns and part-of-speech (POS) tagging was used first, to understand the data and the nature of the issues related to SBD. Statistical models, including Decision Trees, Random Forest, Logistic Regression, and XGBoost, used features such as XPOS and UPOS tags, among others. Deep learning models, including feedforward CNNs and LSTMs, were also applied. Feature selection techniques were employed to optimize performance, and experiments were conducted to evaluate the models on an imbalanced dataset with two target classes. The best model, XGBoost, achieved an F-measure of 0.73 (73%). The findings show that the absence of punctuation poses significant challenges, yet meaningful improvements can be achieved through careful feature engineering and model selection. Statistical models demonstrated interpretability and efficiency. This research contributes to the growing body of work on low-resource languages and establishes a foundation for practical applications in Urdu NLP.
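As a rough illustration of the setup the abstract describes, the sketch below frames SBD as a per-token binary decision over POS-tag context and scores it with the F-measure of the boundary class. It is not the project's code: the window size, the feature names, the `PAD` symbol, and the toy tag sequence and labels are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def window_features(upos: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """UPOS tags in a +/- `size` window around token i, padded at the edges.

    The window size and the "PAD" symbol are illustrative assumptions.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"upos[{offset:+d}]"] = upos[j] if 0 <= j < len(upos) else "PAD"
    return feats


def f_measure(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (boundary) class only."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy, made-up example: 1 marks a token that ends a sentence.
upos = ["PRON", "NOUN", "VERB", "AUX", "PRON", "VERB"]
gold = [0, 0, 0, 1, 0, 1]
pred = [0, 0, 1, 1, 0, 0]  # hypothetical classifier output

first_token_feats = window_features(upos, 0, size=1)
scores = f_measure(gold, pred)
```

Scoring only the boundary class matters on an imbalanced dataset like the one the abstract mentions: most tokens are not boundaries, so plain accuracy would look high even for a model that never predicts one.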

Document Type

Restricted Access

Submission Type

Research Project

Recommended Citation

Nasim, K. (2024). Urdu Sentence Boundary Detection with Statistical Learning Approaches (Unpublished graduate research project). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/research-projects-msds/52

Download

The full text of this document is only accessible to authorized users.

Urdu Sentence Boundary Detection with Statistical Learning Approaches
