Student Name

Kulsoom KhanFollow

Date of Submission

Spring 2025

Supervisor

Dr. Xiaorui Jiang, Lecturer, Information School, University of Sheffield

Co-Supervisor

Dr. Sajjad Haider, Co-Supervisor, Institute of Business Administration (IBA), Karachi

Committee Member 1

Dr. Sajjad Haider, Co-Supervisor, Department of Computer Science, Institute of Business Administration, Karachi

Committee Member 2

Dr. Atif Tahir, Examiner – I Department of Computer Science, Institute of Business Administration (IBA), Karachi

Committee Member 3

Dr. Tahir Syed, Examiner – II Department of Computer Science, Institute of Business Administration (IBA), Karachi

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/ School

School of Mathematics and Computer Science (SMCS)

Keywords

Sequential Sentence Classification, Medical Evidence Extraction, Hierarchical Models, Rhetorical Role Classification, Software Requirement Specifications, Automated Medical Coding

Abstract

This thesis presents a novel hierarchical sequential classification model designed to address challenges in processing long and complex documents across multiple domains. The model has been applied to two distinct tasks: medical evidence extraction and software requirement classification. Both tasks involve binary classification, with the additional complexity of paired sentence classification for the software requirements dataset. For the medical domain, the study utilizes a subset of the MIMIC-III dataset, employing domain-specific BERT variants to extract textual evidence associated with International Classification of Diseases (ICD) codes. In the software engineering domain, the DRIP dataset is used to classify and analyze software requirements, leveraging paired sentence classification to identify semantic relationships between requirement pairs. Domain-specific BERT models, such as BioBERT, ClinicalBERT, and others, were fine-tuned for each dataset to improve task-specific performance. The hierarchical architecture allows the model to integrate sentence-level and document-level representations, effectively capturing both local and global context for sequential sentence classification tasks. Results demonstrate the model's ability to achieve decent accuracy in both domains, underscoring the potential of hierarchical approaches combined with domain-specific pretrained language models. This work contributes to advancing the state-of-the-art in document-level sequence classification by bridging gaps in medical and software engineering applications.

Document Type

Restricted Access

Submission Type

Thesis

Available for download on Tuesday, April 28, 2026

Share

COinS