Master of Science in Data Science


Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2023


Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration, Karachi


This project examines medical coding in the healthcare sector, focusing on the International Classification of Diseases, Tenth Revision (ICD-10) coding system. The main objective is a comparative analysis of several pre-trained language models against MEDCAT's SNOMED-CT model for multi-label classification of clinical notes. Using the MIMIC IV dataset, the project evaluates each method primarily on the 1,000 shortest clinical notes, predicting the first three characters of each ICD-10 code (its category). Every method is benchmarked against MEDCAT's SNOMED-CT model. SNOMED-CT performs most consistently, achieving a Macro-F1 score of 0.218, while BERT, the most successful transformer-based approach, attains a Macro-F1 score of 0.123. Although modest, this result is significant given the large number of classes to be predicted: around 24,000 for the overall MIMIC IV dataset and 700 for the testing dataset. Both of these approaches also report better metrics than related works. The analysis offers insight into how well various pre-trained language models and MEDCAT's SNOMED-CT model map clinical notes to ICD-10 codes across diverse text sizes and styles.
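The evaluation described in the abstract can be illustrated with a minimal sketch: full ICD-10 codes are reduced to their three-character category, and Macro-F1 is computed as the unweighted mean of per-label F1 scores over the label sets of each note. The function names `to_category` and `macro_f1` are hypothetical, the choice to average over every label seen in either the gold or the predicted sets is an assumption, and the project's actual implementation may differ.

```python
from collections import defaultdict


def to_category(code: str) -> str:
    """Reduce a full ICD-10 code (e.g. 'E11.9') to its three-character category ('E11')."""
    return code.replace(".", "").strip().upper()[:3]


def macro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Macro-F1 for multi-label data: per-label F1, averaged with equal weight.

    Assumption: the average runs over every label that appears in either
    the gold or the predicted sets.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for label in g & p:
            tp[label] += 1  # predicted and correct
        for label in p - g:
            fp[label] += 1  # predicted but not in the gold set
        for label in g - p:
            fn[label] += 1  # in the gold set but missed
    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative toy data only, not taken from MIMIC IV.
gold_notes = [{to_category(c) for c in ("I10", "E11.9")}, {to_category("J18.9")}]
pred_notes = [{"I10"}, {"J18", "E11"}]
score = macro_f1(gold_notes, pred_notes)
```

Because every label contributes equally regardless of frequency, rare categories that are never predicted correctly pull the average down, which is one reason Macro-F1 values tend to be low when hundreds of classes are in play.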

Document Type

Restricted Access

Submission Type

Research Project



The full text of this document is only accessible to authorized users.