Master of Science in Computer Science

Faculty / School

Faculty of Computer Sciences (FCS)


Department of Computer Science

Date of Submission



Dr. Sajjad Haider, Professor, Department of Computer Science, Institute of Business Administration (IBA), Karachi

Project Type

MSCS Survey Report


The Internet connects individuals, eliminates distance, and brings everyone closer. People now greet and help one another, share ideas, and make recommendations on social media. Platforms such as Facebook, Twitter, and Instagram have become a major channel for exchanging information.

According to recent statistics, Facebook has around 1 billion users and Twitter has more than 300 million. Users of these platforms generate massive amounts of data as they exchange ideas through comments, chats, and likes/dislikes on posts, pictures, new places, or newly launched campaigns. This data captures both positive and negative opinions; it helps us understand the user base and identify what matters and what does not. For instance, reviews of movies and places can help us pick the right film or destination.

In Pakistan, there is a large base of users who rely on social media platforms for entertainment or for sharing ideas. Most people use English, Urdu, or Roman Urdu to exchange information on these platforms. This research survey analyzes approaches from previously published work, runs experiments, and identifies the techniques and algorithms that work best for social media comments written in English, Urdu, and Roman Urdu.

In this work, a dataset of Urdu and Roman-Urdu text was gathered to build a machine learning model for sentiment classification. Results show that the SVM and Naïve Bayes algorithms performed well, achieving 79% accuracy on Roman-Urdu sentences and 72% on Urdu sentences.


The present work performed sentiment analysis on local languages. Several machine learning models were applied to text datasets in Urdu and Roman Urdu, and different feature-extraction techniques were explored. Previously reported work also achieved good results, but its main drawback was that it was trained and tested on small amounts of data; the present work used a larger dataset. The results show that as the data grows, the complexity grows with it. The initial models, trained with Naive Bayes and Support Vector Machines, give better results in combination with TF-IDF vectorization, count vectorization, and n-grams, and achieved more than 70% accuracy on each dataset (Urdu and Roman Urdu).
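The pipeline described above (TF-IDF weighted n-gram features feeding Naive Bayes and a linear SVM) can be sketched with scikit-learn as follows. The few Roman-Urdu comments and their labels are hypothetical placeholders for illustration only, not samples from the dataset used in this work.

```python
# Sketch of the reported setup: TF-IDF over word n-grams as features,
# classified with Naive Bayes and a linear SVM (assumed scikit-learn API).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labelled Roman-Urdu comments (1 = positive, 0 = negative).
comments = [
    "yeh movie bohat achi thi",
    "khana zabardast tha",
    "service bohat buri thi",
    "yeh jagah pasand nahi ayi",
]
labels = [1, 1, 0, 0]

# Unigram + bigram TF-IDF features, one pipeline per classifier.
for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", clf),
    ])
    model.fit(comments, labels)
    print(name, model.predict(["khana achi thi"]))
```

In practice the survey's models would be trained on the full labelled corpus and evaluated with a held-out test split; a CountVectorizer can be swapped in for the TfidfVectorizer step to compare the two feature sets.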

The main goal of this survey was to analyze the existing research on the Urdu and Roman Urdu languages and to build a data model trained on a large dataset in order to assess its efficiency. The techniques and approaches used in this work were drawn from prior research in the area of sentiment analysis.
