Master of Science in Computer Science


Department of Computer Science

Faculty / School

Faculty of Computer Sciences (FCS)

Date of Submission



Dr. Muhammad Sarim, Visiting Faculty, Department of Computer Science

Document type

MSCS Survey Report


This survey explores the challenges and intricacies of developing optical character recognition (OCR) systems for English, Arabic, Urdu, and Chinese, and discusses the solutions proposed to overcome them. Extensive research has been done on recognizing English and Chinese text, whereas Arabic and Urdu recognition remains an active area of research. A basic recognition system has six stages: data acquisition, preprocessing, segmentation, feature extraction, classification, and post-processing. Building such a system for English is comparatively easy, but challenges remain. The main one comes from the input, which can be noisy or distorted; the variety of English fonts is another. OCR for handwritten English is more complicated still, because the input text varies in dimensions and is less uniform than printed text. Urdu poses additional challenges that stem from its writing style. Urdu script is cursive and context-sensitive, meaning that each character can take 2 to 4 different shapes depending on its position within the word. Characters are distinguished from one another by the number of dots they carry, and various diacritic marks must also be identified. Urdu is written in a diagonal fashion, without the horizontal baseline found in English, and there is intra-word and intra-ligature overlapping. Further challenges are discussed in later sections. For Chinese, the challenges are a huge character set, with a total of 3755 characters, many of which are very similar to one another and difficult to differentiate, and a shortage of labeled data for recognizing Chinese text.
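The position-dependent shapes mentioned in the abstract can be illustrated with a minimal sketch that is not taken from the survey itself. It assumes that Unicode's Arabic Presentation Forms blocks are an acceptable stand-in for "shape by position": each presentation form carries a compatibility decomposition tagged as isolated, final, initial, or medial. The function name `contextual_forms` is hypothetical, and real Nastaliq rendering of Urdu involves many more context-dependent shapes than these code points capture.

```python
import unicodedata
from collections import defaultdict

# Decomposition tags Unicode uses for position-dependent Arabic-script forms.
POSITION_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def contextual_forms():
    """Map each base letter to its positional presentation forms.

    Scans Arabic Presentation Forms-A (U+FB50..U+FDFF) and
    Forms-B (U+FE70..U+FEFF). Only single-letter forms are kept;
    ligatures, whose decompositions list several code points, are skipped.
    """
    forms = defaultdict(dict)
    blocks = (range(0xFB50, 0xFE00), range(0xFE70, 0xFF00))
    for block in blocks:
        for cp in block:
            # Unassigned code points yield an empty decomposition string.
            parts = unicodedata.decomposition(chr(cp)).split()
            if len(parts) == 2 and parts[0] in POSITION_TAGS:
                base = chr(int(parts[1], 16))
                position = parts[0].strip("<>")
                forms[base][position] = chr(cp)
    return forms

if __name__ == "__main__":
    table = contextual_forms()
    # Alef (U+0627), beh (U+0628), and peh (U+067E, used in Urdu).
    for letter in ("\u0627", "\u0628", "\u067e"):
        name = unicodedata.name(letter)
        print(name, sorted(table[letter]))
```

Run as a script, this lists the positions in which each of the three sample letters has a distinct encoded shape. Alef has only isolated and final forms, while beh and peh have all four, which matches the "2 to 4 different shapes" range stated above.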
