
Degree

Doctor of Philosophy in Computer Science

Department

Department of Computer Science

Date of Award

Fall 2011

Advisor

Dr. Sajjad Haider

Committee Member 1

Dr. Asim ur Rehman, National University of Computer and Emerging Sciences (NUCES), Karachi, Pakistan

Committee Member 2

Dr. Sharifullah Khan, National University of Sciences and Technology (NUST), Islamabad, Pakistan

Project Type

Dissertation

Access Type

Restricted Access

Pages

xv, 146

Abstract

The Internet has become an attractive avenue for global e-business, e-learning, and knowledge sharing. Due to the continuous increase in the volume of web content, however, it is not practical for a user to extract information by browsing and integrating data from the huge number of web sources retrieved by existing search engines. Semantic web technology aims to address this and many other information-extraction issues by providing a suite of tools for integrating data from different sources. To take full advantage of the semantic web, however, it is necessary to annotate existing web pages with semantics. Another difficulty that arises when accessing information over the web is the presence of data in unstructured, ungrammatical and incoherent formats, such as online advertisements, emails, and reports. This thesis addresses some of the concerns raised above and presents a semantic annotation framework that is capable of extracting relevant data from unstructured, ungrammatical and incoherent data sources and semantically annotating it. The framework is named BNOSA, and it employs an ontology and a Bayesian network to perform semantic annotation. As the data is unstructured and ungrammatical, the framework uses context keywords along with domain knowledge to locate the data of interest in relevant data sources. Because the amount of information available varies across web pages, the extracted data often contains missing values for certain variables of interest, or more than one value (conflicting values) for the same variable. In such situations it is desirable to predict the missing values and to resolve the conflicts by selecting the most relevant value. BNOSA employs Bayesian networks for missing value prediction and conflict resolution.
The framework is extensible, as it can dynamically link to any problem domain given a predefined ontology and a corresponding Bayesian network. Experiments have been conducted to analyze the performance of BNOSA on several problem domains. The corpora used in the experiments come from buying-and-selling websites where product information is entered by ordinary web users in a structure-free format. The results show that BNOSA performs better than other recently proposed semantic annotation frameworks.

The full text of this document is only accessible to authorized users.
