Doctor of Philosophy in Computer Science
Faculty / School
School of Mathematics and Computer Science (SMCS)
Department of Computer Science
Date of Award
Shakeel Ahmed Khoja
Committee Member 1
Shakeel Khoja, Professor, School of Mathematics and Computer Science (SMCS), Institute of Business Administration, Karachi
Committee Member 2
Dr. Basit Shafiq, LUMS, Lahore
Committee Member 3
Dr. Khalid Latif, COMSATS University, Islamabad
The link traversal approach over Linked Data promises to retrieve up-to-date results from a large collection of linked, dynamic, and distributed datasets across many domains by using a recursive URI lookup process in real time. The downside of this approach appears with query patterns whose subject is unbound, such as ?s rdf:type :Class, where the object is either a foreign URI (i.e., one belonging to a different domain or sub-domain than its corresponding subject) or a literal. Queries with such triple patterns are referred to as non-Linked Data Answerable Queries (non-LDaQ). These queries fail to start the traversal process because Linked Data sources are subject-centric and the objects are either inaccessible or non-dereferenceable, so they yield empty results. This research focuses on the identification and execution of non-LDaQ queries. A large corpus of real-world SPARQL query logs is analyzed to discover non-LDaQ queries. Then, two data source selection approaches are proposed for answering non-LDaQ queries live. The first approach uses a backlinking technique to extract backlinks from a dataset. Those backlinks are later used as seed URIs for initial data source selection during link traversal, which helps retrieve the missing subjects from the stored backlinks. Hence, non-LDaQ queries can be executed successfully with faster query execution times. However, finding and storing backlinks is cumbersome and time-consuming, and the approach cannot answer non-LDaQ queries in which the object is a literal. The second approach proposes Hybrid Query Execution (HQE), a mechanism for answering non-LDaQ queries. HQE splits a non-LDaQ query into two sub-queries. The first sub-query, which has an unbound subject and a foreign URI or literal as its object, is executed over a local index for initial data source selection to identify the missing subjects.
The results obtained for the first sub-query are injected into the second sub-query using a Jena parameterized SPARQL string, and the second sub-query is then executed live over Linked Data using the link traversal strategy. The index used by HQE is created by discovering the Most Frequent Predicates (MFPs) that occur in non-LDaQ queries obtained from real-world SPARQL query logs. HQE guarantees fresh, non-empty, and complete results in the shortest possible times. The proposed technique is evaluated using one of the latest real-world RDF dataset benchmarks and a handcrafted, customized benchmark known as HQBench. Two performance metrics, completeness of results and query execution time, are used to compare HQE with other approaches. The evaluation reveals that HQE retrieves non-empty results for 83% of non-LDaQ queries, at least five times more than the existing approaches.
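The two-step execution described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, which relies on Jena parameterized SPARQL strings. The tuple representation of triple patterns, the `LOCAL_INDEX` dictionary, the example URIs, the `home_host` parameter (standing in for the domain of the dataset being queried), and all helper names are hypothetical. The live link-traversal step is omitted.

```python
from urllib.parse import urlparse

# Assumption: a triple pattern is a (subject, predicate, object) tuple of strings.
# Variables start with "?", literals with a double quote, URIs are in <...>.

def is_variable(term):
    return term.startswith("?")

def is_literal(term):
    return term.startswith('"')

def host(uri):
    return urlparse(uri.strip("<>")).netloc

def is_non_ldaq_pattern(pattern, home_host):
    """True if the subject is unbound and the object is a literal or a foreign URI."""
    s, _, o = pattern
    if not is_variable(s) or is_variable(o):
        return False
    return is_literal(o) or host(o) != home_host

def split_query(patterns, home_host):
    """Split a query into the index sub-query and the live sub-query."""
    index_part = [t for t in patterns if is_non_ldaq_pattern(t, home_host)]
    live_part = [t for t in patterns if not is_non_ldaq_pattern(t, home_host)]
    return index_part, live_part

# Hypothetical local index: (predicate, object) -> subjects known to match.
LOCAL_INDEX = {
    ("rdf:type", "<http://xmlns.com/foaf/0.1/Person>"): [
        "<http://example.org/people/alice>",
        "<http://example.org/people/bob>",
    ],
}

def lookup_subjects(index_part, index):
    """Resolve the missing subjects; assumes all patterns share one subject variable."""
    result = None
    for _, p, o in index_part:
        found = set(index.get((p, o), []))
        result = found if result is None else result & found
    return sorted(result or [])

def inject_subjects(live_part, variable, subjects):
    """Bind the subject variable in the live sub-query, once per subject found."""
    return [
        [tuple(subj if term == variable else term for term in t) for t in live_part]
        for subj in subjects
    ]

if __name__ == "__main__":
    query = [
        ("?s", "rdf:type", "<http://xmlns.com/foaf/0.1/Person>"),
        ("?s", "foaf:name", "?name"),
    ]
    index_part, live_part = split_query(query, "example.org")
    seeds = lookup_subjects(index_part, LOCAL_INDEX)
    for bound in inject_subjects(live_part, "?s", seeds):
        print(bound)  # each bound sub-query now has a dereferenceable subject
```

In this sketch, each bound sub-query starts from a concrete subject URI, which is what allows a subject-centric traversal to begin.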
Bai, S. (2021). Hybrid query execution on linked data with complete results (Unpublished doctoral dissertation). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/etd/84
Available for download on Wednesday, February 17, 2027
The full text of this document is only accessible to authorized users.