Degree
Doctor of Philosophy in Computer Science
Faculty / School
School of Mathematics and Computer Science (SMCS)
Department
Department of Computer Science
Date of Award
Spring 2021
Advisor
Dr. Shakeel Khoja, Professor and Dean, School of Mathematics and Computer Science, Department of Computer Science
Committee Member 1
Dr. Shafay Shamail, LUMS, Lahore
Committee Member 2
Dr. Malik Muhammad Saad Missen, Islamia University, Bahawalpur
Project Type
Dissertation
Access Type
Restricted Access
Document Version
Final
Pages
xxiii, 174
Abstract
Retrieving mathematical expressions efficiently on the web is more complex than simple text search. It is possible only when the syntactic (for example, textual) and semantic (for example, structural) information of a mathematical expression is extracted properly and analyzed methodically. This research proposes a technique that indexes expressions together with their syntactic and semantic information. The proposed technique also improves the storage efficiency of the inverted index by encoding indexing terms in Braille Unicode.
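The abstract does not spell out the encoding scheme, so the following is only a minimal sketch of one way indexing terms could be mapped to Braille Unicode. It assumes each term has already been assigned a non-negative integer ID (a hypothetical vocabulary step) and writes that ID in base 256 using the 256 code points of the Unicode Braille Patterns block (U+2800 to U+28FF). The names `encode_id` and `decode_id` are illustrative and are not taken from the dissertation.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block
BRAILLE_SIZE = 256     # the block spans U+2800..U+28FF


def encode_id(term_id: int) -> str:
    """Write a non-negative term ID in base 256 using Braille characters.

    Assumption: terms were mapped to integer IDs beforehand.
    """
    if term_id < 0:
        raise ValueError("term_id must be non-negative")
    digits = []
    while True:
        term_id, remainder = divmod(term_id, BRAILLE_SIZE)
        digits.append(chr(BRAILLE_BASE + remainder))
        if term_id == 0:
            break
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(digits))


def decode_id(code: str) -> int:
    """Invert encode_id: read Braille characters as base-256 digits."""
    value = 0
    for ch in code:
        offset = ord(ch) - BRAILLE_BASE
        if not 0 <= offset < BRAILLE_SIZE:
            raise ValueError(f"not a Braille pattern character: {ch!r}")
        value = value * BRAILLE_SIZE + offset
    return value
```

Under these assumptions, `encode_id(5)` yields the single character U+2805, and `decode_id` recovers the original ID.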
Mathematical expressions are represented in Content MathML (CMML) for indexing. However, most collections of scientific documents contain mathematical expressions written in LaTeX math style. Therefore, a rule-based conversion technique, termed LaTeX Math Grammar (LMG), is developed to transform LaTeX math expressions into CMML.
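To illustrate what a rule-based conversion of this kind might look like, here is a small sketch for a tiny subset of LaTeX math: single-letter identifiers, integers, `+`, `^`, `\frac{..}{..}`, and braces. The grammar and the function `latex_to_cmml` are assumptions made for illustration; they are not the rules of LMG, which the abstract does not list. The output uses the standard Content MathML elements `apply`, `ci`, `cn`, `plus`, `power`, and `divide`.

```python
import re

# Tokens for the supported subset: \frac, single letters, integers, { } ^ +
TOKEN = re.compile(r"\\frac|[A-Za-z]|\d+|[{}^+]")


def tokenize(src):
    src = src.replace(" ", "")
    tokens = TOKEN.findall(src)
    if "".join(tokens) != src:  # something was skipped, so it is unsupported
        raise ValueError(f"unsupported input: {src!r}")
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):  # expr := term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = f"<apply><plus/>{node}{self.term()}</apply>"
        return node

    def term(self):  # term := factor ('^' factor)?
        base = self.factor()
        if self.peek() == "^":
            self.take("^")
            return f"<apply><power/>{base}{self.factor()}</apply>"
        return base

    def factor(self):  # factor := \frac group group | group | number | letter
        if self.peek() == "{":
            return self.group()
        tok = self.take()
        if tok == "\\frac":
            numerator = self.group()
            denominator = self.group()
            return f"<apply><divide/>{numerator}{denominator}</apply>"
        if tok.isdigit():
            return f"<cn>{tok}</cn>"
        if tok.isalpha():
            return f"<ci>{tok}</ci>"
        raise ValueError(f"unexpected token: {tok!r}")

    def group(self):  # group := '{' expr '}'
        self.take("{")
        node = self.expr()
        self.take("}")
        return node


def latex_to_cmml(src):
    parser = Parser(tokenize(src))
    node = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"trailing input at token {parser.pos}")
    return node
```

With this sketch, `latex_to_cmml(r"x^{2}")` produces `<apply><power/><ci>x</ci><cn>2</cn></apply>`. A realistic converter would need many more rules (operators, functions, subscripts, implicit multiplication), which are outside the scope of this illustration.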
A weighting function that assigns a weight to each indexing term is introduced to improve the ranking of retrieved documents. Each term's weight contributes to the ranking function, which raises the rank of documents that contain query terms. Multiple indices are created in a distributed environment to avoid storing one large inverted index in a centralized location. Additionally, a graphical user interface is developed so that both experienced and general users can use the system without difficulty.
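The idea of term weights feeding a ranking function can be sketched as follows. This is not the dissertation's weighting function, which the abstract does not define: the weights are supplied by the caller, and scoring a document as the sum of the weights of its matching query terms is an assumption chosen for simplicity. The names `build_index` and `rank` are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: weight}.

    `docs` maps a document ID to a list of (term, weight) pairs.
    If a term occurs several times in a document, the largest weight is kept
    (an arbitrary choice for this sketch).
    """
    index = defaultdict(dict)
    for doc_id, weighted_terms in docs.items():
        for term, weight in weighted_terms:
            previous = index[term].get(doc_id, 0.0)
            index[term][doc_id] = max(previous, weight)
    return index


def rank(index, query_terms):
    """Score each document by summing the weights of matching query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For instance, if document `d1` contains the terms `x^2` (weight 1.0) and `x` (weight 0.5) while `d2` contains only `x` (weight 0.5), the query `["x^2", "x"]` ranks `d1` above `d2`.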
The proposed technique has been evaluated on the Wikipedia and arXiv NTCIR-12 MathIR corpora. In addition, three sets of arXiv document dumps are used to test the performance of the system on a large collection of mathematical expressions.
The performance metrics are divided into two categories: retrieval performance and system execution performance. Retrieval performance is measured using the NTCIR-MathIR evaluation criteria. At the top 5 documents, the Wikipedia queries without wildcards yielded an nDCG of 49.02%, an MSnDCG of 49.66%, a Precision of 45.50%, an Average Precision (AP) of 49.32%, and an nERR of 65.69%. At the top 5 documents, the arXiv queries without text yielded an nDCG of 48.38%, an MSnDCG of 47.88%, a Precision of 44%, an AP of 34.83%, and an nERR of 56.20%. For system execution performance on an uncompressed index (without Braille encoding), 18.63 million formulae are stored per gigabyte of storage, 53.26 million formulae are indexed per hour, and the average search time of a query is 267 milliseconds. In contrast, on a compressed index (with Braille encoding), 45.10 million formulae are stored per gigabyte of storage, 49.66 million formulae are indexed per hour, and the average search time of a query is 347 milliseconds.
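For readers unfamiliar with nDCG at the top 5 documents, the sketch below shows one common formulation: graded relevance discounted by log2(rank + 1) and normalized by the ideal ordering. It is an illustration only. The official NTCIR-MathIR evaluation tooling may differ in details such as the gain function, and the MSnDCG and nERR variants are not covered here. The function names are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """nDCG@k for a single query.

    ranked_relevances: relevance grades of the returned documents, in rank order.
    all_relevances: relevance grades of every judged document for the query,
        used to build the ideal ranking.
    """
    ideal = sorted(all_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg(ranked_relevances[:k]) / ideal_dcg
```

A per-collection figure such as those reported above would typically be the mean of this value over all queries, though the abstract does not state the aggregation used.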
Link to Catalog Record
https://ils.iba.edu.pk/cgi-bin/koha/opac-detail.pl?biblionumber=114107
Recommended Citation
Hussain, S. (2021). Retrieval of mathematical information with syntactic and semantic structure (Unpublished doctoral dissertation). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/etd/88
The full text of this document is only accessible to authorized users.