Degree

Doctor of Philosophy in Computer Science

Faculty / School

School of Mathematics and Computer Science (SMCS)

Department

Department of Computer Science

Date of Award

Spring 2021

Advisor

Dr. Shakeel Khoja, Professor and Dean, School of Mathematics and Computer Science, Department of Computer Science

Committee Member 1

Dr. Shafay Shamail, LUMS, Lahore

Committee Member 2

Dr. Malik Muhammad Saad Missen, Islamia University, Bahawalpur

Project Type

Dissertation

Access Type

Restricted Access

Document Version

Final

Pages

xxiii, 174

Abstract

Efficient retrieval of mathematical expressions on the web is more complex than plain text search. It is possible only when the syntactic (for example, textual) and semantic (for example, structural) information of a mathematical expression is properly extracted and methodically analyzed. This research proposes a technique that indexes expressions together with their syntactic and semantic information. The proposed technique also improves the storage efficiency of the inverted index by encoding indexing terms in Braille Unicode.
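The abstract does not describe how indexing terms are mapped to Braille Unicode. As a minimal sketch of one possible reversible mapping, the code below assumes each UTF-8 byte of a term is mapped to one of the 256 code points in the Unicode Braille Patterns block (U+2800 to U+28FF). The function names `to_braille` and `from_braille` are hypothetical and are not taken from the dissertation.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block


def to_braille(term: str) -> str:
    """Map every UTF-8 byte of the term to one Braille pattern character."""
    return "".join(chr(BRAILLE_BASE + b) for b in term.encode("utf-8"))


def from_braille(encoded: str) -> str:
    """Invert to_braille: recover the bytes, then decode them as UTF-8."""
    return bytes(ord(ch) - BRAILLE_BASE for ch in encoded).decode("utf-8")


if __name__ == "__main__":
    term = "plus(x,1)"  # hypothetical indexing term
    encoded = to_braille(term)
    print(encoded, from_braille(encoded) == term)
```

This byte-per-cell mapping only shows that a term can round-trip through Braille characters. It does not save space on its own, since each Braille character occupies three bytes in UTF-8; the storage gain reported in the abstract depends on the dissertation's own encoding scheme.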

For indexing, mathematical expressions are represented in Content MathML (CMML). However, most scientific document collections contain mathematical expressions in LaTeX math notation. Therefore, a rule-based conversion technique, termed LaTeX Math Grammar (LMG), is developed to transform LaTeX math expressions into CMML.
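To illustrate what a rule-based LaTeX-to-CMML conversion can look like, the sketch below applies three hand-written rules to flat, non-nested expressions. It is not the LMG grammar itself, whose rules are not given in the abstract; the `RULES` table and the helper names `atom` and `latex_to_cmml` are assumptions made for this example.

```python
import re


def atom(tok: str) -> str:
    """Wrap a number in <cn> and any other token in <ci>."""
    tag = "cn" if re.fullmatch(r"\d+(\.\d+)?", tok) else "ci"
    return f"<{tag}>{tok}</{tag}>"


# Each rule pairs a pattern over a flat LaTeX string with a CMML operator.
# Operands are restricted to single identifiers or integers (no nesting).
RULES = [
    (re.compile(r"\\frac\{(\w+)\}\{(\w+)\}"), "divide"),
    (re.compile(r"(\w+)\^\{(\w+)\}"), "power"),
    (re.compile(r"(\w+)\+(\w+)"), "plus"),
]


def latex_to_cmml(expr: str) -> str:
    """Convert one flat LaTeX expression to a Content MathML fragment."""
    expr = expr.replace(" ", "")
    for pattern, op in RULES:
        m = pattern.fullmatch(expr)
        if m:
            return f"<apply><{op}/>{atom(m.group(1))}{atom(m.group(2))}</apply>"
    return atom(expr)  # fallback: treat the whole input as a single token


if __name__ == "__main__":
    print(latex_to_cmml(r"\frac{a}{2}"))
    print(latex_to_cmml("x^{2}"))
```

A full converter would need a recursive parser to handle nested operands; the flat regular expressions here are only meant to show the rule-to-operator idea.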

A weighting function that assigns a weight to each indexing term is introduced to improve the ranking of retrieved documents. The weight of each term contributes to the ranking function, which raises the rank of documents that contain query terms. Multiple indices are created in a distributed environment to avoid storing a large inverted index in a single centralized location. Additionally, a graphical user interface is developed so that both experienced and general users can use the system easily.
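The abstract does not give the weighting or ranking formulas. As a rough sketch of how per-term weights can feed a ranking function, the code below assumes a document's score is simply the sum of the weights of the query terms it contains. The function `rank`, the toy index, and the weight values are all hypothetical.

```python
from collections import defaultdict


def rank(query_terms, index, weights):
    """Score each document by the summed weights of the query terms it contains.

    index:   maps an indexing term to the document ids that contain it
    weights: maps an indexing term to its (assumed) weight
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += weights.get(term, 0.0)
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    index = {"plus": ["d1", "d2"], "x": ["d1"]}  # toy inverted index
    weights = {"plus": 1.0, "x": 0.5}            # made-up term weights
    print(rank(["plus", "x"], index, weights))
```

In a distributed setting, the same function could be run against each index shard and the per-shard result lists merged by score, although the abstract does not say how the dissertation combines results.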

The proposed technique has been evaluated on the Wikipedia and arXiv NTCIR-12 MathIR corpora. In addition, three sets of arXiv document dumps are used to test the performance of the system on a large collection of mathematical expressions.

The performance metrics are divided into two categories: retrieval performance and system execution performance. Retrieval performance is measured using the NTCIR-MathIR evaluation criteria. At the top 5 documents, the Wikipedia queries without wildcards yielded an nDCG of 49.02%, an MSnDCG of 49.66%, a Precision of 45.50%, an Average Precision (AP) of 49.32%, and an nERR of 65.69%. At the top 5 documents, the arXiv queries without text yielded an nDCG of 48.38%, an MSnDCG of 47.88%, a Precision of 44%, an AP of 34.83%, and an nERR of 56.20%. For system execution performance on an uncompressed index (that is, without Braille encoding), 18.63 million formulae are stored per gigabyte, 53.26 million formulae are indexed per hour, and the average search time of a query is 267 milliseconds. In contrast, on a compressed index (that is, with Braille encoding), 45.10 million formulae are stored per gigabyte, 49.66 million formulae are indexed per hour, and the average search time of a query is 347 milliseconds. The compressed index therefore stores about 2.4 times as many formulae per gigabyte, while indexing throughput drops by about 7% and average search time rises by about 30%.
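For readers unfamiliar with nDCG at a cutoff, the sketch below computes it using the common log2 rank discount. The exact gain values and discount used by the NTCIR-MathIR evaluation tools are not stated in the abstract, so the formula variant and the relevance grades in the example are assumptions.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, all_judged_gains, k=5):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_gains:     relevance grades of the returned documents, in rank order
    all_judged_gains: relevance grades of every judged document for the query
    """
    ideal = sorted(all_judged_gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical grades: 2 = relevant, 1 = partially relevant, 0 = not relevant.
    print(ndcg_at_k([1, 2, 0, 0, 1], [2, 2, 1, 1, 0, 0], k=5))
```

A reported figure such as an nDCG of 49.02% at the top 5 documents would correspond to averaging this kind of per-query score over all queries in the set.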

The full text of this document is only accessible to authorized users.
