## All Theses and Dissertations

## Degree

Doctor of Philosophy in Computer Science

## Faculty / School

School of Mathematics and Computer Science (SMCS)

## Department

Department of Computer Science

## Date of Award

Spring 2021

## Advisor

Dr. Shakeel Khoja, Professor and Dean, School of Mathematics and Computer Science, Department of Computer Science

## Committee Member 1

Dr. Shafay Shamail, LUMS, Lahore

## Committee Member 2

Dr. Malik Muhammad Saad Missen, Islamia University, Bahawalpur

## Project Type

Dissertation

## Access Type

Restricted Access

## Document Version

Final

## Pages

xxiii, 174

## Abstract

Retrieving mathematical expressions efficiently over the web is more complex than simple text search. It is possible only when the syntactic (for example, textual) and semantic (for example, structural) information of a mathematical expression is retrieved properly and analyzed methodically. This research proposes a technique that indexes expressions along with their syntactic and semantic information. The proposed technique also improves the storage efficiency of the inverted index by encoding indexing terms in Braille Unicode.
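
As a rough illustration of encoding indexing terms in Braille Unicode, the sketch below assigns each distinct term an integer ID and writes that ID in base 256 using the 256 code points of the Unicode Braille Patterns block (U+2800 to U+28FF). The abstract does not describe the actual encoding scheme, so the class name `BrailleTermCodec`, the ID assignment, and the base-256 layout are all assumptions made for illustration.

```python
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block
BRAILLE_SIZE = 256     # the block holds 256 patterns (U+2800..U+28FF)


class BrailleTermCodec:
    """Hypothetical dictionary coder: term -> integer ID -> Braille string."""

    def __init__(self):
        self._ids = {}    # term -> integer ID
        self._terms = []  # integer ID -> term

    def encode(self, term: str) -> str:
        # Assign the next free ID the first time a term is seen.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        n = self._ids[term]
        # Write the ID in base 256, one Braille character per digit.
        digits = [n % BRAILLE_SIZE]
        n //= BRAILLE_SIZE
        while n:
            digits.append(n % BRAILLE_SIZE)
            n //= BRAILLE_SIZE
        return "".join(chr(BRAILLE_BASE + d) for d in reversed(digits))

    def decode(self, code: str) -> str:
        # Read the Braille characters back as base-256 digits.
        n = 0
        for ch in code:
            n = n * BRAILLE_SIZE + (ord(ch) - BRAILLE_BASE)
        return self._terms[n]
```

Under these assumptions, a long indexing term is stored in the index as one or two Braille characters, and `decode` recovers the original term from the shared dictionary.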

The mathematical expressions are originally represented in Content MathML (CMML) for indexing. However, most collections of scientific documents contain mathematical expressions in LaTeX math style. Therefore, a rule-based conversion technique, termed LaTeX Math Grammar (LMG), is developed for transforming LaTeX math expressions into CMML.
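
To give a feel for what a rule-based LaTeX-to-CMML conversion involves, here is a minimal sketch that handles only single-letter identifiers, integers, `+`, `^`, and `\frac{...}{...}`. It is not the LMG described in the dissertation, whose rules are not given in this abstract; the function name `latex_to_cmml` and the tiny grammar are assumptions. The emitted elements (`apply`, `plus`, `power`, `divide`, `ci`, `cn`) are standard Content MathML.

```python
import re

# Tokens of the tiny subset: commands, integers, single letters, and { } ^ +
TOKEN = re.compile(r"\\[A-Za-z]+|\d+|[A-Za-z]|[{}^+]")


def latex_to_cmml(latex: str) -> str:
    """Convert a tiny LaTeX subset to Content MathML (illustrative only)."""
    tokens = TOKEN.findall(latex)
    # Reject input containing characters outside the supported subset.
    if "".join(tokens) != "".join(latex.split()):
        raise ValueError("input contains unsupported characters")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def group():
        # group := '{' expr '}'
        take("{")
        inner = expr()
        take("}")
        return inner

    def atom():
        # atom := group | '\frac' group group | NUMBER | LETTER
        tok = peek()
        if tok == "{":
            return group()
        if tok == "\\frac":
            take()
            num, den = group(), group()
            return f"<apply><divide/>{num}{den}</apply>"
        tok = take()
        if tok.isdigit():
            return f"<cn>{tok}</cn>"
        if tok.isalpha():
            return f"<ci>{tok}</ci>"
        raise ValueError(f"unsupported token {tok!r}")

    def term():
        # term := atom ('^' atom)?
        base = atom()
        if peek() == "^":
            take()
            return f"<apply><power/>{base}{atom()}</apply>"
        return base

    def expr():
        # expr := term ('+' term)*
        parts = [term()]
        while peek() == "+":
            take()
            parts.append(term())
        if len(parts) == 1:
            return parts[0]
        return "<apply><plus/>" + "".join(parts) + "</apply>"

    result = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

For example, `latex_to_cmml(r"\frac{a+1}{b^2}")` returns `<apply><divide/><apply><plus/><ci>a</ci><cn>1</cn></apply><apply><power/><ci>b</ci><cn>2</cn></apply></apply>`.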

A weighting function that assigns a weight to each indexing term is introduced to improve the ranking of retrieved documents. The weight of each term contributes to the ranking function, which raises the rank of documents that contain query terms. Multiple indices are created in a distributed environment to avoid storing one large inverted index in a centralized location. Additionally, a user-friendly graphical user interface is developed so that both experienced and general users can use the system without any hassle.
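
The interplay of term weights, ranking, and multiple indices can be sketched as follows. The abstract does not specify the weighting function or the ranking formula, so this sketch assumes the simplest possible choice: each posting carries a precomputed weight, and a document's score is the sum of the weights of the query terms it contains, accumulated across all index shards. The names `build_index` and `rank` are hypothetical.

```python
from collections import defaultdict


def build_index(postings):
    """Build one inverted index shard from (term, doc_id, weight) triples."""
    index = defaultdict(dict)
    for term, doc_id, weight in postings:
        # Keep the largest weight if a term occurs more than once in a document.
        index[term][doc_id] = max(weight, index[term].get(doc_id, 0.0))
    return index


def rank(shards, query_terms, top_k=5):
    """Sum the weights of matching query terms per document across all shards."""
    scores = defaultdict(float)
    for index in shards:
        for term in query_terms:
            for doc_id, weight in index.get(term, {}).items():
                scores[doc_id] += weight
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]
```

In a distributed deployment each shard would typically be queried on its own node and only the per-shard top results merged; the single loop here stands in for that step.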

The proposed technique has been evaluated on the Wikipedia and arXiv NTCIR-12 MathIR corpora. In addition, three sets of arXiv document dumps were selected to test the performance of the system on a large collection of mathematical expressions.

The performance metrics are divided into two categories: retrieval performance and system execution performance. Retrieval performance is measured using the NTCIR-MathIR evaluation criteria. The Wikipedia queries without wildcards resulted in an nDCG of 49.02%, an MSnDCG of 49.66%, a Precision of 45.50%, an Average Precision (AP) of 49.32%, and an nERR of 65.69% at the top 5 documents. The arXiv queries without text resulted in an nDCG of 48.38%, an MSnDCG of 47.88%, a Precision of 44%, an AP of 34.83%, and an nERR of 56.20% at the top 5 documents. For system execution performance on an uncompressed index (that is, without Braille encoding), 18.63 million formulae are stored per gigabyte of storage, 53.26 million formulae are indexed per hour, and the average search time of a query is 267 milliseconds. In contrast, on a compressed index (that is, with Braille encoding), 45.10 million formulae are stored per gigabyte of storage, 49.66 million formulae are indexed per hour, and the average search time of a query is 347 milliseconds.
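
For readers unfamiliar with the retrieval metrics named above, the following sketch computes Precision and nDCG at a cutoff of 5 from graded relevance judgments. It uses the common textbook form of nDCG (linear gain, log2 discount); the official NTCIR evaluation tools may use different gain values or discounting, so this is only an assumed approximation and not the procedure used in the dissertation. The function names are hypothetical.

```python
import math


def precision_at_k(relevance, k=5):
    """Fraction of the top-k results with a non-zero relevance grade."""
    return sum(1 for grade in relevance[:k] if grade > 0) / k


def dcg_at_k(relevance, k=5):
    """Discounted cumulative gain over the top-k results."""
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(grade / math.log2(i + 2) for i, grade in enumerate(relevance[:k]))


def ndcg_at_k(relevance, all_grades, k=5):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0
```

Here `relevance` lists the grades of the returned documents in rank order, and `all_grades` lists the grades of every judged document for the query, from which the ideal ordering is derived.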

## Link to Catalog Record

https://ils.iba.edu.pk/cgi-bin/koha/opac-detail.pl?biblionumber=114107

## Recommended Citation

Hussain, S. (2021). *Retrieval of mathematical information with syntactic and semantic structure* (Unpublished doctoral dissertation). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/etd/88

The full text of this document is only accessible to authorized users.