Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/School
School of Mathematics and Computer Science (SMCS)
Date of Submission
Fall 2024
Supervisor
Dr. Muhammad Sarim, Visiting Faculty, Department of Computer Science, School of Mathematics and Computer Science (SMCS)
Keywords
Product Listing, E-commerce, Natural Language Processing (NLP), Large Language Models (LLMs), BERT Embeddings, Cosine Similarity, Policy Compliance, Title Correction, Image Verification, GPT-4, Fuzzy Matching, Regular Expressions, Pretrained Models
Abstract
This study aims to optimize the product listing process on e-commerce platforms by ensuring that all listings comply with social norms, government policies, and product listing guidelines. The project leverages automation to streamline the product listing procedure, reducing labor hours, minimizing the time required for new products to go live, and mitigating the risk of human error. The methodology uses Natural Language Processing (NLP) techniques to test product titles for prohibited keywords or non-dictionary terms that may be perceived as manipulations intended to bypass system checks. The process begins by uploading the product listing policy guidelines, followed by an evaluation to assess whether the product listings align with these policies. Any non-compliant products are then flagged. To evaluate product titles, the project employs cosine similarity with BERT embeddings to compare title words with a list of prohibited keywords, identifying high-risk titles. These titles are further analyzed and refined using Large Language Models (LLMs) such as GPT, which also generate corresponding descriptions. The revised titles are then tested against the listing policy to ensure compliance with the requirements for publication on the live site. Techniques such as regular expressions, cosine similarity, and fuzzy matching are employed to identify the highest-risk titles, so that only those titles are passed to the LLM stages, which reduces operational costs in subsequent stages. Several pretrained text-to-text models, including OpenAI GPT-4, GPT-4o mini, Mistral, and Llama, are used for title correction and analysis. In the secondary phase, the system verifies product images to ensure they do not feature prohibited content. By generating a description of each image, the system compares it against the product listing policy to identify any prohibited items.
This step helps detect instances where sellers may attempt to bypass quality control by submitting images that do not correspond with the product title. The system not only refines product titles and listings but also provides valuable insights into their compliance with listing policies. In the case of a policy violation, the system identifies the specific policy being violated and offers actionable feedback, ensuring that the listings are appropriate for publication on the live site.
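The title-screening stage described in the abstract (regular expressions, fuzzy matching, and cosine similarity against a prohibited-keyword list) can be illustrated with a minimal sketch. This is not the project's implementation: the keyword list, the 0.8 threshold, and the function names are hypothetical, and character-trigram count vectors stand in for BERT embeddings so that the example runs with the Python standard library alone.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical prohibited keywords; a real list would come from the listing policy.
PROHIBITED = ["replica", "firearm", "steroid"]


def tokens(title):
    """Split a title into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def regex_hit(title):
    """Stage 1: exact whole-word match against the prohibited list."""
    pattern = r"\b(?:" + "|".join(map(re.escape, PROHIBITED)) + r")\b"
    return re.search(pattern, title.lower()) is not None


def fuzzy(a, b):
    """Stage 2: edit-based similarity in [0, 1], catches small obfuscations."""
    return SequenceMatcher(None, a, b).ratio()


def trigram_vector(word):
    """Character-trigram counts (assumed stand-in for a BERT embedding)."""
    if len(word) < 3:
        return Counter([word])
    return Counter(word[i:i + 3] for i in range(len(word) - 2))


def cosine(u, v):
    """Stage 3: cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def risk_score(title):
    """Return 1.0 on an exact hit, else the best fuzzy or cosine score."""
    if regex_hit(title):
        return 1.0
    best = 0.0
    for tok in tokens(title):
        for kw in PROHIBITED:
            sim = max(fuzzy(tok, kw), cosine(trigram_vector(tok), trigram_vector(kw)))
            best = max(best, sim)
    return best


def is_high_risk(title, threshold=0.8):
    """Flag titles whose score meets the (hypothetical) threshold."""
    return risk_score(title) >= threshold


for t in ["Replica designer watch", "R3plica designer watch", "Cotton bed sheet"]:
    print(t, "->", "high risk" if is_high_risk(t) else "ok")
```

Running the cheap checks first means only the titles flagged as high risk need to be sent on to an LLM for correction, which is where the abstract locates the cost savings.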
Document Type
Restricted Access
Submission Type
Research Project
Recommended Citation
Khan, A. (2024). Automated Product Filtering Using Large Language Models by Enhanced Similarity Detection in E-Commerce Catalog (Unpublished graduate research project). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/research-projects-msds/46
The full text of this document is only accessible to authorized users.