Student Name

AbdulRehman Khan

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Date of Submission

Fall 2024

Supervisor

Dr. Muhammad Sarim, Visiting Faculty, Department of Computer Science, School of Mathematics and Computer Science (SMCS)

Keywords

Product Listing, E-commerce, Natural Language Processing (NLP), Large Language Models (LLMs), BERT Embeddings, Cosine Similarity, Policy Compliance, Title Correction, Image Verification, GPT-4, Fuzzy Matching, Regular Expressions, Pretrained Models

Abstract

This study aims to optimize the product listing process on e-commerce platforms by ensuring that all listings comply with social norms, government policies, and product listing guidelines. The project automates the listing procedure to reduce labor hours, shorten the time required for new products to go live, and lower the risk of human error. The methodology uses Natural Language Processing (NLP) techniques to screen product titles for prohibited keywords and for non-dictionary terms that may be deliberate manipulations intended to bypass system checks. The process begins with uploading the product listing policy guidelines; each product listing is then evaluated against these policies, and non-compliant products are flagged. To evaluate product titles, the project computes cosine similarity between BERT embeddings of title words and a list of prohibited keywords, identifying high-risk titles. These titles are further analyzed and refined using Large Language Models (LLMs) such as GPT, which also generate corresponding descriptions. The revised titles are then tested again against the listing policy to confirm that they meet the requirements for publication on the live site. Regular expressions, cosine similarity, and fuzzy matching are used to identify the highest-risk titles, which reduces operational costs in subsequent stages. Several pretrained text-to-text models, including OpenAI GPT-4, GPT-4o mini, Mistral, and Llama, are used for title correction and analysis. In the second phase, the system verifies product images to ensure they do not feature prohibited content. It generates a description of each image and compares that description against the product listing policy to identify any prohibited items. This step helps detect cases where sellers attempt to bypass quality control by submitting images that do not correspond to the product title. The system not only refines product titles and listings but also reports on their compliance with listing policies. When a policy is violated, the system identifies the specific policy and offers actionable feedback, so that listings are appropriate for publication on the live site.
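
The title-screening stage described in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the keyword list, the thresholds, and the function names are hypothetical, and character trigram counts stand in for BERT embeddings so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical keyword list; a real system would derive it from the listing policy.
PROHIBITED = ["weapon", "tobacco", "replica"]


def char_ngrams(word, n=3):
    """Character n-gram counts, used here as a stand-in for a BERT embedding."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def screen_title(title, fuzzy_cutoff=0.8, cosine_cutoff=0.6):
    """Return (word, keyword, method) triples for title words that look prohibited.

    The cutoffs are illustrative assumptions, not values from the study.
    """
    hits = []
    # A regular expression splits the title into lowercase alphanumeric tokens.
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        for keyword in PROHIBITED:
            if word == keyword:
                hits.append((word, keyword, "exact"))
            elif SequenceMatcher(None, word, keyword).ratio() >= fuzzy_cutoff:
                hits.append((word, keyword, "fuzzy"))
            elif cosine(char_ngrams(word), char_ngrams(keyword)) >= cosine_cutoff:
                hits.append((word, keyword, "cosine"))
    return hits


if __name__ == "__main__":
    for title in ["Leather wallet for men", "Cheap tobaco pipe"]:
        print(title, screen_title(title))
```

In this sketch a title with at least one hit would be treated as high-risk and passed on for LLM-based correction, while titles with no hits skip that step. Replacing `char_ngrams` with real sentence or token embeddings would leave the `cosine` comparison unchanged.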

Document Type

Restricted Access

Submission Type

Research Project

The full text of this document is only accessible to authorized users.
