On building an interpretable topic modeling approach for the Urdu language
Faculty / School
Faculty of Computer Sciences (FCS)
Department
Department of Computer Science
Was this content written or created while at IBA?
Yes
Document Type
Conference Paper
Publication Date
1-1-2020
Conference Name
The 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020)
Conference Location
Yokohama, Japan
Conference Dates
7-15 January 2021
ISBN/ISSN
85097355737 (Scopus)
First Page
5200
Last Page
5201
Publisher
IJCAI International Joint Conference on Artificial Intelligence
Keywords
Natural language processing, NLP applications and tools, Embeddings, Natural language summarization
Abstract / Description
This research combines deep-learning-based language modeling with classical topic modeling techniques to produce interpretable topics for a given set of documents in Urdu, a low-resource language. Existing topic modeling techniques produce a collection of words, often uninterpretable, as suggested topics without integrating them into a semantically correct phrase or sentence. The proposed approach would first build an accurate part-of-speech (POS) tagger for the Urdu language using a publicly available corpus of millions of sentences. In the next step, using semantically rich feature extraction approaches including Word2Vec and BERT, it would experiment with different clustering and topic modeling techniques to produce a list of potential topics for a given set of documents. Finally, this list of topics would be sent to a labeler module to produce syntactically correct phrases that represent interpretable topics.
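The three stages described in the abstract (embedding, clustering, POS-guided labeling) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the paper: the English placeholder words, the hand-written 2-D vectors standing in for Word2Vec or BERT embeddings, the hard-coded POS tags standing in for a trained tagger, plain k-means as the clustering step, and the simple adjective-plus-noun template used by the hypothetical `label_topic` function.

```python
from math import dist

# Hypothetical 2-D vectors standing in for Word2Vec/BERT embeddings.
EMBEDDINGS = {
    "cricket": (0.9, 0.1),
    "match": (0.8, 0.2),
    "exciting": (0.85, 0.15),
    "election": (0.1, 0.9),
    "vote": (0.2, 0.8),
    "national": (0.15, 0.85),
}

# Hypothetical POS tags standing in for the output of a trained tagger.
POS_TAGS = {
    "cricket": "NOUN", "match": "NOUN", "exciting": "ADJ",
    "election": "NOUN", "vote": "NOUN", "national": "ADJ",
}


def kmeans(vectors, centroids, iterations=10):
    """Plain k-means over word vectors; returns one word list per centroid."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each word to its nearest centroid.
        clusters = [[] for _ in centroids]
        for word, vec in vectors.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(vec, centroids[i]))
            clusters[nearest].append(word)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(vectors[w][d] for w in cluster) / len(cluster)
                  for d in range(len(old)))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters


def label_topic(words, pos_tags):
    """Build an 'ADJ NOUN' phrase from the first adjective and noun found."""
    adjectives = [w for w in words if pos_tags.get(w) == "ADJ"]
    nouns = [w for w in words if pos_tags.get(w) == "NOUN"]
    return " ".join(adjectives[:1] + nouns[:1])


clusters = kmeans(EMBEDDINGS, centroids=[(1.0, 0.0), (0.0, 1.0)])
labels = [label_topic(cluster, POS_TAGS) for cluster in clusters]
print(labels)
```

In this toy setup the POS tags only constrain which words may fill each slot of the label template; a real labeler for Urdu would need language-specific phrase patterns and a ranking of candidate words.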
DOI
https://doi.org/10.24963/ijcai.2020/740
Recommended Citation
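Nasim, Z. (2020). On building an interpretable topic modeling approach for the Urdu language. The 29th International Joint Conference on Artificial Intelligence and the 17th Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), 5200-5201. https://doi.org/10.24963/ijcai.2020/740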