An Ensemble Multi‑label Themes‑Based Classification for Holy Qur'an Verses Using Word2Vec Embedding

Ensaf Hussein Mohamed; Wessam H. El‑Behaidy

Abstract

Automatic themes-based classification of Quran verses is the process of assigning verses to predefined categories or themes. It is an essential task for all Muslims and for anyone interested in studying the Quran. Themes-based classification of the Quran can be used in many natural language processing (NLP) applications, such as search engines, data mining, question-answering systems, and information retrieval. This paper presents an ensemble multi-label classification model that automatically identifies and classifies Quran verses by theme/topic. The model is composed of four phases: pre-processing, data vectorization, binary relevance classification, and a voting module. First, the verses of the second chapter of the Quran (Al-Baqarah) are tokenized and normalized, and their topics are manually labeled based on the “Mushaf Al-Tajweed” classification. Second, the verses are converted into feature vectors using term frequency-inverse document frequency (TF-IDF) and word2vec techniques. Word2vec is used to capture the semantic meaning of Quranic words and to improve performance; the word2vec embeddings are trained on a collected classical Arabic corpus of 200 million words. Third, the binary relevance multi-label classification technique is applied using three different classifiers (logistic regression, support vector machine, and random forest), which categorize verses into 393 topics/tags. Finally, the voting module picks the tags with the maximum prediction probability among the three classifiers. The results of the three classifiers and the ensemble model are compared against “Mushaf Al-Tajweed.” The ensemble model outperforms the three individual classifiers: its average Hamming loss, recall, precision, and F1-score are 0.224, 81%, 75%, and 77%, respectively.
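The voting step and the Hamming loss metric can be illustrated with a minimal sketch. This is one plausible reading of "picking the tags with the maximum prediction probability among the three classifiers", not the authors' implementation: for each verse and tag, the highest probability that any classifier assigns is kept and then thresholded. The function names, the 0.5 threshold, and the toy probabilities are all assumptions made for illustration; the actual model works with 393 tags.

```python
from typing import Dict, List

# Hypothetical layout: probs[verse][tag] = estimated probability that the tag applies.
Probs = List[List[float]]


def max_probability_vote(per_classifier: Dict[str, Probs],
                         threshold: float = 0.5) -> List[List[int]]:
    """For each verse and tag, keep the highest probability assigned by any
    classifier, then mark the tag as predicted if it reaches the threshold.
    The 0.5 threshold is an assumption, not a value from the paper."""
    matrices = list(per_classifier.values())
    n_verses = len(matrices[0])
    n_tags = len(matrices[0][0])
    predictions = []
    for v in range(n_verses):
        row = []
        for t in range(n_tags):
            best = max(m[v][t] for m in matrices)
            row.append(1 if best >= threshold else 0)
        predictions.append(row)
    return predictions


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of verse-tag pairs whose predicted label differs from the truth."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


# Toy data: 2 verses, 3 tags, made-up probabilities from three classifiers.
probs = {
    "logistic_regression": [[0.9, 0.2, 0.4], [0.1, 0.6, 0.3]],
    "svm":                 [[0.7, 0.1, 0.6], [0.2, 0.4, 0.2]],
    "random_forest":       [[0.8, 0.3, 0.3], [0.3, 0.7, 0.1]],
}
y_true = [[1, 0, 0], [0, 1, 0]]

y_pred = max_probability_vote(probs)
loss = hamming_loss(y_true, y_pred)
print(y_pred, round(loss, 3))
```

In this toy case one of the six verse-tag pairs is predicted incorrectly, so the Hamming loss is 1/6. Lower values are better, which is how the reported average of 0.224 should be read.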

