Journal: Information Processing & Management

Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms



Abstract

Multi-label text categorization is the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike for English and most other languages, no benchmark datasets are available for Arabic, which prevents the evaluation of multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization, and they used non-benchmark, inaccessible datasets. This work therefore aims to promote multi-label Arabic text categorization by (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks, which is publicly available in several formats compatible with existing multi-label learning tools such as MEKA and Mulan; and (b) conducting an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization, in order to establish baseline results and show the effectiveness of these algorithms on RTAnews. The evaluation involves four transformation-based algorithms (Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison, and Label Powerset) with three base learners (Support Vector Machine, k-Nearest-Neighbors, and Random Forest), and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN, and RFBoost). The reported baseline results show that RFBoost and Label Powerset with Support Vector Machine as the base learner outperformed the other compared algorithms. The results also show that the adaptation-based algorithms are faster than the transformation-based algorithms.
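To make the transformation-based family concrete, the following is a minimal sketch of how two of the named approaches, Binary Relevance and Label Powerset, recast a multi-label problem so that an ordinary single-label base learner can be applied. It is not the paper's implementation: the toy label sets and the function names are hypothetical and are not taken from RTAnews, MEKA, or Mulan.

```python
# Toy label sets for four documents (hypothetical, not from RTAnews).
docs_labels = [
    {"politics", "economy"},
    {"sports"},
    {"politics"},
    {"politics", "economy"},
]


def binary_relevance_targets(label_sets):
    """Binary Relevance: build one independent binary target vector per label."""
    labels = sorted(set().union(*label_sets))
    return {label: [int(label in s) for s in label_sets] for label in labels}


def label_powerset_targets(label_sets):
    """Label Powerset: map each distinct label subset to one multi-class id."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids assigned in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


br_targets = binary_relevance_targets(docs_labels)
lp_targets, lp_classes = label_powerset_targets(docs_labels)

print(br_targets)
print(lp_targets)
```

Under Binary Relevance, one binary classifier would be trained per label on its own target vector; under Label Powerset, a single multi-class classifier would be trained on the subset ids, which preserves label co-occurrence at the cost of one class per distinct subset.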
