Journal: Journal of Information Science

Boosting algorithms with topic modeling for multi-label text categorization: A comparative empirical study

Abstract

Boosting algorithms have received significant attention over the past several years and are considered state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features generated by the traditional Bag-of-Words (BOW) text representation, which dramatically increases the computational complexity. This paper investigates an alternative text representation method that uses topic modeling to enhance and accelerate multi-label boosting algorithms. An extensive empirical comparison of eight multi-label boosting algorithms using topic-based and BOW representations was undertaken. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to justify the boosting algorithms' performance, three well-known instance-based multi-label algorithms were included in the evaluation. To keep the evaluation credible, all algorithms were run in their native software tools, with changes only to data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, the BOW representation led to the best performance. MP-Boost is the most efficient and effective algorithm for imbalanced datasets using the BOW representation. For the topic-based representation, AdaBoost-MH with the meta base learners Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, in terms of computational time, these algorithms are the slowest overall. Moreover, the results indicated that the topic-based representation has a more significant effect on instance-based algorithms; nevertheless, boosting algorithms such as MP-Boost, AdaMH-Tree and AdaMH-Product notably exceed the instance-based algorithms' performance.
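As a rough illustration of why a topic-based representation shrinks the feature space a boosting learner has to work with, the sketch below builds both representations for a toy corpus. Everything in it is an assumption made for illustration: the three documents, the hard `word_topic` mapping (a stand-in for a trained topic model, which would normally give learned, soft word-topic probabilities), and the helper names `bow_vector` and `topic_vector`. It is not the paper's pipeline and uses none of its datasets or tools.

```python
from collections import Counter

# Toy corpus (hypothetical; not taken from the paper's datasets).
docs = [
    "stock market prices fall",
    "team wins football match",
    "market reacts to football club sale",
]

# Hypothetical hard word-to-topic assignment standing in for a trained
# topic model. Topic 0 is loosely "finance", topic 1 is loosely "sport".
word_topic = {
    "stock": 0, "market": 0, "prices": 0, "fall": 0, "sale": 0,
    "team": 1, "wins": 1, "football": 1, "match": 1, "club": 1,
}
num_topics = 2


def bow_vector(doc, vocab):
    """Return term counts for doc, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def topic_vector(doc, word_topic, num_topics):
    """Return the share of doc's mapped words that fall in each topic."""
    totals = [0] * num_topics
    for word in doc.split():
        if word in word_topic:
            totals[word_topic[word]] += 1
    mapped = sum(totals)
    return [t / mapped if mapped else 0.0 for t in totals]


# BOW: one feature per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})
bow = [bow_vector(doc, vocab) for doc in docs]

# Topic-based: one feature per topic, regardless of vocabulary size.
topics = [topic_vector(doc, word_topic, num_topics) for doc in docs]

print("BOW features per document:  ", len(bow[0]))
print("Topic features per document:", len(topics[0]))
```

Even in this tiny example the BOW vectors have one entry per distinct word while the topic vectors have one entry per topic. On a real corpus the vocabulary grows with the collection, whereas the number of topics is fixed in advance, which is consistent with the speed-up the abstract reports when the learner has fewer features to examine.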
