Boosting algorithms with topic modeling for multi-label text categorization: A comparative empirical study

Bassam Al-Salemi; Mohd. Juzaiddin Ab Aziz; Shahrul Azman Noah

Abstract

Boosting algorithms have received significant attention over the past several years and are considered state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features generated by the traditional Bag-of-Words (BOW) text representation, which dramatically increases the computational complexity. This paper investigates an alternative text representation method that uses topic modeling to enhance and accelerate multi-label boosting algorithms. An extensive empirical comparison of eight multi-label boosting algorithms was undertaken using topic-based and BOW representation methods. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to put the performance of the boosting algorithms in context, three well-known instance-based multi-label algorithms were included in the evaluation. To keep the evaluation credible, all algorithms were evaluated using their native software tools, with changes limited to data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced the classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, BOW representation led to the best performance. The MP-Boost algorithm is the most efficient and effective algorithm for imbalanced datasets using BOW representation. For topic-based representation, AdaBoost-MH with the meta base learners Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, in terms of computational time, these algorithms are the slowest overall. Moreover, the results indicated that topic-based representation has a more significant effect on instance-based algorithms; nevertheless, boosting algorithms such as MP-Boost, AdaMH-Tree and AdaMH-Product notably outperform them.
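As a rough illustration of why the choice of representation affects computational cost, the sketch below contrasts the size of the two feature spaces on a toy corpus. Everything in it is hypothetical: the four documents, the `word_topic` table, and the helper names `bow_vector` and `topic_vector` are invented for illustration. The hand-written table only stands in for a fitted topic model (for example LDA) and is not the method or data used in the paper.

```python
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the paper's datasets).
docs = [
    "stocks fall as markets react to interest rates",
    "team wins match after late goal",
    "central bank raises interest rates again",
    "coach praises team after cup match",
]

# Bag-of-Words: one feature per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})


def bow_vector(doc):
    """Return raw term counts over the whole vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


# Stand-in for a fitted topic model: hand-written word -> topic weights.
# A real topic model would learn these weights from data (assumption).
TOPICS = ["finance", "sport"]
word_topic = {
    "stocks": (1.0, 0.0), "markets": (1.0, 0.0), "interest": (1.0, 0.0),
    "rates": (1.0, 0.0), "bank": (1.0, 0.0), "central": (1.0, 0.0),
    "team": (0.0, 1.0), "match": (0.0, 1.0), "goal": (0.0, 1.0),
    "coach": (0.0, 1.0), "cup": (0.0, 1.0), "wins": (0.0, 1.0),
}


def topic_vector(doc):
    """Return normalized topic proportions; unknown words contribute nothing."""
    totals = [0.0] * len(TOPICS)
    no_weight = (0.0,) * len(TOPICS)
    for word in doc.split():
        for k, weight in enumerate(word_topic.get(word, no_weight)):
            totals[k] += weight
    norm = sum(totals)
    return [t / norm for t in totals] if norm else totals


for doc in docs:
    print(len(bow_vector(doc)), "BOW features ->", topic_vector(doc))
```

Each document is described by one feature per vocabulary word under BOW, but by only `len(TOPICS)` features under the topic-based representation. A learner that scans every feature has correspondingly less work to do per document, which is the intuition behind the speed-up reported in the abstract.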

Bibliographic details

Source
Journal of Information Science, 2015, Issue 5, pp. 732-746 (15 pages)
Authors
Bassam Al-Salemi; Mohd. Juzaiddin Ab Aziz; Shahrul Azman Noah
Author affiliations

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi Selangor, Malaysia;

Universiti Kebangsaan Malaysia, Malaysia;

Universiti Kebangsaan Malaysia, Malaysia;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
AdaBoost.MH; boosting; multi-label classification; text categorization; text representation; topic modeling

Similar literature

1. Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms [J]. Bassam Al-Salemi, Masri Ayob, Graham Kendall. Information Processing & Management, 2019, Issue 1.
2. Feature ranking for enhancing boosting-based multi-label text categorization [J]. Al-Salemi Bassam, Ayob Masri, Noah Shahrul Azman Mohd. Expert Systems with Applications, 2018, Issue DECa.
3. Boosting multi-label hierarchical text categorization [J]. Andrea Esuli, Tiziano Fagni, Fabrizio Sebastiani. Information Retrieval, 2008, Issue 4.
4. Feature selection based on supervised topic modeling for boosting-based multi-label text categorization [C]. Bassam Al-Salemi, Masri Ayob, Shahrul Azman Mohd Noah. International Conference on Electrical Engineering and Informatics, 2017.
5. An Empirical Study of Edge Computing Architectural Framework Boosted With a New Caching Algorithm [D]. Zhang, Ziyin. 2019.
6. Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling [O]. Aytuğ Onan. 2018.
7. TreeBoost.MH: a boosting algorithm for multi-label hierarchical text categorization [O]. Esuli Andrea, Fagni Tiziano, Sebastiani Fabrizio. 2006.
