Arabian Journal for Science and Engineering. Section A, Sciences

Improving Text Classification Performance with Random Forests-Based Feature Selection

Abstract

Feature selection (FS) is used to make text classification (TC) more effective. Well-known FS metrics such as information gain (IG) and odds ratio (OR) rank terms without considering term interactions. FS algorithms that do consider term interactions can yield better-performing classifiers, but their computational complexity is a concern. This has led to two-stage algorithms such as information gain-principal component analysis (IG-PCA). Random forests-based feature selection (RFFS), proposed by Breiman, has demonstrated outstanding performance while capturing gene-gene relations in bioinformatics, but its usefulness for TC is less explored. RFFS has fewer control parameters and is found to be resistant to overfitting, and thus generalizes well to new data. It does not require a test data set to report accuracy and does not use conventional cross-validation. This paper investigates how RFFS works for TC and compares its performance against IG, OR, and IG-PCA. We carry out experiments on four widely used text data sets using naive Bayes and support vector machines as classifiers. RFFS achieves higher macro-F1 values than the other FS algorithms in 73% of the experimental instances. We also analyze the performance of RFFS for TC in terms of its parameters and the class skews of the data sets, which yields interesting results.
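
To illustrate what it means for IG and OR to rank terms without considering term interactions, the following is a minimal Python sketch, not the authors' implementation. It assumes documents are represented as sets of terms (binary presence), uses an invented four-document toy corpus, and applies additive smoothing of 0.5 in the odds ratio, which is an assumption made here only to avoid division by zero. The function names are hypothetical. Each term is scored from its own presence counts alone, so no score depends on any other term.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` is present."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def log_odds_ratio(docs, labels, term, positive, smoothing=0.5):
    """Log odds ratio of `term` for class `positive` versus all other classes.

    The additive smoothing is an assumption of this sketch; it keeps the
    ratio finite when a term never (or always) occurs in one group.
    """
    pos = [d for d, y in zip(docs, labels) if y == positive]
    neg = [d for d, y in zip(docs, labels) if y != positive]
    p_pos = (sum(term in d for d in pos) + smoothing) / (len(pos) + 2 * smoothing)
    p_neg = (sum(term in d for d in neg) + smoothing) / (len(neg) + 2 * smoothing)
    return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))


# Invented toy corpus: each document is the set of terms it contains.
docs = [
    {"goal", "match", "team"},
    {"goal", "team", "season"},
    {"vote", "party", "team"},
    {"vote", "election", "party"},
]
labels = ["sport", "sport", "politics", "politics"]

vocabulary = sorted(set().union(*docs))

# Rank terms by IG; every score is computed independently of other terms.
ig_ranking = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)

for term in ig_ranking:
    print(term,
          round(information_gain(docs, labels, term), 3),
          round(log_odds_ratio(docs, labels, term, "sport"), 3))
```

In this toy corpus, "goal" separates the two classes perfectly and receives the maximum IG, while "team" appears in both classes and scores lower. A method that accounts for term interactions, such as the RFFS approach the abstract describes, would instead evaluate terms in the context of other terms.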
