Improving Text Classification Performance with Random Forests-Based Feature Selection

Sameen Maruf; Kashif Javed; Haroon A. Babri

Abstract

Feature selection (FS) is employed to make text classification (TC) more effective. Well-known FS metrics such as information gain (IG) and odds ratio (OR) rank terms without considering term interactions. Building classifiers with FS algorithms that do consider term interactions can yield better performance, but their computational complexity is a concern. This has led to two-stage algorithms such as information gain-principal component analysis (IG-PCA). Random forests-based feature selection (RFFS), proposed by Breiman, has demonstrated outstanding performance in capturing gene-gene relations in bioinformatics, but its usefulness for TC is less explored. RFFS has fewer control parameters and is found to be resistant to overfitting, and thus generalizes well to new data. It does not require a separate test data set to report accuracy and does not use conventional cross-validation. This paper investigates the working of RFFS for TC and compares its performance against IG, OR, and IG-PCA. We carry out experiments on four widely used text data sets using naive Bayes and support vector machines as classifiers. RFFS achieves macro-F_1 values higher than the other FS algorithms in 73% of the experimental instances. We also analyze the performance of RFFS for TC in terms of its parameters and the class skews of the data sets, and obtain interesting results.
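
To illustrate what "ranking terms without considering term interactions" means for the IG and OR baselines, here is a minimal sketch using the textbook binary-class definitions. It is not the authors' implementation: the function names, the toy corpus, and the 0.5 additive smoothing in the odds ratio are assumptions made for illustration. Each term is scored from its own 2x2 contingency table only, so no information about co-occurring terms enters the score.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(tp, fp, fn, tn):
    """IG of a binary term with respect to a binary class.

    tp: positive docs containing the term, fp: negative docs containing it,
    fn: positive docs without the term,    tn: negative docs without it.
    """
    n = tp + fp + fn + tn
    h_class = entropy([tp + fn, fp + tn])
    h_given_term = 0.0
    for pos, neg in ((tp, fp), (fn, tn)):  # term present, then term absent
        if pos + neg:
            h_given_term += (pos + neg) / n * entropy([pos, neg])
    return h_class - h_given_term


def log_odds_ratio(tp, fp, fn, tn, smooth=0.5):
    """Log odds ratio for the positive class; the smoothing is an assumption."""
    return math.log(((tp + smooth) * (tn + smooth)) / ((fp + smooth) * (fn + smooth)))


def rank_terms(docs, labels, score):
    """Score every term independently and return the terms best-first."""
    vocab = sorted({term for doc in docs for term in doc})
    scored = []
    for term in vocab:
        tp = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
        fp = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
        fn = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
        tn = sum(1 for d, y in zip(docs, labels) if term not in d and y == 0)
        scored.append((score(tp, fp, fn, tn), term))
    # Sort by descending score, breaking ties alphabetically.
    return [term for _, term in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy corpus: each document is a set of terms, label 1 = "sports".
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "match"}, {"vote", "party"}]
labels = [1, 1, 0, 0]

ig_ranking = rank_terms(docs, labels, information_gain)
or_ranking = rank_terms(docs, labels, log_odds_ratio)
print("IG ranking:", ig_ranking)
print("OR ranking:", or_ranking)
```

On this toy corpus, IG rewards any term that separates the classes in either direction, while the log odds ratio is one-sided and favors terms indicative of the positive class. RFFS differs from both in that term relevance is derived from trees that split on several terms jointly.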

Bibliographic details

Source
Arabian Journal for Science and Engineering. Section A, Sciences, 2016, Issue 3, pp. 951-964 (14 pages)
Authors
Sameen Maruf; Kashif Javed; Haroon A. Babri
Author affiliations

Faculty of Engineering, University of Central Punjab, Lahore, Pakistan;

Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan;

Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan;

Indexed in: Science Citation Index (SCI), USA
Original format: PDF
Language: English
Keywords
Text classification; Feature selection; Random forests; Term interactions;


Similar literature

1. A Two-stage Text Feature Selection Algorithm for Improving Text Classification [J]. Ashokkumar P., Shankar Siva G., Srivastava Gautam. ACM Transactions on Asian and Low-Resource Language Information Processing, 2021, Issue 3.
2. A novel random forests-based feature selection method for microarray expression data analysis [J]. Yao Dengju, Yang Jing, Zhan Xiaojuan. International Journal of Data Mining and Bioinformatics, 2015, Issue 1.
3. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach [J]. Ursula Neumann, Mona Riemenschneider, Jan-Peter Sowa. BioData Mining, 2016, Issue 1.
4. Random Forests-Based Feature Selection for Land-Use Classification Using LiDAR Data and Orthoimagery [C]. Haiyan Guan, Jun Yu, Jonathan Li. ISPRS Congress, 2013.
5. Improving Feature Learning, Feature Selection, and Classification in Facial Expression Analysis [D]. Liu, Ping, 2015.
6. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach [O]. Ursula Neumann, Mona Riemenschneider, Jan-Peter Sowa, 2016.
7. Random Forests-Based Feature Selection for Land-Use Classification Using LiDAR Data and Orthoimagery [O]. Haiyan Guan, Jun Yu, Jonathan Li, 2016.
