Journal: Journal of University of Electronic Science and Technology of China (《电子科技大学学报》)

Text Feature Gene Extraction Method for Imbalanced Big Datasets

Abstract

On imbalanced big datasets, traditional feature processing methods are biased toward large classes and neglect small ones, which degrades classification performance. This paper proposes a text feature gene extraction method. First, to account for the effect of imbalanced class distribution on feature selection, a feature selection method based on a CHI statistical matrix combined with information entropy is presented to strengthen the features of small classes. Second, building on an investigation of the high-order correlation of multidimensional statistical data, independent component analysis is used to design a text feature gene extraction method that improves the generalization ability of feature items. Finally, the two methods are combined into a new text feature gene extraction method for imbalanced big datasets. Experimental results show that the proposed method performs well in early maturity and feature dimensionality reduction, and classifies small classes better than common feature selection algorithms.
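The abstract does not give the paper's formulas, so the following is only an illustrative sketch of the first step: a term-by-class CHI (chi-square) matrix combined with an entropy-based weight. The function names (`chi_matrix`, `entropy_weight`, `score_terms`), the class-size normalization inside the entropy, and the way the two quantities are multiplied are all assumptions made for illustration, not the authors' method. The independent component analysis step is not shown.

```python
import math
from collections import Counter


def chi_matrix(docs, labels):
    """Return ({term: {cls: chi2}}, df_in, class_sizes) from 2x2 term/class tables."""
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()       # term -> number of documents containing it
    df_in = Counter()    # (term, cls) -> documents of cls containing the term
    for terms, cls in zip(docs, labels):
        for t in set(terms):
            df[t] += 1
            df_in[(t, cls)] += 1
    chi = {}
    for t in df:
        chi[t] = {}
        for cls, n_c in class_sizes.items():
            a = df_in[(t, cls)]   # in class, has term
            b = df[t] - a         # other classes, has term
            c = n_c - a           # in class, lacks term
            d = n - n_c - b       # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            chi[t][cls] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return chi, df_in, class_sizes


def entropy_weight(term, df_in, class_sizes):
    """Assumed weight: 1 - normalized entropy of the term's per-class rates.

    Rates are divided by class size so that a small class is not drowned
    out by a large one (an illustrative choice, not taken from the paper).
    """
    rates = [df_in[(term, cls)] / n_c for cls, n_c in class_sizes.items()]
    total = sum(rates)
    k = len(rates)
    if total == 0 or k < 2:
        return 0.0
    probs = [r / total for r in rates if r > 0]
    h = -sum(p * math.log(p) for p in probs)
    return max(0.0, 1.0 - h / math.log(k))


def score_terms(docs, labels):
    """Assumed combination: best per-class CHI value times the entropy weight."""
    chi, df_in, class_sizes = chi_matrix(docs, labels)
    return {t: max(row.values()) * entropy_weight(t, df_in, class_sizes)
            for t, row in chi.items()}


# Toy imbalanced corpus: six "sport" documents, two "law" documents.
docs = [["match", "team"], ["team", "goal"], ["match", "goal"],
        ["team", "news"], ["goal", "news"], ["match", "team"],
        ["court", "news"], ["court", "judge"]]
labels = ["sport"] * 6 + ["law"] * 2
scores = score_terms(docs, labels)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term}: {scores[term]:.3f}")
```

In this toy corpus, "court" occurs only in the small class and so receives the full entropy weight, while "news" is spread across both classes and is scored close to zero.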
