Conference: International Seminar on Application for Technology of Information and Communication

Influence of Word Normalization and Chi-Squared Feature Selection on Support Vector Machine (SVM) Text Classification

Abstract

In this study, we used a Support Vector Machine (SVM) for text classification. Words were normalized by either stemming or lemmatization, and Chi-squared feature selection was then applied before classification. Preprocessing consisted of stop-word removal and tokenization. We used the BBC dataset, which contains 2,225 documents in 5 categories. Stemming produced 21,813 features and lemmatization produced 31,007 features. Each feature represents the number of times a word occurs in a document. We used a confusion matrix to evaluate the text classification results. SVM text classification using stemming combined with Chi-squared selection (method 1) performed better than using lemmatization combined with Chi-squared selection (method 2). The best performance was obtained at 80% feature reduction, where method 1 achieved a precision of 95%, a recall of 95%, and an accuracy of 95.05%. Method 2 achieved only a precision of 93%, a recall of 93%, and an accuracy of 93.24% at the same level of feature reduction.
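The abstract does not say how the Chi-squared scores were computed or which tools were used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the textbook 2x2 contingency form of the statistic (term present or absent versus document in or out of a class), takes the maximum score over classes, and then keeps the top-scoring share of the vocabulary. The toy corpus, the function names `chi2_scores` and `select_features`, and the tie-breaking rule are all hypothetical. Other formulations, for example ones based on word counts rather than term presence, will give different values.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Score each term with a 2x2 chi-squared statistic, maximized over classes.

    docs   -- list of token lists (already tokenized and normalized)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    doc_terms = [set(tokens) for tokens in docs]  # term presence per document
    vocab = sorted(set().union(*doc_terms))
    class_sizes = Counter(labels)

    scores = {}
    for term in vocab:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            # a: term present, in class; b: term present, other classes
            a = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == cls)
            b = sum(1 for t, y in zip(doc_terms, labels) if term in t and y != cls)
            c = n_cls - a          # term absent, in class
            d = n - n_cls - b      # term absent, other classes
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:              # skip degenerate tables (term in all or no docs)
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_features(scores, reduction):
    """Keep the highest-scoring terms after dropping `reduction` of the vocabulary."""
    keep = max(1, round(len(scores) * (1 - reduction)))
    # Ties are broken alphabetically (an arbitrary choice for reproducibility).
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:keep]


# Hypothetical toy corpus; the BBC dataset itself is not reproduced here.
docs = [
    ["match", "goal", "team", "said"],
    ["team", "win", "match", "said"],
    ["market", "shares", "bank", "said"],
    ["bank", "profit", "market", "said"],
]
labels = ["sport", "sport", "business", "business"]

scores = chi2_scores(docs, labels)
selected = select_features(scores, reduction=0.5)
```

In this toy corpus, a word that appears in every document ("said") scores zero because it carries no class information, while words confined to one class rank highest. Passing `reduction=0.8` would correspond to the 80% feature reduction level reported in the abstract.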