Conference: International Conference on Engineering MIS
Arabic text classification: New study

Abstract

Text classification performance depends considerably on the feature type, that is, the kind of unit extracted from the text and presented to the classification algorithm. Character N-grams, word roots, word stems, and full words have all been used as features for Arabic text classification. A survey of the current literature found no prior study of the effect of root N-grams and stem N-grams (N consecutive roots or stems) on Arabic text classification performance. We therefore conducted 108 experiments using three feature types (1-grams, 2-grams, and 3-grams) of roots, stems, and full words. Chi-square was used for feature selection, with three thresholds for the number of features (100, 500, and 1000). Term frequency-inverse document frequency (TF-IDF) was used as the representation scheme. Three classifiers were evaluated: Naïve Bayes, K-Nearest Neighbor, and Support Vector Machine. The results show that root 1-grams provide better classification performance for Arabic text than stem or word N-grams. They also show that classification performance decreases as the number of N-grams increases, and that the Support Vector Machine outperforms Naïve Bayes and K-Nearest Neighbor with 1-grams. With K-Nearest Neighbor, however, root 2-grams achieved the best performance, whereas with the Support Vector Machine, root 3-grams achieved the best performance.
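The feature pipeline described in the abstract (token-level N-grams, chi-square selection of the top k features, TF-IDF weighting) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the abstract does not state which chi-square or TF-IDF variant was used, so the 2x2 document-presence chi-square and the plain tf * log(N/df) weighting below are assumptions, and the transliterated tokens and the two category labels are hypothetical toy data. Root or stem extraction, which requires an Arabic morphological analyzer, and the classifier step are both omitted.

```python
import math
from collections import Counter

def token_ngrams(tokens, n):
    """Join every run of n consecutive tokens (roots, stems, or words) into one feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def chi_square(docs, labels, term, category):
    """2x2 chi-square statistic for one term/category pair, based on document presence."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score (maximum over categories)."""
    vocab = sorted({term for doc in docs for term in doc})
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: -scores[t])[:k]

def tfidf(docs, features):
    """Weight each selected feature by raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    df = {f: sum(1 for doc in docs if f in doc) for f in features}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[f] * math.log(n_docs / df[f]) if df[f] else 0.0
                        for f in features])
    return vectors

# Hypothetical toy corpus: each document is already reduced to a token sequence.
raw = [["ktb", "drs", "elm"], ["ktb", "drs", "qra"],
       ["lab", "kra", "fwz"], ["lab", "kra", "hdf"]]
labels = ["edu", "edu", "sport", "sport"]

docs = [token_ngrams(tokens, 2) for tokens in raw]   # 2-grams of consecutive tokens
features = select_features(docs, labels, k=2)        # the study used 100, 500, and 1000
vectors = tfidf(docs, features)
print(features)  # ['ktb drs', 'lab kra']
```

The resulting vectors would then be passed to a classifier such as Naïve Bayes, K-Nearest Neighbor, or a Support Vector Machine. Changing the `n` argument of `token_ngrams` and the `k` argument of `select_features` corresponds to the N-gram sizes and feature-count thresholds varied in the experiments.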
