Journal of Chinese Information Processing (《中文信息学报》)

The Influence of Part of Speech on Chinese and English Text Clustering

Abstract

Features drawn from different parts of speech contribute differently to text clustering. Using four representative Chinese and English datasets and three clustering algorithms, this paper examines how the four major parts of speech (nouns, verbs, adjectives, and adverbs) and their combinations affect Chinese and English text clustering. The experimental results show that, in both languages, nouns are the most important part of speech for representing document content, while verbs, adjectives, and adverbs each also help the clustering result. Clustering on nouns alone gives results close to those obtained when all parts of speech are retained, and it greatly reduces the dimensionality of the text representation, but it does not achieve the best clustering performance. Compared with other combinations and with any single part of speech, the combined features of nouns, verbs, adjectives, and adverbs often yield better clustering. The same part of speech differs considerably between Chinese and English, both in the proportion of words it accounts for and in the results of single part-of-speech clustering. Compared with English, the differences among part-of-speech features and their combinations are more stable in Chinese text clustering.
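The feature-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tagged documents, the coarse tag names (`NOUN`, `VERB`, `ADJ`, `ADV`, and so on), and the helper functions `filter_by_pos`, `vocabulary`, and `to_vector` are all assumptions made for illustration. The sketch only shows how restricting features to certain parts of speech shrinks the vocabulary (the dimensionality) before any clustering algorithm is applied.

```python
from collections import Counter

# Hypothetical pre-tagged documents: (token, coarse POS tag) pairs.
# In practice the tags would come from a POS tagger; these are made up.
DOCS = [
    [("market", "NOUN"), ("rises", "VERB"), ("quickly", "ADV"),
     ("on", "ADP"), ("strong", "ADJ"), ("earnings", "NOUN")],
    [("team", "NOUN"), ("wins", "VERB"), ("the", "DET"),
     ("final", "ADJ"), ("match", "NOUN")],
]

def filter_by_pos(doc, keep):
    """Return the tokens of one document whose tag is in `keep`."""
    return [tok for tok, tag in doc if tag in keep]

def vocabulary(docs, keep):
    """Sorted list of distinct tokens that survive the POS filter."""
    return sorted({tok for doc in docs for tok in filter_by_pos(doc, keep)})

def to_vector(doc, vocab, keep):
    """Term-count vector of one document over the filtered vocabulary."""
    counts = Counter(filter_by_pos(doc, keep))
    return [counts[term] for term in vocab]

# Three feature sets: nouns only, the four major POS, and every tag seen.
NOUNS = {"NOUN"}
FOUR_POS = {"NOUN", "VERB", "ADJ", "ADV"}
ALL_TAGS = {tag for doc in DOCS for _, tag in doc}

noun_vocab = vocabulary(DOCS, NOUNS)
four_vocab = vocabulary(DOCS, FOUR_POS)
all_vocab = vocabulary(DOCS, ALL_TAGS)

# Each vector could then be passed to any clustering algorithm.
noun_vectors = [to_vector(doc, noun_vocab, NOUNS) for doc in DOCS]

print(len(noun_vocab), len(four_vocab), len(all_vocab))
```

On this toy input the noun-only vocabulary is less than half the size of the full one, which mirrors, in miniature, the dimensionality reduction the abstract reports for noun-only features.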
