Conference: 2012 International Symposium on Information Theory and its Applications

Chinese text categorization using the character N-gram



Abstract

We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and applied it to English, Japanese, and Korean text documents. The accumulation method does not depend on language structure because it uses character N-grams as index terms. As long as the text documents are encoded in Unicode, the same algorithm can classify them regardless of language. In the present paper, we classify Chinese text documents, namely newspaper articles from the People's Daily 2009–2010 data set. The highest macro-averaged F-measure of the proposed method on this data set was 92.6%, which indicates that the method also works well for Chinese. Moreover, we can construct a framework in which the computer automatically judges how difficult each document is to classify.
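
To make the idea of character N-gram index terms concrete, the following is a minimal sketch of how overlapping N-grams could be extracted from Unicode text. It is not the accumulation method itself, whose scoring details are not given in the abstract. The function names `char_ngrams` and `ngram_counts` are hypothetical, and the choice to leave whitespace and punctuation untouched is an assumption.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character N-grams of `text` (hypothetical helper).

    A Python str is a sequence of Unicode code points, so the same slicing
    works for Chinese, Japanese, Korean, or English without word segmentation.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A text shorter than n yields an empty list because the range is empty.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text: str, n: int) -> Counter:
    """Count how often each character N-gram occurs in `text` (hypothetical helper)."""
    return Counter(char_ngrams(text, n))


if __name__ == "__main__":
    print(char_ngrams("人民日报", 2))   # ['人民', '民日', '日报']
    print(ngram_counts("banana", 2))
```

Because no tokenizer or dictionary is involved, the same two functions could feed any classifier that accepts term counts, which is the property the abstract attributes to N-gram index terms.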
