Chinese text categorization using the character N-gram


Abstract

We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English, Japanese, and Korean text documents. The accumulation method does not depend on language structure because it uses character N-grams to form index terms. As long as text documents are expressed in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we classify Chinese text documents, namely newspaper articles from the People's Daily 2009–2010 data set. The highest macro-averaged F-measure of the proposed method on this data set was 92.6%. Thus, we obtain good results for the Chinese language. Moreover, we can construct a framework whereby the computer can automatically distinguish the difficulty of classifying each document.
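The abstract names two ingredients without spelling them out: index terms formed from character N-grams over Unicode text, and evaluation by macro-averaged F-measure. The following is a minimal sketch of both, not the authors' accumulation method, whose scoring step is not described in this record. The function names (`char_ngrams`, `index_terms`, `macro_f_measure`) are hypothetical, and averaging over every label that appears in either the gold or the predicted labels is an assumed convention.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`.

    Python 3 strings are sequences of Unicode code points, so the same
    slicing works for Chinese, Japanese, Korean, or English text.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text: str, n: int) -> Counter:
    """Count each character n-gram, giving a simple bag of index terms."""
    return Counter(char_ngrams(text, n))


def macro_f_measure(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of the per-class F-measure.

    Assumption: classes are every label seen in `gold` or `predicted`,
    and a class with no true positives scores 0.
    """
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(char_ngrams("人民日报", 2))
    print(index_terms("abab", 2))
    print(macro_f_measure(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because no word segmentation is involved, the same `char_ngrams` call applies unchanged to any language, which is the property the abstract relies on.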


Bibliographic details

Source: 2012 International Symposium on Information Theory and its Applications | 2012 | pp. 722-726 | 5 pages
Conference location: Hawaii, HI (US)
Authors: Suzuki Makoto; Yamagishi Naohide; Tsai Yi-Ching
Author affiliation: Shonan Institute of Technology, 1-1-25 Tsujido Nishikaigan, Fujisawa, Kanagawa 251-8511, Japan
Original format: PDF
Language of text: English
Chinese Library Classification: Information processing technology

Similar literature


1. Efficient n-gram construction for text categorization using feature selection techniques [J]. Garcia Maximiliano, Maldonado Sebastian, Vairetti Carla. Intelligent Data Analysis, 2021, No. 3.
2. N-gram Based Text Categorization Method for Improved Data Mining [J]. Kennedy Ogada, Waweru Mwangi, Wilson Cheruiyot. Journal of Information Engineering and Applications, 2015, No. 8.
3. A variant of n-gram based language-independent text categorization [J]. Jelena Graovac. Intelligent Data Analysis, 2014, No. 4.
4. Chinese text categorization using the character N-gram [C]. Suzuki Makoto, Yamagishi Naohide, Tsai Yi-Ching. International Symposium on Information Theory and its Applications, 2012.
5. An N-gram enhanced learning classifier for Chinese character recognition [D]. Ayer, Eliot William. 2013.
6. Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features [O]. Mindy K. Ross, Ko-Wei Lin, Karen Truong. 2013.
7. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization [O]. Jingyang Li, Maosong Sun, Xian Zhang. 2006.
8. Phonetic and Structural Encoding of Chinese Characters in Chinese Texts [R]. Boitet, C., Tcheou, F. X. 1990.
