Journal of Intelligent Information Systems

Automatic Category Theme Identification and Hierarchy Generation for Chinese Text Categorization


Abstract

Research on text mining has recently attracted considerable attention from both industry and academia. Text mining is concerned with discovering unknown patterns or knowledge in a large text repository. The problem is hard to tackle because the texts under consideration are semi-structured or even unstructured. Many approaches have been devised for mining various kinds of knowledge from texts. One important aspect of text mining is automatic text categorization, which assigns a text document to a predefined category if the document falls within the theme of that category. Traditionally, the categories are arranged hierarchically to support effective searching and indexing and to make them easy for people to comprehend. The category themes and their hierarchical structure have mostly been determined by human experts. In this work, we developed an approach that automatically generates category themes and reveals the hierarchical structure among them. We also used the generated structure to categorize text documents. A self-organizing map was trained on the document collection to form two feature maps. These maps were then analyzed to obtain the category themes and their structure. Although the test corpus contains documents written in Chinese, the proposed approach can be applied to documents written in any language, provided that such documents can be transformed into a list of separated terms.
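To make the self-organizing map step in the abstract more concrete, here is a minimal sketch of a generic SOM trained on term vectors. It is not the authors' implementation: the toy term-document matrix, the grid size, the learning-rate and neighborhood schedules, and the function names `train_som` and `best_matching_units` are all assumptions chosen for illustration.

```python
import numpy as np

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small self-organizing map (generic textbook variant).

    Returns a weight matrix of shape (rows * cols, dim), one row per neuron.
    """
    rng = np.random.default_rng(seed)
    n_samples, dim = data.shape
    weights = rng.random((rows * cols, dim))
    # Grid coordinates of each neuron, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood radius
        for idx in rng.permutation(n_samples):
            x = data[idx]
            # Best-matching unit: the neuron whose weights are closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood centered on the BMU in grid space.
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights

def best_matching_units(data, weights):
    """Return, for each input vector, the index of its closest neuron."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical toy data: rows are documents, columns are terms (1 = term occurs).
docs = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

weights = train_som(docs, rows=2, cols=2)
labels = best_matching_units(docs, weights)
print(labels)  # neuron index assigned to each document
```

Documents with identical term vectors always land on the same neuron, so the neuron index can serve as a rough cluster label. How the two feature maps are analyzed to derive category themes and their hierarchy is described in the paper itself and is not reproduced here.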
