An automatic device for training and classifying documents based on N-gram statistics, and an automatic method for training and classifying documents based on N-gram statistics therefor

Abstract

The present invention relates to an apparatus and method for automatically learning documents, and an apparatus and method for automatically classifying documents, which can learn and classify large volumes of web documents automatically using n-grams. The automatic document classification apparatus according to the invention includes: a learning document pool containing a plurality of learning document groups classified by category; a preprocessing unit configured to preprocess each learning document group in the pool; and an n-gram data set pool configured to store the n-gram data sets learned from the learning document pool through that preprocessing. The apparatus further includes: an automatic document learning unit configured to have the preprocessing unit preprocess a new document, when a new document appears that is not identified through the learning document pool, to form a bigram set; and an automatic document classifying unit configured to compare the new document's bigram set with the bigram sets in the n-gram data set pool and to allocate and store the new document's bigram set in one of the pool's n-gram data sets. [Reference numerals] (220) Automatic document classifying unit; (230) Learned n-gram data set (bigram example); (AA) Non-identified document; (BB) Appearance of a new document; (CC) Preprocessing
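The learn-then-classify flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the abstract does not say how documents are preprocessed, whether bigrams are formed over words or characters, or how bigram sets are compared. The sketch assumes lowercase word tokens, word-level bigrams, and a simple shared-bigram count as the similarity score. The function names `preprocess`, `bigrams`, `train`, and `classify` are hypothetical.

```python
import re
from collections import Counter


def preprocess(text):
    """Assumed preprocessing: lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Count adjacent word pairs (word-level bigrams are an assumption)."""
    return Counter(zip(tokens, tokens[1:]))


def train(pool):
    """Build one bigram data set per category from {category: [documents]}."""
    data_sets = {}
    for category, docs in pool.items():
        counts = Counter()
        for doc in docs:
            counts.update(bigrams(preprocess(doc)))
        data_sets[category] = counts
    return data_sets


def classify(document, data_sets):
    """Pick the category whose data set shares the most bigrams with the
    document (assumed similarity measure), then store the document's
    bigrams in that category's data set."""
    doc_bigrams = bigrams(preprocess(document))

    def overlap(category):
        known = data_sets[category]
        return sum(n for bg, n in doc_bigrams.items() if bg in known)

    best = max(data_sets, key=overlap)
    data_sets[best].update(doc_bigrams)
    return best


# Toy learning document pool; the categories and texts are invented for illustration.
pool = {
    "sports": ["the team won the match", "the team lost the final match"],
    "cooking": ["boil the pasta in salted water", "chop the onion and boil the rice"],
}
data_sets = train(pool)
label = classify("the team won the final", data_sets)
print(label)
```

In this sketch a document that shares no bigram with any category is still assigned to the first category in the pool, because `max` returns the first maximal key. A real system would probably need a threshold or an "unclassified" outcome, which the abstract does not describe.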

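Bibliographic data

  • Publication number: KR101400548B1

  • Patent type:

  • Publication date: 2014-05-27

  • Original format: PDF

  • Applicant/patentee:

  • Application number: KR20120115730

  • Inventors: 김판구; 최동진; 김정인; 고미아

  • Filing date: 2012-10-18

  • Classification: G06F17/21; G06F17/27

  • Country: KR

  • Date indexed: 2022-08-21 15:40:51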
