Segmented Document Classification: Problem and Solution

Abstract

In recent years, structured text documents such as XML files have played an important role in Web-based applications. Some of these documents are segmented into sections such as "title" and "body"; we call them "segmented documents". To classify segmented documents, we can treat them as bags of words and use well-developed text classification models. However, different sections of a segmented document may have different impacts on the classification result, so it is better to treat them differently in the classification process. Following this idea, we design two algorithms, IN_MIX and OUT_MIX, that label segmented documents using a trained classifier. We evaluate the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the different classification models improves compared with the conventional bag-of-words method.
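The abstract does not spell out how IN_MIX or OUT_MIX work, so the sketch below is not either algorithm. It only illustrates the general idea of treating sections differently: each section gets its own bag of words, and the bags are merged with per-section weights before being passed to any classifier. The function names, the example document, and the weight of 3.0 for the title are all hypothetical choices made for illustration.

```python
from collections import Counter


def section_counts(text):
    """Count lowercased whitespace tokens: a plain bag of words."""
    return Counter(text.lower().split())


def weighted_bag(doc, weights):
    """Merge per-section bags into one, scaling each section by its weight.

    `doc` maps section name -> text; `weights` maps section name -> float.
    Sections without an explicit weight default to 1.0, which reproduces
    the conventional flat bag of words.
    (Hypothetical helper, not taken from the paper.)
    """
    merged = Counter()
    for section, text in doc.items():
        w = weights.get(section, 1.0)
        for token, count in section_counts(text).items():
            merged[token] += w * count
    return dict(merged)


# Illustrative document with the two sections named in the abstract.
doc = {
    "title": "Wheat exports rise",
    "body": "Exports of wheat and corn rose this year",
}

flat = weighted_bag(doc, {})                    # every section weighted 1.0
weighted = weighted_bag(doc, {"title": 3.0})    # title terms count triple
```

With an empty weight map the result matches the flat bag-of-words representation, so the two feature dictionaries can be fed to the same classifier to compare the effect of the weighting.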
