Segmented Document Classification: Problem and Solution


Abstract

In recent years, structured text documents such as XML files have come to play an important role in Web-based applications. Some of these documents are divided into sections such as "title" and "body"; we call them "segmented documents". Segmented documents can be classified by treating them as bags of words and applying well-developed text classification models. However, different sections of a segmented document may influence the classification result differently, so it is better to treat them differently during classification. Following this idea, two algorithms, IN_MIX and OUT_MIX, are designed to label segmented documents with a trained classifier. We apply the algorithms with four frequently used models: SVM, Naive Bayes, regression, and instance-based classifiers. In experiments on Reuters-21578, the performance of the classification models improves compared with the conventional bag-of-words method.
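
The abstract does not describe how IN_MIX and OUT_MIX work, so the sketch below is not the authors' method. It only illustrates the general idea of giving sections different influence, by scaling each section's term counts before an instance-based (nearest-neighbour) comparison. The function names, the section weights, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def flat_bag_of_words(doc):
    """Conventional baseline: pool every section into one bag of words."""
    bag = Counter()
    for text in doc.values():
        bag.update(tokenize(text))
    return bag


def weighted_bag_of_words(doc, weights):
    """Scale each section's term counts by a per-section weight.

    The weights are hypothetical; sections without a weight default to 1.0.
    """
    bag = Counter()
    for section, text in doc.items():
        w = weights.get(section, 1.0)
        for term, count in Counter(tokenize(text)).items():
            bag[term] += w * count
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_label(query, labelled):
    """1-nearest-neighbour: label of the most similar training vector."""
    return max(labelled, key=lambda pair: cosine(query, pair[0]))[1]


# Toy segmented document and toy training vectors (illustrative only).
doc = {
    "title": "wheat exports",
    "body": "prices rose as trade in oil and oil futures grew",
}
weights = {"title": 3.0, "body": 1.0}  # hypothetical section weights
training = [
    (Counter({"wheat": 1, "grain": 1}), "grain"),
    (Counter({"oil": 1, "crude": 1}), "crude"),
]

flat_label = nearest_label(flat_bag_of_words(doc), training)
weighted_label = nearest_label(weighted_bag_of_words(doc, weights), training)
```

In this toy case the flat bag of words is dominated by the body text, while the weighted version lets the title decide the label. Whether such weighting helps on real data, and how the weights should be chosen, is exactly the kind of question the paper's experiments address.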


Bibliographic details

Source: Database and Expert Systems Applications; Lecture Notes in Computer Science; 4080 | 2006 | pp. 538-548 | 11 pages
Conference location: Krakow (PL)
Authors: Hang Guo; Lizhu Zhou
Affiliation: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Original format: PDF
Language: English
Chinese Library Classification: TP311.13

