Ontology-Based Mapping for Automated Document Management: A Concept-Based Technique for Word Mismatch and Ambiguity Problems in Document Clustering

YEN-HSIEN LEE; PAUL JEN-HWA HU; CHING-YI TU

Abstract

Document clustering is crucial to automated document management, especially for the fast-growing volume of textual documents available digitally. Traditional lexicon-based approaches depend on document content analysis and measure overlap of the feature vectors representing different documents, which cannot effectively address word mismatch or ambiguity problems. Alternative query expansion and local context discovery approaches are developed but suffer from limited efficiency and effectiveness, because the large number of expanded terms create noise and increase the dimensionality and complexity of the overall feature space. Several techniques extend lexicon-based analysis by incorporating latent semantic indexing but produce less comprehensible clustering results and questionable performance. We instead propose a concept-based document representation and clustering (CDRC) technique and empirically examine its effectiveness using 433 articles concerning information systems and technology, randomly selected from a popular digital library. Our evaluation includes two widely used benchmark techniques and shows that CDRC outperforms them. Overall, our results reveal that clustering documents at an ontology-based, concept-based level is more effective than techniques using lexicon-based document features and can generate more comprehensible clustering results.
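The word mismatch problem described in the abstract can be illustrated with a small sketch. This is not the authors' CDRC technique, whose details the abstract does not give. It is a toy comparison that assumes a hypothetical hand-written term-to-concept table (`TERM_TO_CONCEPT`) and uses plain cosine similarity. It shows that two documents sharing no words have zero lexical overlap, yet become nearly identical once their terms are mapped to shared concepts. Word ambiguity (one term mapping to several concepts) is not handled here.

```python
import math
from collections import Counter

# Hypothetical toy ontology: surface term -> concept label.
# Illustrative assumption only; it is not taken from the paper.
TERM_TO_CONCEPT = {
    "car": "vehicle",
    "automobile": "vehicle",
    "engine": "powertrain",
    "motor": "powertrain",
}


def term_vector(text):
    """Lexicon-based representation: raw term counts."""
    return Counter(text.lower().split())


def concept_vector(text):
    """Concept-based representation: counts of mapped concepts.

    Terms missing from the toy ontology are simply dropped.
    """
    return Counter(
        TERM_TO_CONCEPT[t] for t in text.lower().split() if t in TERM_TO_CONCEPT
    )


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = "car engine"
doc_b = "automobile motor"

# No shared words, so the lexical similarity is zero.
lexical = cosine(term_vector(doc_a), term_vector(doc_b))
# Both documents map to the same two concepts, so this is close to 1.
conceptual = cosine(concept_vector(doc_a), concept_vector(doc_b))

print(f"lexical={lexical:.2f} conceptual={conceptual:.2f}")
```

A clustering step would then operate on the concept vectors instead of the term vectors, which also keeps the feature space smaller than the vocabulary.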

Bibliographic details

Source
ACM Transactions on Management Information Systems | 2015, Issue 1 | pp. 4.1-4.22 | 22 pages
Authors
YEN-HSIEN LEE; PAUL JEN-HWA HU; CHING-YI TU
Author affiliations

Department of Management Information Systems, National Chiayi University;

Department of Operations and Information Systems, University of Utah;

CIM-Ⅱ Department, ASE Group Kaohsiung;

Original format: PDF
Language: English
Keywords
Document-category management; ontology-supported document clustering; document clustering; knowledge management;

Similar literature

1. DOCUMENT CLUSTERING USING CO-WORD ANALYSIS AND FORMATION OF KEYWORD AGAINST DOCUMENT MATRIX [J] . Document Clustering, Text Mining, Keyword Extraction, Journal of Theoretical and Applied Information Technology . 2014, Issue 3
2. An Approach to Improve Quality of Document Clustering by Word Set Based Documenting Clustering Algorithm [J] . Sandeep Sharma, Ruchi Dave, Naveen Hemrajani. Oriental journal of computer science and technology . 2011, Issue 2
3. Word-Sense Disambiguation for Ontology Mapping: Concept Disambiguation using Virtual Documents and Information Retrieval Techniques [J] . Frederik C. Schadd, Nico Roos. Journal on Data Semantics . 2015, Issue 3
4. A semantic ontology-based document organizer to cluster elearning documents [C] . Sara Alaee, Fattaneh Taghiyareh. International Conference on Web Research . 2016
5. Keywords in the mist: Automated keyword extraction for very large documents and back of the book indexing. [D] . Csomai, Andras. 2008
6. Clinical map document based on XML (cMDX): document architecture with mapping feature for reporting and analysing prostate cancer in radical prostatectomy specimens [O] . Okyaz Eminaga, Reemt Hinkelammert, Axel Semjonow, 2010
7. British Letters Patent of 1908 and 1917 constituting the Falkland Islands Dependencies. The following are the texts of the two Letters Patent defining the boundaries of the Falkland Islands Dependencies. They are reprinted here in view of the current political interest in this area. Some confusion has arisen owing to misrepresentation of the wording of these documents. The Letters Patent of 1908 made provision for the government of certain specified land areas lying between specified latitudes and longitudes. No claim was made to jurisdiction over the High Seas within these boundaries; still less was any claim made to that part of South America which lies to the south of latitude 50° S. The Letters Patent of 1917 defined the area more precisely in order to avoid this ambiguity. All subsequent British legislation for the administration of these Dependencies is based on the authority of these two documents. [O] . 1948
8. Automation Document Pseudoclassification and Retrieval by Word Frequency Techniques [R] . cameron,james slagle 1972
