Journal: ACM Transactions on Management Information Systems

Ontology-Based Mapping for Automated Document Management: A Concept-Based Technique for Word Mismatch and Ambiguity Problems in Document Clustering

Abstract

Document clustering is crucial to automated document management, especially given the fast-growing volume of textual documents available digitally. Traditional lexicon-based approaches depend on document content analysis and measure the overlap of the feature vectors representing different documents, so they cannot effectively address word mismatch or ambiguity problems. Alternative query expansion and local context discovery approaches have been developed but suffer from limited efficiency and effectiveness, because the large number of expanded terms creates noise and increases the dimensionality and complexity of the overall feature space. Several techniques extend lexicon-based analysis by incorporating latent semantic indexing, but they produce less comprehensible clustering results and their performance is questionable. We instead propose a concept-based document representation and clustering (CDRC) technique and empirically examine its effectiveness using 433 articles concerning information systems and technology, randomly selected from a popular digital library. Our evaluation includes two widely used benchmark techniques and shows that CDRC outperforms them. Overall, our results reveal that clustering documents at an ontology-based concept level is more effective than techniques using lexicon-based document features and can generate more comprehensible clustering results.
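
The word mismatch problem described in the abstract can be illustrated with a small sketch. This is not the CDRC technique itself, whose details are not given here. It is only a toy comparison, assuming a hand-made term-to-concept dictionary (`TERM_TO_CONCEPT`, hypothetical) and cosine similarity as the overlap measure. The helper names `term_vector` and `concept_vector` are likewise illustrative.

```python
import math
from collections import Counter

# Hypothetical toy ontology mapping surface terms to concept labels.
# This mapping is an assumption for illustration, not taken from the paper.
TERM_TO_CONCEPT = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "repair": "MAINTENANCE",
    "servicing": "MAINTENANCE",
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def term_vector(text):
    """Lexicon-based representation: raw term counts."""
    return Counter(text.lower().split())

def concept_vector(text):
    """Concept-based representation: terms replaced by their concept label.
    Terms missing from the toy ontology are kept unchanged."""
    return Counter(TERM_TO_CONCEPT.get(t, t) for t in text.lower().split())

doc_a = "car repair"
doc_b = "automobile servicing"

# The two documents share no terms, so the lexical overlap is zero.
lexical_sim = cosine(term_vector(doc_a), term_vector(doc_b))

# After mapping to concepts, both documents have the same representation.
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print(lexical_sim, concept_sim)
```

In this toy case the two documents share no terms, so the lexicon-based similarity is zero, while the concept-level vectors coincide and the similarity is close to one. A real ontology would also need to handle ambiguous terms that map to more than one concept, which this sketch ignores.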
