A Scalable System for Identifying Co-derivative Documents



Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype system that makes use of SPEX. Our experiments with several document collections demonstrate the effectiveness of the approach.
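As a rough illustration of the general fingerprinting idea mentioned in the abstract (matching documents by the hash values of chunks), the sketch below hashes every run of four consecutive words and counts the chunk hashes each pair of documents shares. It is not the SPEX algorithm or the DECO system, whose details are not given here; the function names, the word-based chunking, the chunk length of 4, and the use of SHA-1 are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunk_hashes(text, chunk_len=4):
    """Return the set of hashes of every run of `chunk_len` consecutive words.

    Word-level chunks and SHA-1 are illustrative assumptions, not choices
    taken from the paper.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - chunk_len + 1):
        chunk = " ".join(words[i:i + chunk_len])
        hashes.add(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return hashes


def shared_chunk_counts(docs, chunk_len=4):
    """Map each pair of document ids to the number of distinct chunk hashes they share."""
    postings = defaultdict(set)  # chunk hash -> ids of documents containing it
    for doc_id, text in docs.items():
        for h in chunk_hashes(text, chunk_len):
            postings[h].add(doc_id)

    counts = defaultdict(int)
    for doc_ids in postings.values():
        # Every pair of documents in the same posting set shares this chunk.
        for pair in combinations(sorted(doc_ids), 2):
            counts[pair] += 1
    return dict(counts)


# Hypothetical toy collection: "b" reuses a passage from "a"; "c" is unrelated.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "yesterday the quick brown fox jumps over a fence",
    "c": "an unrelated sentence about string processing and retrieval",
}
pairs = shared_chunk_counts(docs)
print(pairs)  # {('a', 'b'): 3}
```

Pairs with many shared chunks are candidates for being co-derivative. A practical system would also need to decide which chunks to keep and how to threshold the counts, which this sketch leaves out.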


Bibliographic details

Source
International Conference on String Processing and Information Retrieval (SPIRE 2004); 5-8 October 2004; Padova, Italy | 2004 | pp. 55-67 | 13 pages
Conference location: Padova, Italy
Authors
Yaniv Bernstein; Justin Zobel
Author affiliation

School of Computer Science and Information Technology, RMIT University, Melbourne, Australia

Original format: PDF
Language: English
Chinese Library Classification: Data backup and recovery

