A Scalable System for Identifying Co-derivative Documents


Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present SPEX, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe DECO, a prototype system that makes use of SPEX. Our experiments with several document collections demonstrate the effectiveness of the approach.
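As a rough illustration of the fingerprinting idea described above (matching documents on the hash values of chunks), the following minimal sketch indexes overlapping word chunks by hash and keeps only those that occur in more than one document. It is not the SPEX algorithm or the DECO system, whose details are not given in the abstract. The function names, the 3-word chunk size, the truncated SHA-1 hash, and the sample documents are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def chunks(text, size=3):
    """Yield overlapping word chunks of `size` words (assumed chunking scheme)."""
    words = text.lower().split()
    for i in range(len(words) - size + 1):
        yield " ".join(words[i:i + size])


def chunk_hash(chunk):
    """Hash a chunk to a 16-character hex string (truncated SHA-1, an assumption)."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:16]


def shared_chunks(docs, size=3):
    """Map each chunk hash to the documents containing it,
    keeping only hashes that appear in at least two documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for chunk in chunks(text, size):
            index[chunk_hash(chunk)].add(doc_id)
    return {h: ids for h, ids in index.items() if len(ids) > 1}


def candidate_pairs(docs, size=3):
    """Count the shared chunk hashes for every pair of documents."""
    counts = defaultdict(int)
    for ids in shared_chunks(docs, size).values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                counts[(a, b)] += 1
    return dict(counts)


# Hypothetical sample collection
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "a quick brown fox jumps over a sleeping cat",
    "c": "completely unrelated text about databases",
}
pairs = candidate_pairs(docs)
print(pairs)
```

Pairs with a high shared-chunk count are candidates for being co-derivative; a real system would also need to decide which chunks to select and how to group matching pairs into clusters.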
