Online system for detection of Chinese near-duplicate documents

Abstract

Information retrieval systems, search engines, and some data-mining systems all face an unavoidable task: detecting duplicate and near-duplicate documents rapidly and at large scale. Too many duplicates harm a system in several ways; for example, they reduce computational performance and degrade the user experience. Moreover, when the number of documents grows dynamically, a different approach is needed. This paper aims to construct a practical online detection system based on simhash fingerprint extraction. Our contribution is a system that runs online: the number of documents is not known in advance, and the system can accept new documents at any time. This requires both efficiency and flexibility, and we propose a solution designed to meet both requirements.
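
As a rough illustration of the idea described in the abstract, the following is a minimal sketch of simhash fingerprinting combined with an incremental store, not the authors' implementation. Several choices here are assumptions made for the example: character bigrams stand in for Chinese word segmentation, MD5 is used as the per-feature hash, the fingerprint is 64 bits wide, and the names `features`, `simhash`, `hamming`, and `OnlineDetector` as well as the threshold of 3 differing bits are hypothetical.

```python
import hashlib


def features(text, n=2):
    """Character n-grams; a segmentation-free stand-in for Chinese tokens (assumption)."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def simhash(text, bits=64):
    """Compute a simhash fingerprint: each hashed feature votes +1/-1 on every bit."""
    votes = [0] * bits
    for feat in features(text):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its vote total is positive.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class OnlineDetector:
    """Accepts documents one at a time; the corpus size need not be known in advance."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical cutoff, in differing bits
        self.fingerprints = {}      # doc_id -> fingerprint

    def add(self, doc_id, text):
        """Return ids of stored near-duplicates, then store the new document."""
        fp = simhash(text)
        matches = [d for d, f in self.fingerprints.items()
                   if hamming(fp, f) <= self.threshold]
        self.fingerprints[doc_id] = fp
        return matches
```

The linear scan in `add` keeps the sketch short; a system handling large, growing collections would presumably replace it with an index over the fingerprints so that each new document is compared against only a small candidate set.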