Online system for detection of Chinese near-duplicate documents

Abstract

Information retrieval systems, search engines, and some data-mining systems all face one unavoidable task: detecting duplicate and near-duplicate documents rapidly and at large scale. Too many duplicates harm a system in several ways; for example, they reduce computational performance and degrade the user experience. Moreover, when the number of documents grows dynamically, a different approach is needed. This paper aims to construct a practical online detection system based on simhash fingerprint extraction. Our contribution is a system that runs online, meaning that the number of documents is not known in advance and the system can accept new documents at any time. This requires both efficiency and flexibility, and we propose a solution that meets both requirements.
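
The abstract names simhash fingerprints and (in the keywords) Hamming distance, but it does not give the paper's actual design. The following is only a minimal sketch of the general technique, under several assumptions that are not from the paper: character bigrams as features (to avoid Chinese word segmentation), MD5 as the feature hash, a 64-bit fingerprint, a Hamming threshold of 3, and a linear scan over stored fingerprints. The names `char_ngrams`, `simhash`, `hamming`, and `OnlineDetector` are hypothetical.

```python
import hashlib


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumed feature set)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, bits=64):
    """Compute a simhash fingerprint: each feature votes +1/-1 on every bit."""
    counts = [0] * bits
    for gram in char_ngrams(text):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class OnlineDetector:
    """Accepts documents one at a time; the total count is never needed."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed cutoff, not from the paper
        self.fingerprints = {}      # doc_id -> fingerprint

    def add(self, doc_id, text):
        """Store the document and return ids of earlier near-duplicates."""
        fp = simhash(text)
        matches = [
            other
            for other, other_fp in self.fingerprints.items()
            if hamming(fp, other_fp) <= self.threshold
        ]
        self.fingerprints[doc_id] = fp
        return matches


if __name__ == "__main__":
    detector = OnlineDetector()
    print(detector.add("a", "今天天气很好，我们去公园散步。"))
    print(detector.add("b", "今天天气很好，我们去公园散步。"))
```

The linear scan keeps the sketch short but costs one comparison per stored document on every insert. A system built for large collections would more likely index the fingerprints so that only a small candidate set is compared; how the paper handles this is not stated in the abstract.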


Bibliographic details

Source: 2012, pp. 726-731 (6 pages)
Conference location: Taipei(CT)
Authors: Yang Yang; YuQuan Chen
Author affiliation: Dept. of Computer Science Engineering, Shanghai Jiao Tong University, China
Original format: PDF
Language: English
Keywords: Chinese document; Hamming distance; near-duplicate; online detection; simhash


Similar literature


1. Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection [J]. Phuc-Tran Ho, Sung-Ryul Kim. International Journal of Distributed Sensor Networks. 2014, No. 3.
2. Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL [J]. Ercan Canhasi. Journal of computer sciences. 2018, No. 5.
3. Evaluating the Efficiency of CPUs, GPUs and FPGAs on a Near-Duplicate Document Detection Via OpenCL [J]. Canhasi Ercan. Journal of computer sciences. 2018, No. 5.
4. Online System for Detection of Chinese Near-Duplicate Documents [C]. Yang Yang, YuQuan Chen. ISSDM 2012. 2012.
5. Mining the intangible past of Virginia City's Chinese pioneers: Using historical geographic information system (HGIS) to document, visualize and interpret the spatial history of Chinese in a Montana mining camp (CA 1863 - mid-20th century) [D]. Yang, Cheng. 2011.
6. High-density linkage mapping aided by transcriptomics documents ZW sex determination system in the Chinese mitten crab Eriocheir sinensis [O]. Z Cui, M Hui, Y Liu. 2015.
7. XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning [O]. Pamulaparty Lavanya, Guru Rao C.V., Rao M. Sreenivasa. 2015.
