Recent Patents on Computer Science

A Duplicate Data Detection Approach Based on MapReduce and HDFS



Abstract

Background: With the surge in the volume of collected data, deduplication will undoubtedly become one of the problems faced by researchers. Deduplicating coarse-grained redundant data yields significant savings in storage, network bandwidth, and system scalability. However, the conventional methods of deleting duplicate data, hash comparison and binary differential incremental backup, lead to several bottlenecks when processing large-scale data. Moreover, the traditional Simhash similarity method gives little consideration to the natural similarity of text in certain fields and cannot run efficiently as a parallel program over large-scale text data. This paper examines several of the most important patents in the area of duplicate data detection, and then focuses on large-scale data deduplication based on MapReduce and HDFS.

Methods: We propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the Shared Nearest Neighbor (SNN) algorithm, and we explain our distributed duplicate detection workflow. The important technical advantages of the invention include generating a checksum for each processed record and comparing the generated checksums to detect duplicate records. The approach produces fingerprints of short texts with the Simhash similarity algorithm, clusters the fingerprint results with the SNN algorithm, and implements the whole parallel process with the MapReduce programming model.

Results: The experimental results show that the proposed approach obtains MapReduce job schedules with significantly less execution time, making it suitable for processing large-scale datasets in real applications, and that it achieves better performance and efficiency.

Conclusion: In this patent, we propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the SNN algorithm. The results show that the new approach, applied on MapReduce, is well suited to document similarity calculation over large-scale datasets: it greatly reduces the time overhead, attains higher precision and recall, and provides a reference for solving the same problem at large scale. The invention also applies to large-scale duplicate data detection and offers a good solution to the large-scale data processing problem. In the future, we plan to design and implement a scheduler for MapReduce jobs and a new similarity algorithm, with a primary focus on large-scale duplicate data detection.
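The abstract outlines the pipeline (Simhash fingerprints, fingerprint comparison, SNN clustering, MapReduce parallelism) but includes no code, so the sketch below is only a minimal, illustrative Python rendering of the Simhash fingerprinting and Hamming-distance comparison steps. Every identifier here (simhash, hamming_distance, K, the sample docs) is an assumption made for the example, not taken from the patent, and the distributed MapReduce/SNN parts are not reproduced.

```python
import hashlib
from itertools import combinations

def simhash(tokens, bits=64):
    """Compute a Simhash fingerprint: hash each token, then for each bit
    position accumulate +1/-1 votes across all tokens; the sign of each
    accumulated vote determines the corresponding fingerprint bit."""
    votes = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Illustrative threshold: fingerprints within K bits are treated as
# near-duplicates (K = 3 is a common choice for 64-bit Simhash).
K = 3

# Hypothetical sample records for demonstration only.
docs = {
    "d1": "mapreduce splits large scale data across workers",
    "d2": "mapreduce splits large-scale data across the workers",
    "d3": "an entirely unrelated sentence about gardening",
}
prints = {doc_id: simhash(text.split()) for doc_id, text in docs.items()}
for (id1, f1), (id2, f2) in combinations(prints.items(), 2):
    if hamming_distance(f1, f2) <= K:
        print(f"{id1} and {id2} look like near-duplicates")
```

In the workflow the abstract describes, a map phase could emit such a fingerprint (or checksum) per record and a reduce phase could group nearby fingerprints for SNN clustering; the threshold K would trade recall against precision. This sketch leaves out that distributed stage entirely.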
