Recent Patents on Computer Science

A Duplicate Data Detection Approach Based on MapReduce and HDFS



Abstract

Background: With the surge in the volume of collected data, deduplication will undoubtedly become one of the problems faced by researchers. Deduplicating coarse-grained redundant data yields significant savings in storage, network bandwidth, and system scalability. However, the conventional methods of deleting duplicate data, hash comparison and binary differential incremental backup, lead to several bottlenecks when processing large-scale data. Moreover, the traditional Simhash similarity method gives little consideration to the natural similarity of text in certain fields and cannot run efficiently as a parallel program over large-scale text data. This paper examines several of the most important patents in the area of duplicate data detection, and then focuses on large-scale data deduplication based on MapReduce and HDFS.

Methods: We propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the Shared Nearest Neighbor (SNN) algorithm, and we explain our distributed duplicate detection workflow. The important technical advantages of the invention include generating a checksum for each processed record and comparing the generated checksums to detect duplicate records. The approach produces fingerprints of short texts with the Simhash similarity algorithm, clusters the fingerprint results with the SNN algorithm, and implements the whole parallel process with the MapReduce programming model.

Results: The experimental results show that the proposed approach obtains MapReduce job schedules with significantly less execution time, making it suitable for processing large-scale datasets in real applications, and that it achieves better performance and efficiency.

Conclusion: In this patent, we propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the SNN algorithm. The results show that the new approach, applied on MapReduce, is well suited to document similarity calculation over large-scale datasets: it greatly reduces the time overhead, attains higher precision and recall, and provides a reference for solving the same problem at large scale. The invention also applies to large-scale duplicate data detection and offers a good solution to the large-scale data processing problem. In the future, we plan to design and implement a scheduler for MapReduce jobs and a new similarity algorithm, with a primary focus on large-scale duplicate data detection.
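The abstract outlines the pipeline (Simhash fingerprints, fingerprint comparison, SNN clustering, MapReduce parallelism) but includes no code, so the sketch below is only a minimal, illustrative Python rendering of the Simhash fingerprinting and Hamming-distance comparison steps. Every identifier here (simhash, hamming_distance, K, the sample docs) is an assumption made for the example, not taken from the patent, and the distributed MapReduce/SNN parts are not reproduced.

```python
import hashlib
from itertools import combinations

def simhash(tokens, bits=64):
    """Compute a Simhash fingerprint: hash each token, then for each bit
    position accumulate +1/-1 votes across all tokens; the sign of each
    accumulated vote determines the corresponding fingerprint bit."""
    votes = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Illustrative threshold: fingerprints within K bits are treated as
# near-duplicates (K = 3 is a common choice for 64-bit Simhash).
K = 3

# Hypothetical sample records for demonstration only.
docs = {
    "d1": "mapreduce splits large scale data across workers",
    "d2": "mapreduce splits large-scale data across the workers",
    "d3": "an entirely unrelated sentence about gardening",
}
prints = {doc_id: simhash(text.split()) for doc_id, text in docs.items()}
for (id1, f1), (id2, f2) in combinations(prints.items(), 2):
    if hamming_distance(f1, f2) <= K:
        print(f"{id1} and {id2} look like near-duplicates")
```

In the workflow the abstract describes, a map phase could emit such a fingerprint (or checksum) per record and a reduce phase could group nearby fingerprints for SNN clustering; the threshold K would trade recall against precision. This sketch leaves out that distributed stage entirely.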
