
Parallel sampling from big data with uncertainty distribution



Abstract

Data are inherently uncertain in most applications. Uncertainty arises whenever an experiment such as sampling is carried out: its result is not known in advance, and a variety of outcomes are possible. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues in big data research. In this paper, we propose PSHS, a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, which adopts a universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our approach to handling uncertainty in sampling from big data is motivated by three observations: (1) the inherent structure of the original sample set is unknown to us, (2) the boundary set formed by all possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are themselves uncertain. PSHS is implemented on the MapReduce framework, a current and powerful parallel programming technique used in many fields. Experiments were carried out on several data sets, including real-world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
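The abstract names speedup, scaleup and sizeup without defining them. The sketch below uses the definitions commonly found in the parallel data mining literature; it is an assumption that the paper uses the same ones. The function names and the timings are hypothetical and invented purely for illustration; they are not results from the paper.

```python
def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same data set, m times the nodes: T(1 node) / T(m nodes).

    A perfectly parallel algorithm would give a value of m.
    """
    return t_one_node / t_m_nodes


def scaleup(t_base: float, t_scaled: float) -> float:
    """m times the data on m times the nodes: T(1 node, D) / T(m nodes, m*D).

    A value close to 1 means the run time stays flat as both grow together.
    """
    return t_base / t_scaled


def sizeup(t_base_data: float, t_k_times_data: float) -> float:
    """Fixed number of nodes, k times the data: T(k*D) / T(D).

    A value close to k means run time grows about linearly with data size.
    """
    return t_k_times_data / t_base_data


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds (illustration only).
    print("speedup on 4 nodes:", speedup(120.0, 40.0))
    print("scaleup at 4x nodes and 4x data:", scaleup(100.0, 125.0))
    print("sizeup at 4x data:", sizeup(50.0, 200.0))
```

Under these definitions, speedup is read against the number of nodes, scaleup against 1, and sizeup against the data multiplier k.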

Bibliographic record

  • Source
    Fuzzy Sets and Systems | 2015, Issue 1 | pp. 117-133 | 17 pages
  • Author affiliations

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China, University of Chinese Academy of Sciences, Beijing 100049, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China, University of Chinese Academy of Sciences, Beijing 100049, China;

    Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

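  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords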

    Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce;

