Parallel sampling from big data with uncertainty distribution

Qing He; Haocheng Wang; Fuzhen Zhuang; Tianfeng Shang; Zhongzhi Shi


Abstract

Data are inherently uncertain in most applications. Uncertainty arises whenever an experiment such as sampling is carried out: its result is not known in advance, and a variety of outcomes are possible. With the rapid development of data collection and distributed storage technologies, big data have become a larger problem than ever, and handling big data with uncertainty distribution is one of the most important issues in big data research. In this paper, we propose PSHS, a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, which adopts the general concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our approach to handling uncertainty in sampling from big data is motivated by three observations: (1) the inherent structure of the original sample set is uncertain to us, (2) the boundary set formed by all possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are themselves uncertain. PSHS is implemented on the MapReduce framework, a widely used and powerful parallel programming technique. Experiments were carried out on several data sets, including real-world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
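The abstract does not spell out how PSHS works, so the following is only a loose illustration of the general idea: shrinking a labeled data set with map and reduce steps while keeping one representative per local region and class. It is a minimal Python sketch under stated assumptions. A fixed grid over features scaled to [0, 1) stands in for the hyper surface construction, the shuffle phase is simulated in memory rather than run on a MapReduce cluster, and the names `map_sample`, `reduce_group`, `grid_sample` and `cells_per_axis` are hypothetical. It is not the authors' algorithm.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

# A sample is (features, class label); features are assumed scaled to [0, 1).
Sample = Tuple[Sequence[float], Hashable]
Key = Tuple[Tuple[int, ...], Hashable]


def map_sample(sample: Sample, cells_per_axis: int) -> Tuple[Key, Sample]:
    """Map step: key each sample by its grid cell and its class label."""
    features, label = sample
    # Clamp to the last cell so a feature equal to 1.0 stays inside the grid.
    cell = tuple(
        min(int(x * cells_per_axis), cells_per_axis - 1) for x in features
    )
    return (cell, label), sample


def reduce_group(samples: List[Sample]) -> Sample:
    """Reduce step: keep one representative per (cell, label) group.

    Assumption: the smallest feature vector is chosen only to make the
    result deterministic; any member of the group would serve here.
    """
    return min(samples, key=lambda s: tuple(s[0]))


def grid_sample(data: Iterable[Sample], cells_per_axis: int = 4) -> List[Sample]:
    """Run map, an in-memory shuffle, and reduce over the whole data set."""
    groups: Dict[Key, List[Sample]] = defaultdict(list)
    for sample in data:
        key, value = map_sample(sample, cells_per_axis)
        groups[key].append(value)  # simulated shuffle: group values by key
    return [reduce_group(group) for group in groups.values()]


if __name__ == "__main__":
    data = [
        ((0.10, 0.10), "a"),
        ((0.12, 0.11), "a"),
        ((0.15, 0.20), "b"),
        ((0.90, 0.90), "b"),
        ((0.95, 0.92), "b"),
    ]
    for features, label in grid_sample(data, cells_per_axis=4):
        print(label, features)
```

Because the key combines the cell with the class label, samples of different classes that fall in the same cell are never merged, so every occupied region keeps at least one sample of each class it contains. A finer grid (larger `cells_per_axis`) keeps more samples; a coarser one shrinks the data set further.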


Bibliographic details

Source
Fuzzy Sets and Systems, 2015, Issue 1, pp. 117-133 (17 pages)
Authors
Qing He; Haocheng Wang; Fuzhen Zhuang; Tianfeng Shang; Zhongzhi Shi
Affiliations

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China;

Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Keywords
Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce

