Journal: Knowledge and Information Systems

Model-based probabilistic frequent itemset mining



Abstract

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. Uncertain databases have recently been developed to handle large amounts of imprecise information. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose novel methods that capture the itemset mining process as a probability distribution function, taking two models into account: the Poisson distribution and the normal distribution. These model-based approaches extract frequent itemsets with a high degree of accuracy and support large databases. We apply our techniques to improve the performance of algorithms for (1) finding itemsets whose frequentness probabilities are larger than some threshold and (2) mining the itemsets with the k highest frequentness probabilities. Our approaches support both the tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate and four orders of magnitude faster than previous approaches. In further theoretical and experimental studies, we give an intuition as to which model-based approach best fits different types of data sets.
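
The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a tuple-uncertainty setting in which transaction i contains a given itemset independently with probability p_i, so that the support count follows a Poisson binomial distribution. The frequentness probability Pr[support >= minsup] is computed exactly by dynamic programming and then approximated with a Poisson model and a normal model. The function names, the continuity correction, and the sample data are illustrative assumptions.

```python
import math


def exact_frequentness(probs, minsup):
    """Exact Pr[support >= minsup] by dynamic programming, O(n^2) time."""
    # dist[k] = probability that exactly k transactions seen so far
    # contain the itemset.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += q * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[minsup:])


def poisson_frequentness(probs, minsup):
    """Poisson approximation with rate lam = expected support."""
    lam = sum(probs)
    # Accumulate Pr[Poisson(lam) <= minsup - 1] term by term.
    # Note: math.exp(-lam) underflows to 0.0 for very large lam.
    term = math.exp(-lam)
    cdf = 0.0
    for k in range(minsup):
        cdf += term
        term *= lam / (k + 1)
    return 1.0 - cdf


def normal_frequentness(probs, minsup):
    """Normal approximation with a continuity correction (an assumption)."""
    mu = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    if var == 0.0:
        # Degenerate case: the support is deterministic.
        return 1.0 if mu >= minsup else 0.0
    z = (minsup - 0.5 - mu) / math.sqrt(var)
    # Upper tail of the standard normal: 1 - Phi(z) = erfc(z / sqrt(2)) / 2.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical data: 1000 transactions, each containing the itemset
    # with a small probability (the regime where a Poisson model fits well).
    probs = [0.01] * 1000
    minsup = 10
    print("exact  ", exact_frequentness(probs, minsup))
    print("poisson", poisson_frequentness(probs, minsup))
    print("normal ", normal_frequentness(probs, minsup))
```

Both approximations need only one pass over the probabilities plus a closed-form tail evaluation, whereas the exact computation is quadratic in the number of transactions. The result can be compared against a threshold (task 1) or used to rank itemsets (task 2).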
