Exploiting non-redundant local patterns and probabilistic models for analyzing structured and semi-structured data.

Abstract

This work seeks to develop a probabilistic framework for modeling, querying, and analyzing large-scale structured and semi-structured data. The framework has three components: (1) mining non-redundant local patterns from data; (2) gluing these local patterns together by employing probabilistic models (e.g., Markov random fields (MRFs), Bayesian networks); and (3) reasoning (making inferences) over the data to solve various data analysis tasks. In more detail, our contributions are as follows:

Mining non-redundant frequent itemset patterns on large transactional data. In many real-world problems, frequent pattern mining algorithms yield so many frequent patterns that the end user is swamped when interpreting the results. We present an approach that employs probabilistic models to identify non-redundant itemset patterns in a large collection of frequent itemsets on transactional data. We show that our approach can effectively eliminate a large amount of redundancy from a large collection of itemset patterns.

Employing local probabilistic models to glue non-redundant itemset patterns on large transactional or network data. We propose a technique that employs local probabilistic models to glue non-redundant itemset patterns together in order to tackle the link prediction task in co-authorship network analysis. The new technique effectively combines topology analysis on network structure data with frequency analysis on network event log data. The main idea is to consider the co-occurrence probability of the two end nodes associated with a candidate link. We propose a method of building MRFs over local data regions to compute this co-occurrence probability. Experimental results demonstrate that the co-occurrence probability inferred from the local probabilistic models is very useful for link prediction.

Employing global probabilistic models to glue non-redundant itemset patterns on large transactional data. We explore employing global models, that is, models over large data regions, to glue non-redundant itemset patterns together. To this end, we investigate learning approximate global MRFs on large transactional data and propose a divide-and-conquer style modeling approach. An empirical study shows that the models are effective in modeling the data and in approximately answering queries on the data.

Mining non-redundant tree patterns and employing probabilistic approaches to glue them on large XML data. We propose a technique for identifying non-redundant tree patterns in a large collection of structural tree patterns. We show that our approach can effectively eliminate redundancies from a large collection of structural tree patterns. Furthermore, we present techniques that employ these non-redundant tree patterns as summary statistics for the XML data to solve the XML twig selectivity estimation problem. We propose a probabilistic framework under which the selectivity of a twig query can be estimated from the information of its subtrees. Empirical results demonstrate the efficacy of our approach on real and synthetic datasets.
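
The idea of estimating a twig query's selectivity from its subtrees can be illustrated with a minimal sketch. The abstract does not give the estimation formula, so the code below assumes one common choice: two subtwigs are treated as conditionally independent given the part they share, so that count(T1 ∪ T2) ≈ count(T1) · count(T2) / count(T1 ∩ T2). The representation of a twig as a set of root-to-node label paths, the helper names `twig` and `estimate_count`, and the counts in `summary` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Tuple

Path = Tuple[str, ...]   # root-to-node label path, e.g. ("book", "author")
Twig = FrozenSet[Path]   # a twig as the set of its root-to-node paths


def twig(*paths: str) -> Twig:
    """Build a twig from slash-separated paths, adding every ancestor prefix."""
    nodes = set()
    for p in paths:
        labels = tuple(p.split("/"))
        for i in range(1, len(labels) + 1):
            nodes.add(labels[:i])
    return frozenset(nodes)


def estimate_count(t1: Twig, t2: Twig, summary: Dict[Twig, float]) -> float:
    """Estimate the match count of the twig t1 | t2 from stored subtwig counts.

    Assumption (not taken from the abstract): t1 and t2 are conditionally
    independent given their shared part, so
        count(t1 | t2) ~= count(t1) * count(t2) / count(t1 & t2).
    """
    shared = t1 & t2
    if not shared:
        raise ValueError("the two subtwigs must share at least the root")
    if summary[shared] == 0:
        return 0.0  # no match of the shared part means no match of the whole
    return summary[t1] * summary[t2] / summary[shared]


# Illustrative summary statistics (made-up counts, not from any dataset).
summary = {
    twig("book"): 100,
    twig("book/author"): 80,
    twig("book/price"): 50,
}

# Estimate the count for the twig book[author][price].
est = estimate_count(twig("book/author"), twig("book/price"), summary)
print(est)  # 40.0
```

In this sketch the summary plays the role of the stored tree patterns: only small twigs carry exact counts, and larger queries are answered approximately by combining them.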

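Bibliographic record

  • Author: Wang, Chao
  • Author affiliation: The Ohio State University
  • Degree-granting institution: The Ohio State University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 166 p.
  • Total pages: 166
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology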