Information integration in a grid environment Applications in the bioinformatics domain.

Abstract

Grid computing emerged as a framework for supporting complex operations over large datasets; it enables the harnessing of large numbers of processors working in parallel to solve computing problems that typically spread across various domains. We focus on the problems of data management in a grid/cloud environment.

The broader context of designing a service-oriented architecture (SOA) for information integration is studied, identifying the main components for realizing this architecture. The BioFederator is a web services-based data federation architecture for bioinformatics applications. Based on collaborations with bioinformatics researchers, several domain-specific data federation challenges and needs are identified. The BioFederator addresses these challenges and provides an architecture that incorporates a series of utility services; these address issues such as automatic workflow composition, domain semantics, and the distributed nature of the data. The design also incorporates a series of data-oriented services that facilitate the actual integration of data. Schema integration is a core problem in the BioFederator context. Previous methods for schema integration rely on the exploration, implicit or explicit, of the multiple design choices that are possible for the integrated schema. Such exploration relies heavily on user interaction; thus, it is time consuming and labor intensive. Furthermore, previous methods have ignored the additional information that typically results from the schema matching process, that is, the weights and in some cases the directions that are associated with the correspondences. We propose a more automatic approach to schema integration that is based on the use of directed and weighted correspondences between the concepts that appear in the source schemas. A key component of our approach is a ranking mechanism for the automatic generation of the best candidate schemas. The algorithm gives more weight to schemas that combine the concepts with higher similarity or coverage. Thus, the algorithm makes certain decisions that would otherwise likely be taken by a human expert. We show that the algorithm runs in polynomial time and moreover has good performance in practice. The proposed methods and algorithms are compared to state-of-the-art approaches. The BioFederator design, services, and usage scenarios are discussed. We demonstrate how our architecture can be leveraged in real-world bioinformatics applications. We performed a whole human genome annotation for nucleosome exclusion regions. The resulting annotations were studied and correlated with tissue specificity, gene density, and other important gene regulation features.

We also study data processing models in grid environments. MapReduce is one popular parallel programming model that is proven to scale. However, using low-level MapReduce for general data processing tasks poses the problem of developing, maintaining, and reusing custom low-level user code. Several frameworks have emerged to address this problem; these frameworks share a top-down approach, where a high-level language is used to describe the problem semantics, and the framework takes care of translating this problem description into MapReduce constructs. We highlight several issues in the existing approaches and propose instead a novel refined MapReduce model that addresses the maintainability and reusability issues without sacrificing the low-level controllability offered by directly writing MapReduce code. We present MapReduce-LEGOS (MR-LEGOS), an explicit model for composing MapReduce constructs from simpler components, namely "Maplets", "Reducelets", and optionally "Combinelets". Maplets and Reducelets are standard MapReduce constructs that can be composed to define aggregated constructs describing the problem semantics. This composition can be viewed as defining a micro-workflow inside the MapReduce job. Using the proposed model, complex problem semantics can be defined in the encompassing micro-workflow provided by MR-LEGOS while keeping the building blocks simple. We discuss the design details, main features, and usage scenarios of the model. Through experimental evaluation, we show that the proposed design is highly scalable and has good performance in practice.

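Bibliographic record

  • Author

    Radwan, Ahmed M.

  • Author affiliation

    University of Miami

  • Degree-granting institution: University of Miami
  • Subject: Engineering Computer; Biology Bioinformatics; Engineering Electronics and Electrical
  • Degree: Ph.D.
  • Year: 2010
  • Pagination: 179 p.
  • Total pages: 179
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification:
  • Keywords: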
