
Interpretation, Identification and Reuse of Models. Theory and algorithms with applications in predictive toxicology.


Abstract

This thesis is concerned with developing methodologies that enable existing models to be effectively reused. Results of this thesis are presented in the framework of Quantitative Structure-Activity Relationship (QSAR) models, but their application is much more general. QSAR models relate chemical structures to their biological, chemical or environmental activity. There are many applications that offer an environment to build and store predictive models. Unfortunately, they do not provide advanced functionalities that allow for efficient model selection and for interpretation of model predictions for new data. This thesis aims to address these issues and proposes methodologies for dealing with three research problems: model governance (management), model identification (selection), and interpretation of model predictions. The combination of these methodologies can be employed to build more efficient systems for model reuse in QSAR modelling and other areas.

The first part of this study investigates toxicity data and model formats and reviews some of the existing toxicity systems in the context of model development and reuse. Based on the findings of this review and the principles of data governance, a novel concept of model governance is defined. Model governance comprises model representation and model governance processes. These processes are designed and presented in the context of model management. As an application, minimum information requirements and an XML representation for QSAR models are proposed.

Once a collection of validated, accepted and well-annotated models is available within a model governance framework, the models can be applied to new data. More than one model may be available for the same endpoint. Which one should be chosen?
The second part of this thesis proposes a theoretical framework and algorithms that enable automated identification of the most reliable model for new data from the collection of existing models. The main idea is based on partitioning the search space into groups and assigning a single model to each group. The construction of this partitioning is difficult because it is a bi-criteria problem. The main contribution in this part is the application of Pareto points to the search space partition. The proposed methodology is applied to three endpoints in chemoinformatics and predictive toxicology.

After having identified a model for the new data, we would like to know how the model obtained its prediction and how trustworthy it is. An interpretation of model predictions is straightforward for linear models thanks to the availability of model parameters and their statistical significance. For nonlinear models this information can be hidden inside the model structure. This thesis proposes an approach for interpretation of a random forest classification model. This approach allows for the determination of the influence (called feature contribution) of each variable on the model prediction for an individual data instance. In this part, three methods are proposed that allow analysis of feature contributions. Such analysis might lead to the discovery of new patterns that represent a standard behaviour of the model and allow additional assessment of the model reliability for new data. The application of these methods to two standard benchmark datasets from the UCI machine learning repository shows the great potential of this methodology. The algorithm for calculating feature contributions has been implemented and is available as an R package called rfFC.
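
To illustrate what "Pareto points" means in a bi-criteria setting, the following is a minimal sketch and not the algorithm from the thesis. It assumes, hypothetically, that each candidate partition is scored by a pair of numbers that should both be as small as possible (for example a prediction error and a complexity measure; the abstract does not name the two criteria). The function name `pareto_front` and the sample scores are illustrative only.

```python
from typing import List, Tuple

Score = Tuple[float, float]  # (criterion_1, criterion_2), both to be minimised (assumption)


def pareto_front(points: List[Score]) -> List[Score]:
    """Return the non-dominated points when both criteria are minimised.

    A point is dominated if some other point is no worse on both criteria
    and strictly better on at least one.
    """
    front: List[Score] = []
    best_second = float("inf")
    # Sort by the first criterion (ties broken by the second), then sweep:
    # a point survives only if it strictly improves the second criterion
    # over everything seen so far.
    for point in sorted(set(points)):
        if point[1] < best_second:
            front.append(point)
            best_second = point[1]
    return front


# Hypothetical scores for five candidate partitions.
candidates = [(0.30, 2), (0.20, 4), (0.25, 5), (0.10, 8), (0.30, 3)]
print(pareto_front(candidates))
```

No single candidate on the resulting front is best on both criteria at once, which is why a bi-criteria problem yields a set of trade-off solutions rather than one optimum.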
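
The idea of a per-instance feature contribution can be sketched as follows. This is an illustration under stated assumptions, not the interface of the rfFC package: it assumes a binary classifier, hand-built trees in which every node stores the fraction of positive-class training samples (`p_pos`), and a contribution defined as the change in that fraction at each split along the instance's path, averaged over the trees. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    p_pos: float                     # fraction of positive-class training samples in the node
    feature: Optional[str] = None    # split feature; None marks a leaf
    threshold: float = 0.0           # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_contributions(root: Node, x: Dict[str, float]) -> Dict[str, float]:
    """Sum, per feature, the change in p_pos at each split on the path of x."""
    contrib: Dict[str, float] = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contrib[node.feature] = contrib.get(node.feature, 0.0) + (child.p_pos - node.p_pos)
        node = child
    return contrib


def forest_contributions(trees: List[Node], x: Dict[str, float]) -> Dict[str, float]:
    """Average the per-tree contributions over all trees in the forest."""
    total: Dict[str, float] = {}
    for tree in trees:
        for feature, value in tree_contributions(tree, x).items():
            total[feature] = total.get(feature, 0.0) + value
    return {feature: value / len(trees) for feature, value in total.items()}


# Two tiny hand-built trees (hypothetical numbers).
tree_1 = Node(0.5, "a", 1.0,
              left=Node(0.2),
              right=Node(0.8, "b", 0.0, left=Node(0.6), right=Node(1.0)))
tree_2 = Node(0.5, "b", 0.5, left=Node(0.25), right=Node(0.75))

instance = {"a": 2.0, "b": 1.0}
print(forest_contributions([tree_1, tree_2], instance))
```

Within a single tree, the root value plus the summed contributions equals the value at the leaf the instance reaches, so under these assumptions the contributions decompose that tree's prediction across the features used along the path.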

Bibliographic details

  • Author

    Palczewska Anna Maria

  • Author affiliation
  • Year: 2014
  • Total pages
  • Original format: PDF
  • Text language: en
  • Chinese Library Classification
