
Multiple hypothesis testing and multiple outlier identification methods.

Abstract

Traditional multiple hypothesis testing procedures, such as that of Benjamini and Hochberg, fix an error rate and determine the corresponding rejection region. In 2002, Storey proposed a fixed rejection region procedure and showed numerically that it can gain more power than the fixed error rate procedure of Benjamini and Hochberg while controlling the same false discovery rate (FDR). In this thesis it is proved that when the number of alternatives is small compared to the total number of hypotheses, Storey's method can be less powerful than that of Benjamini and Hochberg. Moreover, the two procedures are compared by setting them to produce the same FDR. The difference in power between Storey's procedure and that of Benjamini and Hochberg is near zero when the distance between the null and alternative distributions is large, but Benjamini and Hochberg's procedure becomes more powerful as the distance decreases. It is shown that modifying the Benjamini and Hochberg procedure to incorporate an estimate of the proportion of true null hypotheses, as proposed by Black, gives a procedure with superior power.

The proposed Bayesian multiple outlier identification procedure is applied to some simulated data sets. Various simulation and prior parameters are used to study the sensitivity of the posteriors to the priors. The area under the ROC curve (AUC) is calculated for each combination of parameters. A factorial design analysis on AUC is carried out by choosing various simulation and prior parameters as factors. The resulting AUC values are high for various selected parameters, indicating that the proposed method can identify the majority of outliers within tolerable errors. The results of the factorial design show that the priors do not have much effect on the marginal posterior probability as long as the sample size is not too small.

In this thesis, the proposed Bayesian procedure is also applied to a real data set obtained by Kanduc et al. in 2008. The proteomes of thirty viruses examined by Kanduc et al. are found to share a high number of pentapeptide overlaps with the human proteome. In a linear regression analysis of the level of viral overlaps with the human proteome on the length of the viral proteome, Kanduc et al. report that, among the thirty viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus present relatively higher levels of overlap with the human proteome than the predicted level. The results obtained using the proposed procedure indicate that the four viruses with extremely large sizes (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) are more likely to be outliers than the three reported viruses. The results with the four extreme viruses deleted confirm the claim of Kanduc et al.

Multiple hypothesis testing can also be applied to regression diagnostics. In this thesis, a Bayesian method is proposed to test multiple hypotheses, of which the ith null and alternative hypotheses are that the ith observation is not an outlier versus that it is, for i = 1, ..., m. In the proposed Bayesian model, it is assumed that outliers have a mean shift, where the proportion of outliers and the mean shift follow a Beta prior distribution and a normal prior distribution, respectively. It is proved in the thesis that for the proposed model, when there exists more than one outlier, the marginal distributions of the deletion residual of the ith observation under both null and alternative hypotheses are doubly noncentral t distributions. The "outlyingness" of the ith observation is measured by the marginal posterior probability that the ith observation is an outlier given its deletion residual. An importance sampling method is proposed to calculate this probability. This method requires the computation of the density of the doubly noncentral F distribution, which is approximated using Patnaik's approximation. An algorithm is proposed in this thesis to examine the accuracy of Patnaik's approximation. The comparison of this algorithm's output with Patnaik's approximation shows that the latter can save massive computation time without losing much accuracy.
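To make the fixed error rate idea concrete, the following is a minimal sketch of the Benjamini and Hochberg step-up rule, together with an adaptive variant that rescales the level by an estimate of the proportion of true nulls. It is illustrative only and is not taken from the thesis: the function names, the tuning parameter `lam`, the clipping of the estimate, and the example p-values are all assumptions, and the threshold-based estimator shown here is one common choice that may differ from the one the thesis analyses.

```python
def benjamini_hochberg(pvalues, alpha=0.05, pi0=1.0):
    """Indices rejected by the BH step-up rule at FDR level alpha.

    pi0 = 1.0 gives the standard procedure. Passing an estimate pi0 < 1
    runs the same rule at level alpha / pi0 (the adaptive variant).
    """
    m = len(pvalues)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: pvalues[i])
    level = alpha / pi0
    k = 0  # largest rank whose ordered p-value is under its threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * level / m:
            k = rank
    return sorted(order[:k])


def estimate_pi0(pvalues, lam=0.5):
    """Threshold-based estimate of the proportion of true nulls.

    Assumption: p-values above lam come mostly from true nulls, which are
    uniform on [0, 1]. The result is clipped to [1/m, 1] (an illustrative
    choice) so that alpha / pi0 stays finite.
    """
    m = len(pvalues)
    raw = sum(p > lam for p in pvalues) / ((1.0 - lam) * m)
    return min(1.0, max(raw, 1.0 / m))


# Made-up p-values, for illustration only.
pvals = [0.001, 0.004, 0.012, 0.019, 0.03, 0.2, 0.45, 0.6, 0.75, 0.9]
standard = benjamini_hochberg(pvals, alpha=0.05)
pi0_hat = estimate_pi0(pvals)
adaptive = benjamini_hochberg(pvals, alpha=0.05, pi0=pi0_hat)
print(standard, pi0_hat, adaptive)
```

On these made-up p-values the adaptive variant rejects one more hypothesis than the standard rule, because the estimated null proportion below one raises every rank threshold by the same factor.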

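Bibliographic record

  • Author

    Yin, Yaling

  • Author affiliation

    The University of Saskatchewan (Canada)

  • Degree-granting institution: The University of Saskatchewan (Canada)
  • Discipline: Statistics
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 223 p.
  • Total pages: 223
  • Original format: PDF
  • Language of text: English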
