
Classification and variable selection for high dimensional multivariate binary data: AdaBoost-based new methods and a theory for the plug-in rule.



Abstract

We theoretically consider a classification problem in which all the covariates are independent Bernoulli random variables X_{ji}, 1 ≤ i ≤ n and j = 0, 1; that is, each variable takes the value 0 or 1, recording the absence or presence of an event. The Bernoulli parameters are estimated by maximum likelihood and plugged into the optimal Bayes rule; the resulting classifier is called the plug-in rule. This rule was applied to real DNA fingerprint data as well as simulations in Wilbur et al. [2002] and shown to classify well even when the independence assumption does not hold. The asymptotic performance of the plug-in rule is the primary object of this study.

Since the number of variables, and hence the number of Bernoulli parameters, depends on the sample size n, indicating the need for increasingly complex models as n grows, the usual notion of consistency, i.e., convergence of estimates to fixed parameter values, is not applicable. We introduce triangular arrays and a suitably modified definition of consistency, called persistence, based on how close the performance of the plug-in rule is to that of the classifier with known parameters p_{ji}, 1 ≤ i ≤ n and j = 0, 1. We present various cases in which the plug-in rule is or is not persistent. Under a sparsity condition, we show that the plug-in rule with well-chosen variables may overcome non-persistence. This demonstrates that variable selection can be effective for high dimensional data under a sparsity condition.

We also discuss the convergence rate of the plug-in rule over Sobolev-ball-type parameter spaces. We show that the plug-in rule with selected variables can improve the convergence rate, which means a simpler model may achieve better performance than the full model.
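The plug-in rule described above is, in essence, a Bernoulli naive-Bayes classifier with maximum-likelihood parameter estimates substituted into the Bayes rule. The following is a minimal sketch only, assuming equal class priors and clipping the estimated proportions away from 0 and 1 so the log-likelihoods stay finite; the function names and the `eps` clip are illustrative choices, not taken from the dissertation.

```python
import numpy as np

def fit_plugin(X0, X1, eps=1e-6):
    # MLE of the Bernoulli parameters p_ji = P(X_i = 1 | class j):
    # the per-coordinate sample proportions, clipped away from {0, 1}.
    p0 = np.clip(X0.mean(axis=0), eps, 1 - eps)
    p1 = np.clip(X1.mean(axis=0), eps, 1 - eps)
    return p0, p1

def plugin_classify(x, p0, p1):
    # Plug-in Bayes rule under independence (equal priors assumed):
    # pick the class with the larger product of Bernoulli likelihoods,
    # compared on the log scale.
    ll0 = np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    ll1 = np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    return int(ll1 > ll0)
```

Restricting the sums in `plugin_classify` to a selected subset of coordinates gives the variable-selected version of the rule discussed in the abstract.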
As Bickel and Levina [2004] showed that a naive Bayes model can perform better than the full model, our results likewise underpin the well-known practical finding that a model with well-chosen variables may achieve a better prediction rate than the full model, especially for high dimensional data.

In addition to the theoretical study of the plug-in rule, we propose and study a new methodology for classification and variable selection based on AdaBoost. Our applications to real and simulated data suggest the new methods perform considerably better than the plug-in rule. A theoretical study of the new methods is yet to be done.
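The abstract does not specify the AdaBoost-based methods themselves, so the following is only a hedged illustration of the general idea: textbook AdaBoost with single-covariate decision stumps on binary data, where the set of covariates the stumps select can double as a crude variable-selection device. The stump family and function names are assumptions for illustration, not the author's method.

```python
import numpy as np

def adaboost_stumps(X, y, T=20):
    # X: n-by-d binary (0/1) matrix; y: labels in {-1, +1}.
    # Each weak learner is a single covariate x_i, possibly sign-flipped.
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # sample weights
    model = []                        # list of (feature, sign, alpha)
    for _ in range(T):
        best = None
        for i in range(d):
            for s in (1, -1):
                pred = s * (2 * X[:, i] - 1)   # map {0,1} -> {-1,+1}
                err = w[pred != y].sum()       # weighted training error
                if best is None or err < best[0]:
                    best = (err, i, s, pred)
        err, i, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard log of 0
        alpha = 0.5 * np.log((1 - err) / err)  # weak-learner weight
        w *= np.exp(-alpha * y * pred)         # reweight samples
        w /= w.sum()
        model.append((i, s, alpha))
    return model

def adaboost_predict(model, X):
    # Sign of the alpha-weighted vote of the selected stumps.
    score = sum(alpha * s * (2 * X[:, i] - 1) for i, s, alpha in model)
    return np.sign(score)
```

The covariates appearing in `model` (with large total alpha) are the ones the boosting procedure found informative, which is one simple way a boosting fit yields a variable ranking.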

Record details

  • Author: Park, Junyong
  • Institution: Purdue University
  • Degree-granting institution: Purdue University
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 77 p.
  • Total pages: 77
  • Original format: PDF
  • Language: eng

