
Classification and variable selection for high dimensional multivariate binary data: AdaBoost-based new methods and a theory for the plug-in rule.



Abstract

We theoretically consider a classification problem in which all the covariates are independent Bernoulli random variables X_{ji}, 1 ≤ i ≤ n and j = 0, 1; that is, each variable takes the value 0 or 1, recording the absence or presence of an event. The Bernoulli parameters are estimated by maximum likelihood and plugged into the optimal Bayes rule; the resulting classifier is called the plug-in rule. This rule was applied to real DNA fingerprint data as well as simulations in Wilbur et al. [2002] and shown to classify well even when the independence assumption does not hold. The asymptotic performance of the plug-in rule is the primary object of this study.

Since the number of variables, and hence the number of Bernoulli parameters, depends on the sample size n, indicating the need for increasingly complex models as n grows, the usual notion of consistency, i.e., convergence of estimates to fixed parameter values, is not applicable. We introduce triangular arrays and a suitably modified definition of consistency, called persistence, based on how close the performance of the plug-in rule is to that of the classifier with known parameters p_{ji}, 1 ≤ i ≤ n and j = 0, 1. We present various cases in which the plug-in rule is or is not persistent. Under a sparsity condition, we show that the plug-in rule with well-chosen variables may overcome non-persistence. This demonstrates that variable selection can be effective for high dimensional data under a sparsity condition.

We also discuss the convergence rate of the plug-in rule over Sobolev-ball-type parameter spaces. We show that the plug-in rule with selected variables can improve the convergence rate, which means a simpler model may achieve better performance than the full model.
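The plug-in rule described above is, in essence, a Bernoulli naive-Bayes classifier with maximum-likelihood parameter estimates substituted into the Bayes rule. The following is a minimal sketch only, assuming equal class priors and clipping the estimated proportions away from 0 and 1 so the log-likelihoods stay finite; the function names and the `eps` clip are illustrative choices, not taken from the dissertation.

```python
import numpy as np

def fit_plugin(X0, X1, eps=1e-6):
    # MLE of the Bernoulli parameters p_ji = P(X_i = 1 | class j):
    # the per-coordinate sample proportions, clipped away from {0, 1}.
    p0 = np.clip(X0.mean(axis=0), eps, 1 - eps)
    p1 = np.clip(X1.mean(axis=0), eps, 1 - eps)
    return p0, p1

def plugin_classify(x, p0, p1):
    # Plug-in Bayes rule under independence (equal priors assumed):
    # pick the class with the larger product of Bernoulli likelihoods,
    # compared on the log scale.
    ll0 = np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    ll1 = np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    return int(ll1 > ll0)
```

Restricting the sums in `plugin_classify` to a selected subset of coordinates gives the variable-selected version of the rule discussed in the abstract.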
As Bickel and Levina [2004] showed that a naive Bayes model can perform better than the full model, our results likewise underpin the well-known practical finding that a model with well-chosen variables may achieve a better prediction rate than the full model, especially for high dimensional data.

In addition to the theoretical study of the plug-in rule, we propose and study a new methodology for classification and variable selection based on AdaBoost. Our applications to real and simulated data suggest the new methods perform considerably better than the plug-in rule. A theoretical study of the new methods is yet to be done.
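The abstract does not specify the AdaBoost-based methods themselves, so the following is only a hedged illustration of the general idea: textbook AdaBoost with single-covariate decision stumps on binary data, where the set of covariates the stumps select can double as a crude variable-selection device. The stump family and function names are assumptions for illustration, not the author's method.

```python
import numpy as np

def adaboost_stumps(X, y, T=20):
    # X: n-by-d binary (0/1) matrix; y: labels in {-1, +1}.
    # Each weak learner is a single covariate x_i, possibly sign-flipped.
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # sample weights
    model = []                        # list of (feature, sign, alpha)
    for _ in range(T):
        best = None
        for i in range(d):
            for s in (1, -1):
                pred = s * (2 * X[:, i] - 1)   # map {0,1} -> {-1,+1}
                err = w[pred != y].sum()       # weighted training error
                if best is None or err < best[0]:
                    best = (err, i, s, pred)
        err, i, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard log of 0
        alpha = 0.5 * np.log((1 - err) / err)  # weak-learner weight
        w *= np.exp(-alpha * y * pred)         # reweight samples
        w /= w.sum()
        model.append((i, s, alpha))
    return model

def adaboost_predict(model, X):
    # Sign of the alpha-weighted vote of the selected stumps.
    score = sum(alpha * s * (2 * X[:, i] - 1) for i, s, alpha in model)
    return np.sign(score)
```

The covariates appearing in `model` (with large total alpha) are the ones the boosting procedure found informative, which is one simple way a boosting fit yields a variable ranking.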

Record details

  • Author: Park, Junyong
  • Institution: Purdue University
  • Degree-granting institution: Purdue University
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 77 p.
  • Total pages: 77
  • Original format: PDF
  • Language: eng

