
Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks



Abstract

We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\mathbb{R}^d$ to $\mathbb{R}^d$ using deep linear neural networks, that is, algorithms that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix $\Phi$ in the case that the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least-squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We also analyze an algorithm for the case that $\Phi$ satisfies $u^\top \Phi u > 0$ for all $u$ but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^\top \Theta_L \Theta_{L-1} \cdots \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
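The positive-definite convergence result can be illustrated with a small numerical sketch. The following Python/NumPy snippet (not the authors' code; the dimension, depth, target matrix, step size, and iteration count are illustrative assumptions) runs gradient descent from identity initialization on the population quadratic loss, which for isotropic inputs with $\mathbb{E}[xx^\top] = I$ equals $\tfrac{1}{2}\|\Theta_L \cdots \Theta_1 - \Phi\|_F^2$:

```python
import numpy as np

# Minimal sketch, not the authors' implementation: gradient descent on the
# population quadratic loss for the deep linear network
# h(x) = Theta_L ... Theta_1 x, starting from Theta_i = I.
# Dimension d, depth L, the target Phi, the step size, and the number of
# iterations below are all illustrative assumptions.

d, L = 5, 10
rng = np.random.default_rng(0)

# Symmetric positive-definite target Phi (the regime where convergence from
# identity initialization is shown).
A = rng.standard_normal((d, d))
Phi = A @ A.T / d + np.eye(d)

Thetas = [np.eye(d) for _ in range(L)]         # Theta_1, ..., Theta_L, all I
eta = 1.0 / (L * np.linalg.norm(Phi, 2) ** 2)  # heuristic step size


def end_to_end(mats):
    """Return Theta_L ... Theta_1 for mats = [Theta_1, ..., Theta_L]."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P


for step in range(2001):
    P = end_to_end(Thetas)
    E = P - Phi                                # d(loss)/dP
    if step % 500 == 0:
        loss = 0.5 * np.linalg.norm(E, "fro") ** 2
        print(f"step {step:5d}  excess loss {loss:.3e}")
    grads = []
    for i in range(L):
        left = end_to_end(Thetas[i + 1:])      # Theta_L ... Theta_{i+1}
        right = end_to_end(Thetas[:i])         # Theta_{i-1} ... Theta_1
        grads.append(left.T @ E @ right.T)     # chain rule for layer i
    for i in range(L):
        Thetas[i] = Thetas[i] - eta * grads[i]
```

Consistent with the negative-eigenvalue result above, one would expect the same sketch to stall at a roughly constant excess loss if $\Phi$ is replaced by a symmetric matrix with a negative eigenvalue, since no layer can move far enough from the identity to flip the sign of the end-to-end map.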

Bibliographic record

  • Source
    Neural Computation | 2019, No. 3 | pp. 477-502 | 26 pages
  • Author affiliations

    Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA;

    Univ Calif Santa Cruz, Comp Sci Dept, Santa Cruz, CA 95064 USA;

    Google, Mountain View, CA 94043 USA;

  • Indexed in: Science Citation Index (SCI); Chemical Abstracts (CA)
  • Format: PDF
  • Language: English (eng)
  • CLC classification
  • Keywords
