
Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks



Abstract

We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\mathbb{R}^d$ to $\mathbb{R}^d$ using deep linear neural networks, that is, algorithms that learn a function $h$ parameterized by matrices $\Theta_1, \ldots, \Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} \cdots \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least-squares matrix $\Phi$ in the case that the initial hypothesis $\Theta_1 = \cdots = \Theta_L = I$ has excess loss bounded by a small enough constant. We also show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least-squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We also analyze an algorithm for the case that $\Phi$ satisfies $u^\top \Phi u > 0$ for all $u$ but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^\top \Theta_L \Theta_{L-1} \cdots \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, \ldots, \Theta_L$ so that they have the same singular values.
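The positive-definite convergence result can be illustrated with a small numerical sketch. The following Python/NumPy snippet (not the authors' code; the dimension, depth, target matrix, step size, and iteration count are illustrative assumptions) runs gradient descent from identity initialization on the population quadratic loss, which for isotropic inputs with $\mathbb{E}[xx^\top] = I$ equals $\tfrac{1}{2}\|\Theta_L \cdots \Theta_1 - \Phi\|_F^2$:

```python
import numpy as np

# Minimal sketch, not the authors' implementation: gradient descent on the
# population quadratic loss for the deep linear network
# h(x) = Theta_L ... Theta_1 x, starting from Theta_i = I.
# Dimension d, depth L, the target Phi, the step size, and the number of
# iterations below are all illustrative assumptions.

d, L = 5, 10
rng = np.random.default_rng(0)

# Symmetric positive-definite target Phi (the regime where convergence from
# identity initialization is shown).
A = rng.standard_normal((d, d))
Phi = A @ A.T / d + np.eye(d)

Thetas = [np.eye(d) for _ in range(L)]         # Theta_1, ..., Theta_L, all I
eta = 1.0 / (L * np.linalg.norm(Phi, 2) ** 2)  # heuristic step size


def end_to_end(mats):
    """Return Theta_L ... Theta_1 for mats = [Theta_1, ..., Theta_L]."""
    P = np.eye(d)
    for M in mats:
        P = M @ P
    return P


for step in range(2001):
    P = end_to_end(Thetas)
    E = P - Phi                                # d(loss)/dP
    if step % 500 == 0:
        loss = 0.5 * np.linalg.norm(E, "fro") ** 2
        print(f"step {step:5d}  excess loss {loss:.3e}")
    grads = []
    for i in range(L):
        left = end_to_end(Thetas[i + 1:])      # Theta_L ... Theta_{i+1}
        right = end_to_end(Thetas[:i])         # Theta_{i-1} ... Theta_1
        grads.append(left.T @ E @ right.T)     # chain rule for layer i
    for i in range(L):
        Thetas[i] = Thetas[i] - eta * grads[i]
```

Consistent with the negative-eigenvalue result above, one would expect the same sketch to stall at a roughly constant excess loss if $\Phi$ is replaced by a symmetric matrix with a negative eigenvalue, since no layer can move far enough from the identity to flip the sign of the end-to-end map.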

Bibliographic record

  • Source
    Neural Computation | 2019, No. 3 | pp. 477-502 | 26 pages
  • Author affiliations

    Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA;

    Univ Calif Santa Cruz, Comp Sci Dept, Santa Cruz, CA 95064 USA;

    Google, Mountain View, CA 94043 USA;

  • Indexed in: Science Citation Index (SCI); Chemical Abstracts (CA)
  • Format: PDF
  • Language: English (eng)
  • CLC classification
  • Keywords
