The resurgence of deep learning, as a highly effective machine learning paradigm, has brought back to life the old optimization question of non-convexity. Indeed, the challenges related to the large-scale nature of many modern machine learning applications are severely exacerbated by the inherent non-convexity of the underlying models. In this light, efficient optimization algorithms that can be effectively applied to such large-scale, non-convex learning problems are highly desired. However, the bulk of research in this direction has been almost completely restricted to the class of 1st-order algorithms. This is despite the fact that employing curvature information, e.g., in the form of the Hessian, can indeed yield effective methods with desirable convergence properties for non-convex problems, e.g., avoiding saddle points and converging to local minima. The conventional wisdom in the machine learning community is that the application of 2nd-order methods, i.e., those that employ the Hessian as well as gradient information, can be highly inefficient. Consequently, 1st-order algorithms, such as stochastic gradient descent (SGD), have taken center stage for solving such machine learning problems. Here, we aim to address this misconception by considering efficient and stochastic variants of Newton's method, namely sub-sampled trust-region and cubic regularization, whose theoretical convergence properties have recently been established in [Xu 2017]. Using a variety of experiments, we empirically evaluate the performance of these methods for solving non-convex machine learning applications. In doing so, we highlight the shortcomings of 1st-order methods, e.g., high sensitivity to hyper-parameters such as the step-size and undesirable behavior near saddle points, and showcase the advantages of employing curvature information as an effective remedy.
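To make the cubic-regularization variant of Newton's method concrete, the following is a minimal Python/NumPy sketch of a cubic-regularized Newton step applied to a small synthetic saddle-point problem. The toy objective, the gradient-descent inner solver, and all parameter values are illustrative assumptions rather than the setup of [Xu 2017]; in practice the cubic subproblem is solved with Lanczos/Krylov-type methods and the Hessian is replaced by a sub-sampled estimate.

```python
import numpy as np

def cubic_newton_step(g, H, sigma=1.0, inner_iters=200, lr=0.1):
    """Approximately minimize the cubic-regularized local model
        m(s) = g^T s + 0.5 * s^T H s + (sigma / 3) * ||s||^3
    by plain gradient descent on s -- a simple stand-in for the
    Lanczos/Krylov subproblem solvers used in practice."""
    s = np.zeros_like(g)
    for _ in range(inner_iters):
        # gradient of m(s): g + H s + sigma * ||s|| * s
        s = s - lr * (g + H @ s + sigma * np.linalg.norm(s) * s)
    return s

# Toy non-convex objective with a saddle point at the origin (illustrative):
#   f(x, y) = 0.5 * (x**2 - y**2) + 0.25 * y**4,  local minima at (0, +/-1)
def grad_f(z):
    x, y = z
    return np.array([x, -y + y**3])

def hess_f(z):
    _, y = z
    return np.diag([1.0, -1.0 + 3.0 * y**2])

z = np.array([0.5, 0.01])          # start close to the saddle point
for _ in range(20):
    z = z + cubic_newton_step(grad_f(z), hess_f(z), sigma=1.0)
print(z)                           # should approach a local minimum near (0, 1)
```

The negative curvature of the Hessian at the saddle point enters the cubic model directly, so the computed step moves along the escape direction even though the gradient component in that direction is tiny; a plain gradient step from the same starting point would make almost no progress along it.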