AdaHessian: a second order optimizer for deep learning

Harald Scheidl
Towards Data Science

AdaHessian on its way to the minimum. (Image by author)

Most of the optimizers used in deep learning are (stochastic) gradient descent methods: they consider only the gradient of the loss function. Second-order methods, by comparison, also take the curvature of the loss function into account, which allows better update steps to be computed (at least in theory). There are only a few second-order…
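
To make the idea concrete, here is a minimal sketch of a curvature-aware update in PyTorch. It estimates the diagonal of the Hessian with Hutchinson's method, the randomized estimator AdaHessian builds on, and divides the gradient by that estimate. The function name `hutchinson_diag_hessian` and the toy quadratic loss are illustrative assumptions, not code from the paper.

```python
# A minimal sketch (not the official AdaHessian code) of a curvature-aware
# update: the diagonal of the Hessian is estimated with Hutchinson's method,
# and the gradient is rescaled by that estimate.
import torch

def hutchinson_diag_hessian(loss, params):
    """Return the gradients and a one-sample estimate of diag(H) per parameter."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Rademacher vectors: entries are +1 or -1 with equal probability.
    zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
    # Hessian-vector products: differentiating g.z w.r.t. the parameters gives H z.
    hvps = torch.autograd.grad(grads, params, grad_outputs=zs)
    # Element-wise z * (H z) is an unbiased estimate of the Hessian diagonal.
    return grads, [z * hvp for z, hvp in zip(zs, hvps)]

# Toy usage: one curvature-scaled step on a simple quadratic loss.
w = torch.tensor([1.5, -2.0], requires_grad=True)
loss = 3.0 * w[0] ** 2 + 0.5 * w[1] ** 2
grads, diag_h = hutchinson_diag_hessian(loss, [w])
lr, eps = 0.1, 1e-8
with torch.no_grad():
    # Dividing by the curvature makes steps larger in flat directions
    # and smaller in sharp ones.
    w -= lr * grads[0] / (diag_h[0].abs() + eps)
```

The real AdaHessian optimizer refines this raw estimate further (for example with moving averages), but the core idea of dividing the gradient by a curvature estimate is already visible here.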
