AdaHessian: a second order optimizer for deep learning

Harald Scheidl
Aug 9, 2021


AdaHessian on its way to the minimum. (Image by author)

Most of the optimizers used in deep learning are (stochastic) gradient descent methods. They only consider the gradient of the loss function. In comparison, second order methods also take the curvature of the loss function into account. With that, better update steps can be computed (at least in theory). There are only a few second order methods available for deep learning. One of them is AdaHessian, published in 2020 by Yao et al. A PyTorch implementation is provided by the authors.

In their most basic form, second order methods require computing the Hessian matrix, which contains N×N elements, where N is the number of parameters (weights) in the neural network. This is infeasible for most models out there. AdaHessian contains some interesting techniques to tackle that problem. However, they might look complex at first sight. This article goes through these techniques, which efficiently compute an approximation of the Hessian matrix.

Motivation

Gradient descent models the loss function using only its first order derivatives: the algorithm takes a small enough step (controlled by the learning rate) in the direction of the negative gradient, and thereby the loss value decreases. Curvature information of the loss function (the second order derivatives, collected in the Hessian matrix) is ignored.

Fig. 1 shows the loss function (green) and the gradient (red) at the current position. The left plot shows the case where the gradient exactly matches the loss function locally. The right plot shows the case where the loss function turns upwards when moving in the direction of the negative gradient. While it might make sense to apply a large update step in the left plot, a smaller update step is needed in the right plot to avoid overshooting the minimum.

Fig. 1: Loss function f(w) (green) and its gradient at w=-1 (red). (Image by author)

The Newton update step

Newton’s method handles exactly this: it takes both the gradient and the curvature at the current position into account. It approximates the loss function f(w) (with w the parameters or weights of the model) locally by a quadratic model. The update step is derived as follows (see Fig. 2; the formulas are written out below the figure):

  • A quadratic model m is computed using a Taylor polynomial
  • g is the gradient and H the Hessian matrix
  • To find the minimum of that model m, compute the derivative of the model, set it to zero, and solve for the update step Δw
  • This gives the Newton update with the negative gradient scaled and rotated by the inverse of the Hessian
Fig. 2: Derive update step for Newton’s method. (Image by author)
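
Written out (the same formulas that Fig. 2 shows, here in LaTeX notation), the derivation is:

    % Quadratic (second order Taylor) model of the loss around the current weights w:
    m(\Delta w) = f(w) + g^{T} \Delta w + \tfrac{1}{2}\, \Delta w^{T} H \, \Delta w

    % Setting the derivative of the model with respect to the step to zero ...
    \nabla_{\Delta w}\, m(\Delta w) = g + H \Delta w = 0

    % ... and solving for the step gives the Newton update:
    \Delta w = -H^{-1} g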

Both the gradient and the Hessian can be computed with a deep learning framework like PyTorch, which is everything needed to apply Newton’s method. However, for a neural network model with N parameters, the Hessian matrix consists of N×N elements, which is not feasible for typical models. AdaHessian therefore approximates the Hessian with a diagonal matrix, which only consists of N elements (the same size as the gradient vector).

Compute diagonal of Hessian

We now have the Newton update formula, and we restrict the Hessian approximation to a diagonal matrix. Let’s see how this diagonal is computed.

Hutchinson’s method

Fig. 3 shows the formula of Hutchinson’s method, which computes the diagonal elements of the Hessian:

  • Create a random vector z by flipping a coin for each of its elements: +1 for heads, -1 for tails. In the 2D case, z could for example be (1, -1)
  • Compute the matrix-vector product H·z
  • Multiply z element-wise (denoted by ⊙) with the result from the previous step: z⊙(H·z)
  • Compute the expected value (in a real implementation: the average over multiple samples of the z vector)
Fig. 3: Hutchinson’s method to compute diagonal of H. (Image by author)
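
In formula form (LaTeX notation, matching Fig. 3), Hutchinson’s estimator reads:

    % Hutchinson's estimator for the diagonal of the Hessian:
    \operatorname{diag}(H) = \mathbb{E}\!\left[\, z \odot (H z) \,\right]

    % with every element of z drawn independently as
    z_i = \pm 1 \quad \text{each with probability } \tfrac{1}{2}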

This looks strange at first sight. We already need H to get the diagonal elements of H — this does not sound very clever. But it turns out that we only need to know the result of H·z (a vector), which is easy to compute, so we never need to know the actual elements of the full Hessian matrix.

But how does Hutchinson’s method work? Writing it down for the 2D case (Fig. 4) makes it easy to see. Element-wise multiplication means multiplying the vectors row-wise. Inspecting the result of z⊙(H·z) shows terms with z₁² (and z₂²) and terms with z₁·z₂ in it. When computing z⊙(H·z) for multiple trials, z₁² (and z₂²) is always +1, while z₁·z₂ gives +1 in 50% of the trials and -1 in the other 50% (simply write down all possible products: 1·1, 1·(-1), (-1)·1, (-1)·(-1)). When computing the mean over multiple trials, the terms containing z₁·z₂ tend to zero, and we are left with a vector of the diagonal elements of H.

Fig. 4: Hutchinson’s method in the 2D case. (Image by author)
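
This argument is easy to check numerically. The following PyTorch snippet (a toy sketch; the 2×2 matrix standing in for H is hand-picked for illustration) averages z⊙(H·z) over many random z vectors and recovers the diagonal:

    import torch

    torch.manual_seed(0)

    # A hand-picked symmetric 2x2 matrix standing in for the Hessian
    H = torch.tensor([[2.0, 0.5],
                      [0.5, 1.0]])

    samples = []
    for _ in range(10000):
        # Rademacher vector: each element is +1 or -1 with probability 0.5
        z = torch.randint(0, 2, (2,)).float() * 2 - 1
        samples.append(z * (H @ z))          # z ⊙ (H·z)

    print(torch.stack(samples).mean(dim=0))  # ≈ tensor([2., 1.]) = diagonal of H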

Matrix vector product

There is only one problem left: we don’t have the Hessian matrix H which is used in Hutchinson’s method. However, as already mentioned, we don’t actually need the Hessian itself. It is enough to have the result of H·z. It is computed with the help of PyTorch’s automatic differentiation functions: if we take the gradient already computed by PyTorch, multiply it by z and sum the result (i.e. form the dot product of g and z), and differentiate that scalar w.r.t. the parameter vector w, we get the same result as if we computed H·z directly. So we can compute H·z without knowing the elements of H. Fig. 5 shows why this is true: differentiating the gradient one more time gives the Hessian, and z is treated as a constant.

Fig. 5: Equality of what we compute with PyTorch’s automatic differentiation (left) and the matrix vector product H·z that we need for Hutchinson’s method (right). (Image by author)
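
In code, this could look roughly as follows (a minimal PyTorch sketch; the parameter vector and the loss are made up just to show the two autograd calls):

    import torch

    # Made-up setup: a parameter vector w and a scalar loss with a non-trivial Hessian
    w = torch.randn(5, requires_grad=True)
    loss = (w ** 3).sum()

    # First derivative; keep the graph so we can differentiate a second time
    g = torch.autograd.grad(loss, w, create_graph=True)[0]

    # Rademacher vector z (+1/-1 entries)
    z = torch.randint(0, 2, w.shape).float() * 2 - 1

    # Differentiating the scalar g·z w.r.t. w gives H·z, without ever forming H
    hz = torch.autograd.grad((g * z).sum(), w)[0]

    print(z * hz)  # one sample of Hutchinson's estimate of diag(H); here diag(H) = 6·w

Note that a single sample is exact in this toy example only because its Hessian happens to be diagonal; in general, several z samples are averaged.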

Putting it all together

We now have the gradient and the diagonal values of the Hessian, so we could directly apply the Newton update step. However, AdaHessian follows a similar approach to the Adam optimizer: it averages over multiple time steps to get a smoother estimate of the gradient and the Hessian.

A single update step looks as follows:

  • Compute current gradient
  • Compute current diagonal Hessian approximation
  • Average gradient by mixing it with earlier values
  • Average Hessian by mixing it with earlier values, and take care that the diagonal elements are positive (otherwise, we could end up moving in an ascent direction)
  • Compute Newton update step, including a learning rate parameter to avoid too large steps leading to divergence
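
Sketched in plain PyTorch, such an update step could look roughly like this (a simplified, illustrative version following the bullet points above, not the authors’ implementation, which adds further details such as bias correction; all names are chosen for illustration):

    import torch

    def adahessian_like_step(w, loss_fn, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        # Current loss and gradient; keep the graph so we can differentiate a second time
        loss = loss_fn(w)
        g = torch.autograd.grad(loss, w, create_graph=True)[0]

        # Current diagonal Hessian estimate via Hutchinson's method (a single z sample)
        z = torch.randint(0, 2, w.shape).float() * 2 - 1
        hz = torch.autograd.grad((g * z).sum(), w)[0]
        h_diag = z * hz

        # Exponential moving averages as in Adam; squaring plus the square root below
        # keeps the Hessian term positive, so we never step in an ascent direction
        state["m"] = beta1 * state["m"] + (1 - beta1) * g.detach()
        state["v"] = beta2 * state["v"] + (1 - beta2) * h_diag.detach() ** 2

        # Newton-like update: averaged gradient scaled by the averaged Hessian diagonal,
        # damped by a learning rate to avoid too large steps
        with torch.no_grad():
            w -= lr * state["m"] / (state["v"].sqrt() + eps)
        return loss.item()

    # Toy usage: minimize a simple quadratic
    w = torch.tensor([3.0, -2.0], requires_grad=True)
    state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w)}
    for _ in range(50):
        adahessian_like_step(w, lambda p: (p ** 2).sum(), state)
    print(w.detach())  # has moved towards the minimum at (0, 0)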

In Fig. 6 a simple quadratic function is minimized with gradient descent and AdaHessian. Here, AdaHessian is used without momentum. On this toy example, when compared to gradient descent, it converges faster and does not need tuning of the learning rate. Of course, minimizing a quadratic function favors second order optimizers.

Fig. 6: 3 steps towards the minimum of a quadratic function. Left: gradient descent with an appropriate learning rate. Right: Newton’s method using a diagonal Hessian approximation. (Image by author)
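
For reference, here is a rough reconstruction of this kind of toy comparison (not the exact setup behind Fig. 6; the quadratic function, the number of steps and the learning rate are made up for illustration):

    import torch

    def loss_fn(w):
        # Made-up quadratic with different curvature per dimension: H = diag(4, 1)
        return 2.0 * w[0] ** 2 + 0.5 * w[1] ** 2

    for method in ("gradient descent", "diagonal Newton"):
        w = torch.tensor([2.0, 2.0], requires_grad=True)
        for _ in range(3):
            g = torch.autograd.grad(loss_fn(w), w, create_graph=True)[0]
            z = torch.randint(0, 2, w.shape).float() * 2 - 1
            h_diag = z * torch.autograd.grad((g * z).sum(), w)[0]  # Hutchinson, one sample
            with torch.no_grad():
                if method == "gradient descent":
                    w -= 0.2 * g      # the learning rate has to be chosen by hand
                else:
                    w -= g / h_diag   # Newton step with the diagonal Hessian, no learning rate
        print(method, w.detach())     # the diagonal Newton variant reaches (0, 0) in one step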

Conclusion

AdaHessian solves the task of computing the Hessian by (1) using a diagonal approximation, (2) applying Hutchinson’s method, and (3) using the automatic differentiation functions included in PyTorch.

