I wrote this post because I was frustrated to find no convincing theoretical explanation of the success of the Adam optimizer (Kingma & Ba, 2014). More precisely, Adam is the RMSProp optimizer (Hinton, 2012) plus momentum. While momentum is easy to understand as a way to deal with a badly conditioned loss landscape, the RMSProp update rule is often unintuitive when coming from the world of quadratic optimization; in particular, the square root in the denominator is quite intriguing. Fortunately, I stumbled on an article that gave a somewhat satisfying answer (Aitchison, 2020). However, I found the derivation somewhat convoluted, particularly the introduction of Ornstein–Uhlenbeck dynamics. In this article, I describe a Bayesian derivation of the RMSProp update rule, close to the one proposed in the mentioned article, but conceptually simpler.
The derivation uses the framework of Bayesian estimation/filtering, unlike other more common geometric approaches.
Let's begin! We will first demonstrate how SGD (Stochastic Gradient Descent) can be derived from Bayesian principles and then show how Adam refines it.
Bayesian derivation
Maximum likelihood principle
Usually, we train models using the maximum likelihood principle: we want to find the most likely parameters θ∗ of our model given the observation of our dataset D. More formally, we are searching:
$$\theta^* = \arg\max_\theta\, p(\theta \mid D)$$
Bayes' theorem is usually used here to get:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
And using the log likelihood instead of the likelihood, we get:

$$\log p(\theta \mid D) = \log p(D \mid \theta) + \log p(\theta) - \log p(D)$$
This motivates the use of a loss function as the optimization objective, with weight regularization (e.g., weight decay): they model the likelihood and prior terms, respectively, while the evidence term does not depend on θ and can be dropped. The analysis generally proceeds by introducing an optimizer to find the best parameters using geometric arguments. It then explains the difficulty of obtaining the true gradient and introduces Stochastic Gradient Descent (SGD) as a solution to this issue.
Going further to discover SGD
In this section, we further apply Bayesian methods to gain deeper insight into the problem.
In practice, the model sees dataset samples sequentially. At each step, we present the model only a subset B of the dataset (a "batch"). Suppose we are already at some step t of our training and we have an estimate θt of θ∗. We have a new batch B, and we want to update our estimate using the information it contains. As before, we can express this problem as a maximum likelihood estimation problem. However, this time we will not assume we know the whole dataset, only the previous parameter estimate and the new batch:
$$\theta^* = \arg\max_\theta\, p(\theta \mid B, \theta_t)$$
We can use Bayes' theorem here too:

$$p(\theta \mid B, \theta_t) = \frac{p(B \mid \theta)\, p(\theta \mid \theta_t)}{p(B \mid \theta_t)}$$
Taking the log likelihood and getting rid of constant (wrt θ) terms, we get:
$$\underbrace{\log p(\theta \mid B, \theta_t)}_{\text{new belief}} \;\simeq\; \underbrace{\log p(B \mid \theta)}_{\text{new evidence}} + \underbrace{\log p(\theta \mid \theta_t)}_{\text{past belief}}$$
Now, if we decide to model the term p(θ∣θt) by a normal distribution N(θt, σ²I) with a variance σ² chosen to represent our uncertainty about the value of θ∗ prior to observing B, we get:
$$\log p(\theta \mid B, \theta_t) \simeq \log p(B \mid \theta) - \frac{1}{2\sigma^2}\|\theta - \theta_t\|^2$$
The second term forces the optimal value to stay relatively close to θt. We can thus use a linear approximation of the loss function L(θ) = −log p(B∣θ) near θt, and the problem becomes the following minimization problem:

$$\theta_{t+1} = \arg\min_\theta\; \nabla L(\theta_t)^\top (\theta - \theta_t) + \frac{1}{2\sigma^2}\|\theta - \theta_t\|^2 = \theta_t - \sigma^2\, \nabla L(\theta_t)$$
This is the stochastic gradient descent algorithm. This derivation shows that SGD can be understood and derived using Bayesian principles. The learning rate represents the variance of our belief about the optimal parameters. As a bonus, within this framework, learning rate decay has a clear interpretation: an increase in confidence in the estimate of the best parameters as training progresses. A learning rate decreasing like 1/t would match the Cramér-Rao bound convergence rate.
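To make this concrete, here is a minimal sketch of the resulting update rule on a toy quadratic loss (the loss, the target, and the value of σ² are arbitrary illustrative choices of mine, not part of the derivation):

```python
import numpy as np

def sgd_step(theta, grad, sigma2):
    """One SGD step: the learning rate is the variance sigma^2 of our
    belief p(theta | theta_t) = N(theta_t, sigma^2 I)."""
    return theta - sigma2 * grad

# Toy loss L(theta) = 0.5 * ||theta - target||^2, with gradient theta - target.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
for _ in range(200):
    grad = theta - target
    theta = sgd_step(theta, grad, sigma2=0.1)

print(np.allclose(theta, target, atol=1e-6))  # True: the estimate converged
```

A decaying learning rate schedule would correspond to shrinking σ² over the course of training, as our confidence in θt grows.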
Going even further to discover RMSProp
The question is, can we derive other popular optimization algorithms from this framework? In particular, can we estimate σ2 instead of guessing its value?
In our derivation of SGD, we used a linear approximation for L(θ) = −log p(B∣θ). We will see that we can actually do better using a quadratic approximation. Indeed, the Taylor expansion theorem states that we can do the following approximation:

$$L(\theta) \approx L(\theta_t) + g^\top (\theta - \theta_t) + \frac{1}{2} (\theta - \theta_t)^\top G\, (\theta - \theta_t)$$

where g is the gradient of L at θt and G its Hessian.
However, even in this form, the approximation is not usable because the matrix G is generally too big to compute and store. But if we approximate the Hessian by the expected outer product of the gradients, E[gg⊤] (the Fisher approximation), and suppose that the gradients are approximately centered and uncorrelated, the off-diagonal terms vanish, and we can estimate G as a diagonal matrix:
$$G \approx \mathrm{diag}\!\left(\mathbb{E}[g_1^2], \ldots, \mathbb{E}[g_d^2]\right)$$
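This diagonal approximation is easy to illustrate numerically: with independent, zero-mean gradient components, a Monte-Carlo estimate of E[gg⊤] is close to diagonal (synthetic "gradients"; the per-component scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw gradients with independent, zero-mean components; the expected
# outer product E[g g^T] is then (approximately) diagonal.
g = rng.normal(size=(100_000, 3)) * np.array([1.0, 2.0, 0.5])
G_full = g.T @ g / len(g)            # Monte-Carlo estimate of E[g g^T]
G_diag = np.mean(g**2, axis=0)       # diag(E[g_1^2], ..., E[g_d^2])

off_diagonal = G_full - np.diag(np.diag(G_full))
print(np.max(np.abs(off_diagonal)) < 0.05)   # True: off-diagonal terms ~ 0
print(np.allclose(np.diag(G_full), G_diag))  # True: diagonal matches E[g_i^2]
```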
The algorithm can now be decomposed component-wise. Denoting by s the estimated vector containing the diagonal elements of G (obtained by an exponential moving average) and by σ² the diagonal of Σ, the covariance of our belief distribution (the term η² inflates the variance at each step; its role is discussed below), we get:

$$\begin{aligned}
s_t &= \beta s_{t-1} + (1-\beta)\, g^2 && \text{update of the moving average of squared gradients}\\
\sigma_t^2 &= \left(s_t + \frac{1}{\sigma_{t-1}^2}\right)^{-1} + \eta^2 && \text{update of the covariance of the belief distribution}\\
\theta_t &= \theta_{t-1} - \sigma_t^2\, g && \text{update of the mean of the belief distribution}
\end{aligned}$$
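Here is a minimal sketch of these three updates applied to a toy quadratic loss (all constants, the loss, and the function name are my own illustrative choices):

```python
import numpy as np

def bayesian_step(theta, sigma2, s, grad, beta=0.9, eta=1e-3):
    """One step of the component-wise Bayesian optimizer.
    s      : moving average of squared gradients (estimate of diag(G))
    sigma2 : per-parameter variance of the belief distribution
    All operations are element-wise."""
    s = beta * s + (1 - beta) * grad**2           # squared-gradient statistic
    sigma2 = 1.0 / (s + 1.0 / sigma2) + eta**2    # belief covariance
    theta = theta - sigma2 * grad                 # belief mean
    return theta, sigma2, s

# Toy quadratic: L(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
sigma2 = np.full(2, 0.5)   # initial belief variance (a free choice)
s = np.zeros(2)
for _ in range(5000):
    grad = theta - target
    theta, sigma2, s = bayesian_step(theta, sigma2, s, grad)

print(np.linalg.norm(theta - target) < 0.1)
```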
Linking the new optimizer to Adam
You may not immediately recognize how close this is to the RMSProp update rule. Let's push the analysis a bit further to see how they are connected.
To see to what value the covariance matrix converges, we can search for a fixed point verifying:
$$\Sigma = (G + \Sigma^{-1})^{-1} + \eta^2 I$$
In our simplified optimizer, we replaced G by s and Σ by σ², and all the operations are element-wise. We can solve this equation easily using scalar arithmetic:

$$\sigma^2 = \left(s + \frac{1}{\sigma^2}\right)^{-1} + \eta^2 \quad\Longleftrightarrow\quad s\,(\sigma^2)^2 - \eta^2 s\, \sigma^2 - \eta^2 = 0$$

This is a second-order polynomial in σ²; let's compute its discriminant:
$$\Delta = \eta^4 s^2 + 4 s \eta^2 = \eta^2 s\,(\eta^2 s + 4)$$
And the unique positive solution:
$$\sigma^2 = \frac{\eta^2 s + \sqrt{\eta^2 s\,(\eta^2 s + 4)}}{2s} = \frac{\eta^2}{2} + \frac{\eta}{\sqrt{s}}\sqrt{\frac{\eta^2 s}{4} + 1}$$
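Since closed forms like this are easy to get wrong, here is a quick numerical check (over a few arbitrary values of η and s) that this expression is indeed a fixed point of the scalar recursion σ² = (s + 1/σ²)⁻¹ + η²:

```python
import numpy as np

# Verify that sigma^2 = eta^2/2 + (eta/sqrt(s)) * sqrt(eta^2*s/4 + 1)
# satisfies sigma^2 = 1/(s + 1/sigma^2) + eta^2 for several (eta, s) pairs.
for eta in [1e-3, 0.1, 1.0]:
    for s in [0.5, 1.0, 10.0]:
        sigma2 = eta**2 / 2 + (eta / np.sqrt(s)) * np.sqrt(eta**2 * s / 4 + 1)
        fixed_point = 1.0 / (s + 1.0 / sigma2) + eta**2
        assert np.isclose(sigma2, fixed_point)
print("fixed point verified")
```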
We can also get a similar result without supposing G and Σ diagonal, using the full matrices. For completeness, you can find the derivation here, even if the resulting algorithm is probably impractical.
We search Σ such that:
$$\begin{aligned}
\Sigma &= (G + \Sigma^{-1})^{-1} + \eta^2 I \\
\Sigma\,(G + \Sigma^{-1}) &= I + \eta^2 (G + \Sigma^{-1}) && \text{multiplying by } (G + \Sigma^{-1}) \text{ on the right}\\
\Sigma G &= \eta^2 (G + \Sigma^{-1}) && \text{expanding the left-hand side and cancelling } I\\
\Sigma^2 &= \eta^2 (\Sigma + G^{-1}) && \text{multiplying by } \Sigma \text{ on the left and } G^{-1} \text{ on the right}
\end{aligned}$$
Note that we supposed that G is invertible. Now, substituting Σ² by ((Σ − (η²/2)I) + (η²/2)I)² (inspired by the canonical form of a second-order polynomial), we get:

$$\left(\Sigma - \frac{\eta^2}{2} I\right)^2 = \eta^2 G^{-1} + \frac{\eta^4}{4} I
\quad\Longrightarrow\quad
\Sigma = \frac{\eta^2}{2} I + \left(\eta^2 G^{-1} + \frac{\eta^4}{4} I\right)^{1/2}$$
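The resulting fixed point, Σ = (η²/2)I + (η²G⁻¹ + (η⁴/4)I)^{1/2}, can be checked numerically on a random symmetric positive-definite G (a sketch with arbitrary constants; `spd_sqrt` is my own helper, built on an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.1

# Random symmetric positive-definite G (shifted Gram matrix).
A = rng.normal(size=(4, 4))
G = A @ A.T + 4 * np.eye(4)

def spd_sqrt(M):
    """Matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(w)) @ V.T

I = np.eye(4)
G_inv = np.linalg.inv(G)
Sigma = eta**2 / 2 * I + spd_sqrt(eta**2 * G_inv + eta**4 / 4 * I)

# Sigma should satisfy the matrix fixed-point equation.
lhs = Sigma
rhs = np.linalg.inv(G + np.linalg.inv(Sigma)) + eta**2 * I
print(np.allclose(lhs, rhs))  # True
```

The check works because Σ is a function of G, so the two matrices commute and the scalar derivation applies eigenvalue by eigenvalue.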
When η is small enough (i.e., our quadratic approximation is good), we can neglect the term η²s/4 relative to 1, leading to the following approximation:
$$\sigma^2 \approx \frac{\eta^2}{2} + \frac{\eta}{\sqrt{s}}$$
Again, when η is small, we can neglect the first term. Finally, we get:
$$\sigma^2 \approx \frac{\eta}{\sqrt{s}}$$
Plugging this into our parameter update rule, we get the following approximation of the Bayesian optimizer described before:
$$\begin{aligned}
s_t &= \beta s_{t-1} + (1-\beta)\, g^2 && \text{update of the moving average of squared gradients}\\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{s_t}}\, g && \text{update of the mean of the belief distribution}
\end{aligned}$$
This is the RMSProp optimizer update. It has a benefit over the full Bayesian optimizer: it needs one fewer statistic to be stored, reducing memory usage.
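A sketch of this update on the same kind of toy quadratic loss (constants arbitrary; the small `eps` term is the usual numerical-stability constant from practical RMSProp implementations, not part of the derivation):

```python
import numpy as np

def rmsprop_step(theta, s, grad, beta=0.9, eta=1e-2, eps=1e-8):
    """One RMSProp step: moving average of squared gradients, then a
    gradient step scaled by eta / sqrt(s)."""
    s = beta * s + (1 - beta) * grad**2
    theta = theta - eta / (np.sqrt(s) + eps) * grad
    return theta, s

# Toy quadratic: L(theta) = 0.5 * ||theta - target||^2.
target = np.array([3.0, -1.0])
theta, s = np.zeros(2), np.zeros(2)
for _ in range(2000):
    grad = theta - target
    theta, s = rmsprop_step(theta, s, grad)

print(np.allclose(theta, target, atol=0.05))
```

Note that near the optimum the iterates hover within roughly η of the target, which is the usual sign-SGD-like behaviour of RMSProp on noise-free problems.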
Choosing η
The parameter η is introduced to account for the fact that we are using approximations which reduce our confidence in the estimates. It's interesting to study the optimizer in different regimes.
Large η
When η is large (i.e., we have little confidence in our approximation) and η²s/4 ≫ 1, we can make the following approximation:
$$\sigma^2 = \frac{\eta^2}{2} + \frac{\eta}{\sqrt{s}}\sqrt{\frac{\eta^2 s}{4} + 1} \approx \frac{\eta^2}{2} + \frac{\eta}{\sqrt{s}}\sqrt{\frac{\eta^2 s}{4}} = \frac{\eta^2}{2} + \frac{\eta^2}{2} = \eta^2$$
and the update rule is:
$$\theta_t = \theta_{t-1} - \eta^2\, g$$
This is the SGD algorithm. Thus, both RMSProp and SGD can be seen as approximations of two different regimes of a general Bayesian optimizer.
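A quick numerical sanity check of this regime (values picked by me so that η²s/4 = 100 ≫ 1):

```python
import numpy as np

# In the large-eta regime (eta^2 * s / 4 >> 1) the exact fixed point
# approaches eta^2, i.e. the update degenerates to SGD with rate eta^2.
eta, s = 10.0, 4.0                       # eta^2 * s / 4 = 100 >> 1
sigma2 = eta**2 / 2 + (eta / np.sqrt(s)) * np.sqrt(eta**2 * s / 4 + 1)
print(abs(sigma2 / eta**2 - 1) < 0.01)   # True: within 1% of eta^2
```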
Small η
We already supposed that η is small, but what happens when η → 0? In this case, the fixed point also converges to zero, and we cannot use this approach directly. Instead, we can use the full update step, in which the precision (inverse variance) grows by s at every step:

$$\frac{1}{\sigma_t^2} = s + \frac{1}{\sigma_{t-1}^2} \quad\Longrightarrow\quad \sigma_t^2 \approx \frac{1}{t\, s}\,, \qquad \theta_t = \theta_{t-1} - \frac{1}{t\, s}\, g$$
Interestingly, this is exactly the well-known natural gradient descent algorithm with a learning rate that decreases as 1/t; it is a bit like an "online" natural gradient optimizer. For reference, the usual update rule is the same, except that t represents the total number of samples in the dataset. The variance decreases as (1/t)G⁻¹, which is expected: as we gather one datapoint at a time, our estimator matches the convergence rate of the Cramér-Rao bound.
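The 1/t behaviour can be checked directly from the η = 0 variance recursion (the constant squared-gradient statistic s and initial variance are arbitrary choices of mine):

```python
import numpy as np

# With eta = 0, the precision 1/sigma_t^2 grows by s at every step,
# so sigma_t^2 ~ 1/(t*s): a natural-gradient step with a 1/t learning rate.
s, sigma2 = 2.0, 1.0     # constant squared-gradient statistic, initial variance
for t in range(1, 1001):
    sigma2 = 1.0 / (s + 1.0 / sigma2)

print(np.isclose(sigma2, 1.0 / (1000 * s + 1.0)))  # True: 1/(t*s + 1/sigma_0^2)
```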
Conclusion
In this article, I provided a hopefully more compelling explanation for the success of the Adam optimizer in deep learning than the usual regret-based analyses. It changes the interpretation of RMSProp from an "optimization" algorithm searching for a good solution through a loss landscape to a "learning procedure" that incorporates information efficiently by updating a belief using Bayes' rule. From this perspective, Adam is probably close to optimal as a quadratic extension of SGD.