
Why is the Adam Optimizer Working so Well?

1 February 2026 · deep learning | math


I wrote this post because I was frustrated to find no convincing theoretical explanation of the success of the Adam optimizer (Kingma, 2014). More precisely, Adam is the RMSProp optimizer (Hinton, 2012) plus momentum. Momentum is easy to understand as a way to deal with a badly conditioned loss landscape, but the RMSProp update rule is often unintuitive when coming from the world of quadratic optimization; in particular, the square root in the denominator is quite intriguing. Fortunately, I stumbled on an article that gave a somewhat satisfying answer (Aitchison, 2020). However, I found its derivation somewhat convoluted, particularly the introduction of Ornstein–Uhlenbeck dynamics. In this article, I describe a Bayesian derivation of the RMSProp update rule, close to the one proposed in that article but conceptually simpler.

The derivation uses the framework of Bayesian estimation/filtering, unlike other more common geometric approaches.

Let's begin! We will first demonstrate how SGD (Stochastic Gradient Descent) can be derived from Bayesian principles and then show how Adam refines it.

Bayesian derivation

Maximum likelihood principle

Usually, we train models using the maximum likelihood principle: we want to find the most likely parameters $\v{\theta}^*$ of our model given the observation of our dataset $\mathcal{D}$. More formally, we are searching for:

$$\v{\theta}^* = \arg\max_{\v{\theta}} p(\v{\theta} | \mathcal{D})$$

Bayes' theorem is usually used here to get:

$$p(\v{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\v{\theta})\, p(\v{\theta})}{p(\mathcal{D})}$$

And using the log likelihood instead of the likelihood, we get:

$$\begin{align*} \log p(\v{\theta}|\mathcal{D}) &= \log p(\mathcal{D}|\v{\theta}) + \log p(\v{\theta}) - \log p(\mathcal{D}) \\ &\smile \underbrace{\log p(\mathcal{D}|\v{\theta})}_\text{likelihood} + \underbrace{\log p(\v{\theta})}_\text{prior} \end{align*}$$

where $\smile$ denotes equality up to an additive constant with respect to $\v{\theta}$.

This motivates the use of the loss function as an optimization objective with regularization of the weights (e.g., weight decay) to model the likelihood and prior terms, respectively. The analysis generally proceeds by introducing an optimizer to find the best parameters using geometric arguments. It then explains the difficulty of obtaining the true gradient and introduces Stochastic Gradient Descent (SGD) as a solution to this issue.
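As a concrete instance of the likelihood + prior decomposition, a Gaussian prior on the weights turns into an L2 penalty (weight decay). Here is a minimal numpy sketch; the toy negative log-likelihood and all numbers are made up for illustration:

```python
import numpy as np

# MAP objective: -log p(theta|D) = -log p(D|theta) - log p(theta) + const.
# A Gaussian prior N(0, prior_var * I) contributes 0.5 * ||theta||^2 / prior_var.
def neg_log_posterior(theta, nll, prior_var):
    return nll(theta) + 0.5 * np.sum(theta**2) / prior_var

# Toy negative log-likelihood (made up): quadratic around theta = 1.
nll = lambda th: 0.5 * np.sum((th - 1.0) ** 2)

val = neg_log_posterior(np.array([0.5]), nll, prior_var=2.0)
```

The prior variance plays the role of the inverse weight-decay coefficient: a tighter prior means stronger regularization.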

Going further to discover SGD

In this section, we further apply Bayesian methods to gain deeper insight into the problem.

In practice, the model sees dataset samples sequentially. At each step, we present the model only a subset $\mathcal{B}$ of the dataset (a "batch"). Suppose we are already at some step $t$ of our training and we have an estimate $\v{\theta}_t$ of $\v{\theta}^*$. We have a new batch $\mathcal{B}$, and we want to update our estimate using the information it contains. As before, we can express this problem as a maximum likelihood estimation problem. However, this time we will not assume we know the whole dataset, only the previous parameter estimate and the new batch:

$$\v{\theta}^* = \arg\max_{\v{\theta}} p(\v{\theta}|\mathcal{B}, \v{\theta}_t)$$

We can use Bayes' theorem here too:

$$p(\v{\theta}|\mathcal{B}, \v{\theta}_t) = \frac{p(\mathcal{B}|\v{\theta})\, p(\v{\theta}|\v{\theta}_t)}{p(\mathcal{B}|\v{\theta}_t)}$$

Taking the log-likelihood and dropping terms that are constant with respect to $\v{\theta}$, we get:

$$\begin{align*} \underbrace{\log p(\v{\theta}|\mathcal{B}, \v{\theta}_t)}_\text{new belief} &\smile \underbrace{\log p(\mathcal{B}|\v{\theta})}_\text{new evidence} + \underbrace{\log p(\v{\theta}|\v{\theta}_t)}_\text{past belief} \end{align*}$$

Now, if we decide to model the term $p(\v{\theta}|\v{\theta}_t)$ by a normal distribution $\mathcal{N}(\v{\theta}_t, \sigma^2 \v{I})$ with a variance $\sigma^2$ chosen to represent our uncertainty about the value of $\v{\theta}^*$ prior to observing $\mathcal{B}$, we get:

$$\begin{align*} \log p(\v{\theta}|\mathcal{B}, \v{\theta}_t) &\smile \log p(\mathcal{B}|\v{\theta}) - \frac{1}{2\sigma^2}\Vert \v{\theta} - \v{\theta}_t\Vert^2 \end{align*}$$

The second term forces the optimal value to stay relatively close to $\v{\theta}_t$. We can thus use a linear approximation of the loss function $\mathcal{L}(\v{\theta}) = -\log p(\mathcal{B}|\v{\theta})$ near $\v{\theta}_t$; the problem then becomes the following minimization problem:

$$\begin{align*} \v{\theta}_{t+1} &\approx \arg\min_{\v{\theta}} \nabla \mathcal{L}(\v{\theta}_t)^\top (\v{\theta} - \v{\theta}_t) + \frac{1}{2\sigma^2} \Vert \v{\theta} - \v{\theta}_t \Vert^2 \\ &= \v{\theta}_t - \underbrace{\sigma^2}_{\triangleq \eta} \nabla \mathcal{L}(\v{\theta}_t) \\ &= \v{\theta}_t - \eta \nabla \mathcal{L}(\v{\theta}_t) \end{align*}$$

This is the stochastic gradient descent algorithm. This derivation shows that SGD can be understood and derived from Bayesian principles. The learning rate represents the variance of our belief about the optimal parameters. As a bonus, within this framework, decaying learning rate schedules have a clear interpretation: our confidence in the estimate of the best parameters increases as training progresses. A learning rate decreasing like $1/t$ would match the Cramér-Rao bound convergence rate.
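This proximal reading of SGD is easy to check numerically. A one-dimensional numpy sketch (the quadratic loss and all numbers are made up for illustration): minimizing the linearized loss plus the closeness penalty on a fine grid recovers the SGD step.

```python
import numpy as np

# Toy loss L(theta) = 0.5 * theta**2, so grad L(theta_t) = theta_t.
theta_t, sigma2 = 2.0, 0.1
grad = theta_t

# Proximal objective from the derivation: linearized loss + closeness penalty.
thetas = np.linspace(theta_t - 1.0, theta_t + 1.0, 200_001)
objective = grad * (thetas - theta_t) + (thetas - theta_t) ** 2 / (2 * sigma2)

theta_brute = thetas[np.argmin(objective)]  # brute-force minimizer on a grid
theta_sgd = theta_t - sigma2 * grad         # closed-form SGD step
```

The two minimizers agree up to the grid resolution, with the prior variance playing the role of the learning rate.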

Going even further to discover RMSProp

The question is, can we derive other popular optimization algorithms from this framework? In particular, can we estimate $\sigma^2$ instead of guessing its value?

We will start from here:

$$\begin{align*} \log p(\v{\theta}|\mathcal{B}, \v{\theta}_t) &\smile \underbrace{\log p(\mathcal{B}|\v{\theta})}_\text{likelihood} + \underbrace{\log p(\v{\theta}|\v{\theta}_t)}_\text{prior} \end{align*}$$

In our derivation of SGD, we used a linear approximation of $\mathcal{L}(\v{\theta}) = -\log p(\mathcal{B}|\v{\theta})$. We will see that we can actually do better using a quadratic approximation. Indeed, Taylor's theorem allows the following approximation:

$$-\log p(\mathcal{B}|\v{\theta}) \approx -\log p(\mathcal{B}|\v{\theta}_t) + \v{g}^\top (\v{\theta}-\v{\theta}_t) + \frac{1}{2}(\v{\theta} - \v{\theta}_t)^\top \v{H} (\v{\theta} - \v{\theta}_t)$$

with:

  • $\v{g} \triangleq -\nabla \log p(\mathcal{B}|\v{\theta}_t)$, the gradient of the loss $\mathcal{L}$ at $\v{\theta}_t$
  • $\v{H} \triangleq -\frac{\partial^2}{\partial \v{\theta}^2} \log p(\mathcal{B}|\v{\theta}_t)$, its Hessian

(The minus signs make $\v{g}$ and $\v{H}$ the gradient and Hessian of the loss rather than of the log-likelihood, which keeps the signs consistent with the update rules derived below.)

Unfortunately, computing $\v{H}$ is intractable when there are many parameters. Fortunately, we will see that we can get an estimate of $\mathbb{E}[\v{H}]$. Indeed:

$$\begin{align*} \mathbb{E}[\v{H}] &= -\mathbb{E}\left[\frac{\partial^2}{\partial \v{\theta}^2} \log p(\mathcal{B}|\v{\theta})\right] \\ &= -\mathbb{E}\left[\frac{\partial}{\partial \v{\theta}} \frac{\frac{\partial}{\partial \v{\theta}} p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})}\right] \\ &= \mathbb{E}\left[\frac{\frac{\partial}{\partial \v{\theta}} p(\mathcal{B}|\v{\theta})\, \frac{\partial}{\partial \v{\theta}} p(\mathcal{B}|\v{\theta})^\top - \frac{\partial^2}{\partial \v{\theta}^2} p(\mathcal{B}|\v{\theta})\, p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})^2}\right] \\ &= \mathbb{E}\left[\left(\frac{\frac{\partial}{\partial \v{\theta}} p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})}\right)\left(\frac{\frac{\partial}{\partial \v{\theta}} p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})}\right)^\top - \frac{\frac{\partial^2}{\partial \v{\theta}^2} p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})}\right] \\ &= \mathbb{E}\left[\v{g}\v{g}^\top\right] - \int p(\mathcal{B}|\v{\theta}) \frac{\frac{\partial^2}{\partial \v{\theta}^2} p(\mathcal{B}|\v{\theta})}{p(\mathcal{B}|\v{\theta})}\, d\mathcal{B} \\ &= \mathbb{E}\left[\v{g}\v{g}^\top\right] - \underbrace{\frac{\partial^2}{\partial \v{\theta}^2} \underbrace{\int p(\mathcal{B}|\v{\theta})\, d\mathcal{B}}_{=1}}_{=0} \\ &= \mathbb{E}\left[\v{g}\v{g}^\top\right] \end{align*}$$

where the expectation is taken over batches $\mathcal{B} \sim p(\mathcal{B}|\v{\theta})$. (Going from the fourth to the fifth line, the minus sign in the definition of $\v{g}$ disappears in the outer product $\v{g}\v{g}^\top$.)
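This is the classical Fisher information identity, and it can be checked by Monte Carlo on a toy model. A numpy sketch, where the model $x \sim \mathcal{N}(\theta, 1)$ and all numbers are chosen purely for illustration:

```python
import numpy as np

# Toy check of E[H] = E[g g^T] for x ~ N(theta, 1).
# With g the gradient of the negative log-likelihood: g = theta - x,
# and H = -d^2/dtheta^2 log p(x|theta) = 1 exactly.
rng = np.random.default_rng(0)
theta = 0.7
x = rng.normal(theta, 1.0, size=1_000_000)

g = theta - x
fisher_mc = np.mean(g**2)  # Monte Carlo estimate of E[g g^T]
hessian = 1.0              # exact value of E[H] for this model
```

The Monte Carlo estimate of the squared gradient matches the exact Hessian up to sampling noise.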

Defining $\v{G} \triangleq \mathbb{E}\left[\v{g}\v{g}^\top\right]$, our approximation becomes:

$$-\log p(\mathcal{B}|\v{\theta}) \approx -\log p(\mathcal{B}|\v{\theta}_t) + \v{g}^\top (\v{\theta}-\v{\theta}_t) + \frac{1}{2}(\v{\theta} - \v{\theta}_t)^\top \v{G} (\v{\theta} - \v{\theta}_t)$$

and

$$\begin{align*} p(\mathcal{B}|\v{\theta}) &\propto \exp\left(-\v{g}^\top (\v{\theta}-\v{\theta}_t) - \frac{1}{2}(\v{\theta} - \v{\theta}_t)^\top \v{G} (\v{\theta} - \v{\theta}_t)\right) \\ &\propto \exp\left(-\frac{1}{2}(\v{\theta} - \v{\theta}^*)^\top \v{G} (\v{\theta} - \v{\theta}^*)\right) \end{align*}$$

with $\v{\theta}^* = \v{\theta}_t - \v{G}^{-1} \v{g}$ (the proof, by completing the square, is left to the reader or to ChatGPT). This means that $p(\mathcal{B}|\v{\theta})$ is approximately Gaussian as a function of $\v{\theta}$.

It follows that $p(\v{\theta}|\mathcal{B}, \v{\theta}_t) = \frac{p(\mathcal{B}|\v{\theta})\, p(\v{\theta}|\v{\theta}_t)}{p(\mathcal{B}|\v{\theta}_t)}$ is also close to a Gaussian (as a product of Gaussians), with inverse covariance:

$$\v{\Sigma}^{-1} = \v{G} + \v{\Sigma}^{-1}_t$$

with $\v{\Sigma}_t$ the covariance matrix of $p(\v{\theta}|\v{\theta}_t)$. (To see this, multiply the two Gaussian densities in the numerator; their inverse covariances add up in the exponent.)

And mean:

$$\begin{align*} \v{\mu} &= \v{\Sigma} \left(\v{G} (\v{\theta}_t - \v{G}^{-1} \v{g}) + \v{\Sigma}_t^{-1} \v{\theta}_t\right) \\ &= \v{\Sigma} \big(\underbrace{(\v{G} + \v{\Sigma}_t^{-1})}_{\v{\Sigma}^{-1}} \v{\theta}_t - \v{g}\big) \\ &= \v{\theta}_t - \v{\Sigma} \v{g} \end{align*}$$
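These product-of-Gaussians formulas can be verified in one dimension with a brute-force grid (a numpy sketch; the precisions, mean, and gradient are made-up numbers):

```python
import numpy as np

# Scalar instance: likelihood ~ N(theta_star, 1/G), prior ~ N(theta_t, sigma_t2).
G, sigma_t2, theta_t, g = 4.0, 0.5, 1.0, 0.3

Sigma = 1.0 / (G + 1.0 / sigma_t2)  # posterior variance: precisions add
mu = theta_t - Sigma * g            # posterior mean from the derivation

theta_star = theta_t - g / G        # mean of the (approximate) likelihood
thetas = np.linspace(-2.0, 3.0, 500_001)
log_post = -0.5 * G * (thetas - theta_star) ** 2 \
           - 0.5 * (thetas - theta_t) ** 2 / sigma_t2
mode = thetas[np.argmax(log_post)]  # mode of the product of the two Gaussians
```

The grid mode agrees with the closed-form mean up to the grid resolution.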

To account for our approximation errors, we can add a small term $\eta^2 \v{I}$ to the covariance:

$$\v{\Sigma}_{t+1} = (\v{G} + \v{\Sigma}^{-1}_t)^{-1} + \eta^2 \v{I}$$

This modeling choice (quadratic approximation plus added variance) defines a new Bayesian optimizer:

$$\begin{align*} \v{G}_{t+1} &= \beta \v{G}_t + (1-\beta)\, \v{g}\v{g}^\top \\ \v{\Sigma}_{t+1} &= (\v{G}_{t+1} + \v{\Sigma}_t^{-1})^{-1} + \eta^2 \v{I} \\ \v{\theta}_{t+1} &= \v{\theta}_t - \v{\Sigma}_{t+1} \v{g} \end{align*}$$

However, even in this form, the algorithm is not usable because the matrix $\v{G}$ is generally too big to store. But if we suppose that the gradients are approximately centered and uncorrelated across parameters, the off-diagonal terms vanish, and we can estimate $\v{G}$ as a diagonal matrix:

$$\v{G} \approx \text{diag}(\mathbb{E}[\v{g}^2_1], \dots, \mathbb{E}[\v{g}^2_d])$$

The algorithm can now be written component-wise. Denote by $\v{s}$ the vector of diagonal elements of $\v{G}$, estimated by an exponential moving average, and by $\v{\sigma}^2$ the diagonal of $\v{\Sigma}$ (all operations below are element-wise):

$$\begin{align*} \v{s}_t &= \beta \v{s}_{t-1} + (1-\beta)\, \v{g}^2 & \text{update of the moving average of squared gradients} \\ \v{\sigma}^2_t &= \left(\v{s}_t + \frac{1}{\v{\sigma}^2_{t-1}}\right)^{-1} + \eta^2 & \text{update of the variance of the belief distribution} \\ \v{\theta}_t &= \v{\theta}_{t-1} - \v{\sigma}^2_t\, \v{g} & \text{update of the mean of the belief distribution} \end{align*}$$
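Here is the element-wise Bayesian optimizer as a short numpy loop, run on a toy ill-conditioned quadratic loss. All numbers (curvatures, $\beta$, $\eta$, initial values) are made up for illustration:

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * sum(a * theta**2).
a = np.array([10.0, 1.0])        # per-coordinate curvature (ill-conditioned)
theta = np.array([1.0, 1.0])
s = np.zeros(2)                  # moving average of squared gradients
sigma2 = np.full(2, 0.1)         # per-parameter variance of the belief
beta, eta = 0.9, 0.01

for _ in range(2000):
    g = a * theta                                # "batch" gradient of the loss
    s = beta * s + (1 - beta) * g**2             # EMA of squared gradients
    sigma2 = 1.0 / (s + 1.0 / sigma2) + eta**2   # variance update
    theta = theta - sigma2 * g                   # mean update

loss = 0.5 * np.sum(a * theta**2)  # starts at 5.5, ends near zero
```

The per-parameter variance adapts to the curvature through the squared-gradient statistics, so both coordinates make progress despite the conditioning gap.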

Linking the new optimizer to Adam

This may not look much like the RMSProp update rule yet. Let's push the analysis a bit further to see how the two are connected.

To see to what value the covariance converges, we can search for a fixed point satisfying:

$$\v{\Sigma} = (\v{G} + \v{\Sigma}^{-1})^{-1} + \eta^2 \v{I}$$

In our simplified optimizer, we replaced $\v{G}$ by $\v{s}$ and $\v{\Sigma}$ by $\v{\sigma}^2$, and all the operations are element-wise, so we can solve this equation using scalar arithmetic:

$$\begin{align*} \v{\sigma}^2 &= \frac{1}{\v{s} + \frac{1}{\v{\sigma}^2}} + \eta^2 \\ \left(\v{s} + \cancel{\frac{1}{\v{\sigma}^2}}\right) \v{\sigma}^2 &= \cancel{1} + \eta^2\left(\v{s} + \frac{1}{\v{\sigma}^2}\right) \\ \v{s}\v{\sigma}^4 &= \eta^2 \v{s} \v{\sigma}^2 + \eta^2 \\ \v{s}\v{\sigma}^4 - \eta^2 \v{s} \v{\sigma}^2 - \eta^2 &= 0 \end{align*}$$

This is a second-order polynomial in $\v{\sigma}^2$; let's compute its discriminant:

$$\triangle = \eta^4 \v{s}^2 + 4 \v{s} \eta^2 = \eta^2 \v{s} (\eta^2 \v{s} + 4)$$

And the unique positive solution:

$$\v{\sigma}^2 = \frac{\eta^2 \v{s} + \sqrt{\eta^2 \v{s} (\eta^2 \v{s} + 4)}}{2\v{s}} = \frac{\eta^2}{2} + \eta\sqrt{\frac{\eta^2 \v{s} / 4 + 1}{\v{s}}}$$
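The fixed point can be checked numerically by simply iterating the variance update until convergence (a numpy sketch with illustrative numbers):

```python
import numpy as np

# Iterate  sigma2 <- 1/(s + 1/sigma2) + eta^2  to its fixed point.
s, eta = 3.0, 0.05
sigma2 = 1.0
for _ in range(10_000):
    sigma2 = 1.0 / (s + 1.0 / sigma2) + eta**2

# Closed-form solution derived above.
closed_form = eta**2 / 2 + eta * np.sqrt((eta**2 * s / 4 + 1) / s)
```

The iteration is a contraction, so it converges to the closed-form value from any positive start.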

We can also obtain a similar result without supposing $\v{G}$ and $\v{\Sigma}$ diagonal, using the full matrices. For completeness, the derivation follows, even if the resulting algorithm is probably impractical.

We search for $\v{\Sigma}$ such that:

$$\begin{align*} \v{\Sigma} &= (\v{G} + \v{\Sigma}^{-1})^{-1} + \eta^2 \v{I} \\ \v{\Sigma}(\v{G} + \cancel{\v{\Sigma}^{-1}}) &= \cancel{\v{I}} + \eta^2 (\v{G} + \v{\Sigma}^{-1}) \\ \v{\Sigma}^2 &= \eta^2 (\v{\Sigma} + \v{G}^{-1}) & \text{multiplying by}~\v{\Sigma}~\text{on the left \&}~\v{G}^{-1}~\text{on the right} \end{align*}$$

Note that we have supposed $\v{G}$ invertible. Now, substituting $\v{\Sigma}^2$ with $\left(\left(\v{\Sigma} - \frac{\eta^2}{2}\v{I}\right) + \frac{\eta^2}{2}\v{I}\right)^2$ (inspired by the canonical form of a second-order polynomial), we get:

$$\begin{align*} \left(\left(\v{\Sigma} - \frac{\eta^2}{2}\v{I}\right) + \frac{\eta^2}{2}\v{I}\right)^2 &= \eta^2 (\v{\Sigma} + \v{G}^{-1}) \\ \left(\v{\Sigma} - \frac{\eta^2}{2}\v{I}\right)^2 + \frac{\eta^4}{4}\v{I} + \eta^2\left(\cancel{\v{\Sigma}} - \frac{\eta^2}{2}\v{I}\right) &= \eta^2 (\cancel{\v{\Sigma}} + \v{G}^{-1}) \\ \left(\v{\Sigma} - \frac{\eta^2}{2}\v{I}\right)^2 + \frac{\eta^4}{4}\v{I} - \frac{\eta^4}{2}\v{I} &= \eta^2 \v{G}^{-1} \\ \left(\v{\Sigma} - \frac{\eta^2}{2}\v{I}\right)^2 &= \eta^2 \left(\v{G}^{-1} + \frac{\eta^2}{4}\v{I}\right) \\ \v{\Sigma} &= \frac{\eta^2}{2}\v{I} + \eta \left(\v{G}^{-1} + \frac{\eta^2}{4}\v{I}\right)^{1/2} \end{align*}$$

The fixed point of the covariance matrix is:

$$\boxed{\v{\Sigma} = \frac{\eta^2}{2}\v{I} + \eta \left(\v{G}^{-1} + \frac{\eta^2}{4}\v{I}\right)^{1/2}}$$
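The boxed matrix fixed point can be verified numerically on a random PSD matrix (a numpy sketch; the matrix size, seed, and $\eta$ are made up for illustration):

```python
import numpy as np

def psd_sqrt(m):
    """Symmetric PSD square root via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
G = a @ a.T + 0.1 * np.eye(4)  # invertible PSD matrix playing E[g g^T]
eta = 0.05

# Closed-form fixed point, then plug it back into the covariance update.
Sigma = eta**2 / 2 * np.eye(4) + eta * psd_sqrt(np.linalg.inv(G) + eta**2 / 4 * np.eye(4))
residual = Sigma - (np.linalg.inv(G + np.linalg.inv(Sigma)) + eta**2 * np.eye(4))
```

The residual is zero up to floating-point error, since $\v{\Sigma}$ is a function of $\v{G}$ and all terms commute.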

Back to the scalar solution: when $\eta$ is small enough (i.e., our quadratic approximation is good), we can neglect the term $\eta^2 \v{s} / 4$ compared to $1$, leading to the following approximation:

$$\v{\sigma}^2 \approx \frac{\eta^2}{2} + \eta\frac{1}{\sqrt{\v{s}}}$$

Again, when $\eta$ is small, we can neglect the first term. Finally, we get:

$$\v{\sigma}^2 \approx \eta\frac{1}{\sqrt{\v{s}}}$$

Plugging this into our parameter update rule, we get the following approximation of the Bayesian optimizer described before:

$$\begin{align*} \v{s}_t &= \beta \v{s}_{t-1} + (1-\beta)\, \v{g}^2 & \text{update of the moving average of squared gradients} \\ \v{\theta}_t &= \v{\theta}_{t-1} - \eta\frac{\v{g}}{\sqrt{\v{s}_t}} & \text{update of the mean of the belief distribution} \end{align*}$$

This is the RMSProp optimizer update. It has a benefit over the full Bayesian optimizer: it needs one less statistic to be stored, reducing memory usage.
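For reference, the resulting update is a few lines of numpy. This is a sketch rather than the full RMSProp of deep learning libraries (which also add a small epsilon for numerical stability, as below, and sometimes momentum); the toy loss and all numbers are illustrative:

```python
import numpy as np

# RMSProp sketch on the toy quadratic loss L(theta) = 0.5 * sum(a * theta**2).
a = np.array([10.0, 1.0])
theta = np.array([1.0, -1.0])
s = np.ones(2)                  # moving average of squared gradients
beta, eta, eps = 0.9, 0.01, 1e-8

for _ in range(2000):
    g = a * theta                              # gradient of the loss
    s = beta * s + (1 - beta) * g**2           # EMA of squared gradients
    theta = theta - eta * g / (np.sqrt(s) + eps)

loss = 0.5 * np.sum(a * theta**2)  # starts at 5.5, ends near zero
```

Note that the per-coordinate step size $\eta/\sqrt{\v{s}}$ is exactly the variance fixed point derived above.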

Choosing $\eta$

The parameter $\eta$ was introduced to account for the fact that we are using approximations, which reduces our confidence in the estimates. It is interesting to study the optimizer in different regimes.

Large $\eta$

When $\eta$ is large (i.e., we have little confidence in our approximation) and $\eta^2 \v{s} / 4 \gg 1$, we can make the following approximation:

$$\begin{align*} \v{\sigma}^2 &= \frac{\eta^2}{2} + \eta\sqrt{\frac{\eta^2 \v{s} / 4 + 1}{\v{s}}} \\ &\approx \frac{\eta^2}{2} + \eta\sqrt{\frac{\eta^2 \v{s} / 4}{\v{s}}} \\ &= \frac{\eta^2}{2} + \frac{\eta^2}{2} = \eta^2 \end{align*}$$

and the update rule is:

$$\v{\theta}_t = \v{\theta}_{t-1} - \eta^2 \v{g}$$

This is the SGD algorithm (with learning rate $\eta^2$). Thus, both RMSProp and SGD can be seen as approximations of the same general Bayesian optimizer in two different regimes.
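Numerically, when $\eta^2 \v{s} / 4 \gg 1$, the exact fixed point is indeed close to $\eta^2$ (a quick check with made-up numbers):

```python
import numpy as np

# Large-eta regime: eta^2 * s / 4 = 50 >> 1.
s, eta = 2.0, 10.0
sigma2 = eta**2 / 2 + eta * np.sqrt((eta**2 * s / 4 + 1) / s)  # exact fixed point
```

Here the exact value deviates from $\eta^2 = 100$ by well under one percent.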

Small $\eta$

We already supposed that $\eta$ is small, but what happens when $\eta \rightarrow 0$? In this case, the fixed point also converges to zero, and we cannot use the previous approach directly. Instead, we can unroll the full update step (with $\eta = 0$, the $\eta^2$ term disappears from the covariance update):

$$\begin{align*} \v{\Sigma}_t^{-1} &= \v{G} + \v{\Sigma}_{t-1}^{-1} \\ \v{\Sigma}_t^{-1} &= t \v{G} + \v{\Sigma}_0^{-1} \\ \v{\Sigma}_t &\approx \frac{1}{t} \v{G}^{-1} \qquad \text{when}~t \rightarrow \infty \end{align*}$$

$$\v{\theta}_t = \v{\theta}_{t-1} - \frac{1}{t} \v{G}^{-1} \v{g}_t$$

Interestingly, this is exactly the well-known natural gradient descent algorithm with a learning rate decreasing as $1/t$: a sort of "online" natural gradient optimizer. For reference, the usual natural gradient update rule is the same, except that there $t$ typically represents the total number of samples in the dataset. The variance decreases as $\frac{1}{t}\v{G}^{-1}$, which is expected: since we gather one datapoint at a time, our estimator matches the convergence rate of the Cramér-Rao bound.
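For the simplest possible model, estimating the mean of a unit-variance Gaussian (where the Fisher information is $\v{G} = 1$), this online natural gradient step reduces exactly to the running sample average, which is the efficient estimator. A numpy sketch with made-up numbers:

```python
import numpy as np

# Online natural gradient with a 1/t learning rate for x ~ N(mu_true, 1).
# Per-sample loss gradient at theta given sample x: g = theta - x; G = 1.
rng = np.random.default_rng(0)
mu_true = 2.5
xs = rng.normal(mu_true, 1.0, size=10_000)

theta = 0.0
for t, x in enumerate(xs, start=1):
    g = theta - x                  # gradient of the per-sample loss
    theta = theta - (1.0 / t) * g  # natural gradient step with G = 1
```

After $t$ samples, `theta` equals the sample mean of the first $t$ observations, whose variance shrinks as $1/t$.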

Conclusion

In this article, I provided a hopefully more compelling explanation of the success of the Adam optimizer in deep learning than the usual regret-based analyses. It changes the interpretation of RMSProp from an "optimization" algorithm searching for a good solution through a loss landscape to a "learning procedure" that incorporates information efficiently by updating a belief using Bayes' rule. From this perspective, Adam is probably close to optimal as a quadratic extension of SGD.


References

Aitchison, L. (2020). Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 18173–18182). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/d33174c464c877fb03e77efdab4ae804-Paper.pdf
Hinton, G. (2012). rmsprop: Divide the gradient by a running average of its recent magnitude. Neural Networks for Machine Learning, lecture 6 slides (pp. 26–31). https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Citing this blog post

@misc{plumerault2026adam,
  author = {Plumerault Antoine},
  title = {Why is the Adam Optimizer Working so Well?},
  year = {2026},
}